ANNOUNCE: text 0.8.0.0, fast Unicode text support

New in this release:

- Substantial performance improvements.
- Bug fixes, and better quality assurance via automated QuickCheck and HPC tests and coverage.
- An even nicer API than before.

The text library provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework. The 'Text' type represents Unicode character strings in a time- and space-efficient manner. This package provides text processing capabilities that are optimized for performance-critical use, both in terms of large data quantities and high speed.

The text package provides character-encoding and type-safe case conversion via whole-string case conversion functions. It also provides a range of functions for converting Text values to and from ByteStrings, using several standard encodings (see the text-icu package for a much larger variety of encoding functions). Efficient locale-sensitive support for text I/O is also provided.

http://www.serpentine.com/blog/2010/09/01/major-version-of-the-haskell-text-...

Is there a summary of the API changes available? I see a new module, but Precis is choking on Data.Text and Data.Text.Lazy, so I'm not sure what existing signatures have been modified.
Don't forget, you can always improve the text library yourself. I love to receive patches, requests for improvement, and bug reports.
Are there any areas in particular you'd like help with, for either library? I'm happy to assist any effort which will help reduce use of String. [aside] does anybody know how to get a list of what packages somebody's uploaded to Hackage? I think I've updated all mine for the new "text" version dependency, but I'm worried I forgot some.

On 1 September 2010 15:14, John Millikin
[aside] does anybody know how to get a list of what packages somebody's uploaded to Hackage? I think I've updated all mine for the new "text" version dependency, but I'm worried I forgot some.
Unfortunately, no; I've often wished the same thing (not for my packages, but because I could have sworn someone uploaded a package that did something, I remembered the maintainer but not the name of the package...). -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Wed, Sep 1, 2010 at 07:14, John Millikin
Don't forget, you can always improve the text library yourself. I love to receive patches, requests for improvement, and bug reports.
Are there any areas in particular you'd like help with, for either library? I'm happy to assist any effort which will help reduce use of String.
As a Haskell noob I'm curious about this statement, is there something intrinsically wrong with String? Or is it more a performance/resource problem when dealing with large amounts of text for example? (Like having to use StringBuilder in Java if you want to avoid the penalty of repeated String allocations when simply concatenating for example) Cheers, -Tako

2010/9/1 Tako Schotanus
As a Haskell noob I'm curious about this statement, is there something intrinsically wrong with String?
String is just a linked list of Char, which are Unicode code points; that is probably not the optimal way to store text. For intensive use of text it takes too much memory, and/or it's not fast enough. Sometimes I'd love it if I could program using String and the compiler would automatically convert that to Text or ByteStrings, but of course it's wishful thinking. David.
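The point that String really is just [Char] can be seen directly: ordinary list functions work on it without any conversion. A tiny sketch (the function names here are made up for illustration):

```haskell
import Data.Char (toUpper)

-- String is a type synonym for [Char], so list functions apply directly.
shout :: String -> String
shout = map toUpper

-- Pattern matching and list combinators work exactly as for any other list.
initials :: [String] -> String
initials = map head . filter (not . null)

main :: IO ()
main = do
  putStrLn (shout "hello")                   -- prints HELLO
  putStrLn (initials ["tako", "schotanus"])  -- prints ts
```

This convenience is exactly what makes String pleasant for small jobs, while the per-character cons cells are what make it expensive at scale.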

On 1 September 2010 16:27, David Virebayre
2010/9/1 Tako Schotanus
: As a Haskell noob I'm curious about this statement, is there something intrinsically wrong with String?
String is just a linked list of Char which are unicode code points; which is probably not the optimal way to store text. For intensive use of text it takes too much memory, and/or it's not fast enough.
Sometimes I'd love if I could program using String and the compiler would automatically convert that to Text, or Bytestrings, but of course it's wishful thinking.
Well, there's OverloadedStrings for String literals... But the problem with automagic conversion would be one of encoding, for ByteStrings at least.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com
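For reference, a minimal sketch of what OverloadedStrings buys you, assuming the text package is installed (the literal desugars to fromString, so it can be typed as Text with no explicit pack):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- With OverloadedStrings, a string literal has type (IsString a => a),
-- so it can be used directly where Text is expected.
greeting :: T.Text
greeting = "hello, café"

main :: IO ()
main = print (T.length greeting)  -- prints 11
```

Note this only covers literals; runtime conversion between String, Text, and ByteString still needs explicit (and, for ByteString, encoding-aware) functions.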

On 09/01/2010 02:44 AM, Ivan Lazar Miljenovic wrote:
On 1 September 2010 16:27, David Virebayre wrote:
Sometimes I'd love if I could program using String and the compiler would automatically convert that to Text, or Bytestrings, but of course it's wishful thinking.
Well, there's OverloadedStrings for String literals...
But the problem with automagic conversion would be one of encoding for Bytestrings at least.
This reminds me: it would be nice if Text, ByteString, etc. exported their own definitions of "type String". As most users have to import them qualified to avoid collisions with Data.List (and the Prelude?) anyway, this would enable someone to import their desired module as, say, S, then use type S.String and string ops S.foo. Anyone who *did* use them unqualified would have to add (hiding (String)) to their imports, of course.

The flip side of this would be a module which contained the definition for String and re-exported Data.List, so that the above S.mumble could work for String as well. (The obvious choice would be Data.String, but that's taken for IsString, and of course that would be required to pull this off, so it can't just have the list implementation stuffed into it.)
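The proposal above might look something like this hypothetical wrapper module (SText and its export list are my invention for illustration, not an actual text API):

```haskell
-- Hypothetical module sketching the proposal: export a local
-- `type String` so callers importing qualified as S get S.String
-- and S.toUpper with uniform names across string libraries.
module SText (String, toUpper) where

import Prelude hiding (String)
import qualified Data.Text as T

type String = T.Text

toUpper :: String -> String
toUpper = T.toUpper
```

A caller would then write `import qualified SText as S` and use `S.String` in signatures, switching string representations by changing one import.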

Hi Tako,
The issues involved with String, ByteString, Text and a few related
libraries were discussed at great length recently in this thread:
http://groups.google.com/group/haskell-cafe/browse_thread/thread/52a21cf61ff...
Basically, Chars are 32 bit integers and Strings are represented as a
list of Chars.
This is very convenient for small computations but often very
inefficient for anything large scale.
The String API is also missing various encoding related features.
Because of the limitations of String, various alternative libraries
have been proposed. Text is one important option.
You'll find much more detail on the above referenced thread.
Kevin
On Sep 1, 8:13 am, Tako Schotanus
On Wed, Sep 1, 2010 at 07:14, John Millikin
wrote: Don't forget, you can always improve the text library yourself. I love to receive patches, requests for improvement, and bug reports.
Are there any areas in particular you'd like help with, for either library? I'm happy to assist any effort which will help reduce use of String.
As a Haskell noob I'm curious about this statement, is there something intrinsically wrong with String? Or is it more a performance/resource problem when dealing with large amounts of text for example? (Like having to use StringBuilder in Java if you want to avoid the penalty of repeated String allocations when simply concatenating for example)
Cheers, -Tako
_______________________________________________
Haskell-Cafe mailing list
Haskell-C...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi Kevin,
thanks for the pointer; although I was aware of the thread and had
followed it quite closely, it was quite interesting.
But it never explained if and why String should be avoided; all I read is
"test and decide depending on the circumstances", which in itself is good
advice, but I'd like to have an idea of the reasons so I can form an idea
before actually having to code any benchmarks :)
Knowing that String literally is a linked list of Char makes it a lot
clearer. I figured that maybe Haskell could be using some more efficient
mechanism for Strings internally, only treating it outwardly as a [Char].
But I guess that in a lot of circumstances where you're just working with
small pieces of text in non-performance critical code it's perfectly okay to
use String.
Cheers,
-Tako
On Wed, Sep 1, 2010 at 08:31, Kevin Jardine
Hi Tako,
The issues involved with String, ByteString, Text and a few related libraries were discussed at great length recently in this thread:
http://groups.google.com/group/haskell-cafe/browse_thread/thread/52a21cf61ff...
Basically, Chars are 32 bit integers and Strings are represented as a list of Chars.
This is very convenient for small computations but often very inefficient for anything large scale.
The String API is also missing various encoding related features.
Because of the limitations of String, various alternative libraries have been proposed. Text is one important option.
You'll find much more detail on the above referenced thread.
Kevin
On Wed, Sep 1, 2010 at 07:14, John Millikin
wrote: Don't forget, you can always improve the text library yourself. I love to receive patches, requests for improvement, and bug reports.
Are there any areas in particular you'd like help with, for either library? I'm happy to assist any effort which will help reduce use of String.
On Sep 1, 8:13 am, Tako Schotanus wrote:
As a Haskell noob I'm curious about this statement, is there something intrinsically wrong with String? Or is it more a performance/resource problem when dealing with large amounts of text for example? (Like having to use StringBuilder in Java if you want to avoid the penalty of repeated String allocations when simply concatenating for example)
Cheers, -Tako

On Tue, Aug 31, 2010 at 10:14 PM, John Millikin
Is there a summary of the API changes available? I see a new module, but Precis is choking on Data.Text and Data.Text.Lazy, so I'm not sure what existing signatures have been modified.
Ouch. I'll try to do a diff and follow up, probably not until much later.
Are there any areas in particular you'd like help with, for either library? I'm happy to assist any effort which will help reduce use of String.
For the text library, my main interest is performance. I've been far more focused on making the code correct and complete than on making it fast, but now that the API is stable and the test coverage is solid, performance deserves more attention. What specifics would be useful?
- Reproducible, realistic (but preferably small) benchmarks
- Patches that speed things up
Thanks for asking!
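A reproducible benchmark of the kind asked for here could be sketched with the criterion library (the benchmarked operations and the synthetic corpus are arbitrary examples of mine, not an agreed suite):

```haskell
import Criterion.Main
import qualified Data.Text as T

-- Synthetic sample input; a realistic benchmark should use real corpora,
-- such as the big.txt file mentioned elsewhere in this thread.
corpus :: T.Text
corpus = T.replicate 1000 (T.pack "the quick brown fox ")

main :: IO ()
main = defaultMain
  [ bench "toUpper" $ nf T.toUpper corpus
  , bench "words"   $ nf T.words   corpus
  ]
```

Using nf (rather than whnf) forces the full result, which matters for lazy structures; otherwise the benchmark would measure only the cost of producing the outermost constructor.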

On Wednesday 01 September 2010 06:08:07, Bryan O'Sullivan wrote:
New in this release:
- Substantial performance improvements.
- Bug fixes, and better quality assurance via automated QuickCheck and HPC tests and coverage.
- An even nicer API than before.
Good job.
The text library provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework.
The 'Text' type represents Unicode character strings, in a time and space-efficient manner. This package provides text processing capabilities that are optimized for performance critical use, both in terms of large data quantities and high speed.
Hmm, I have to throw a few grains of salt into that.

File: big.txt from Peter Norvig's spelling checker (6.2MiB)

1. Read the file and output it to stdout (redirected to /dev/null):

   ByteString.Lazy:       <= 0.01s
   text-0.7.2.1 (Lazy):      0.68s
   text-0.8.0.0 (Lazy):      0.26s
   String:                   0.35s

ByteString doesn't do any de/encoding or validity checking, so it's clear that it can be way faster than I/O which does. The interesting part is the comparison between text and vanilla String I/O; the difference is smaller than I expected for text-0.8.0.0. It came as a surprise that text-0.7.2.1 actually had much worse performance than vanilla Strings.

Okay, now we know how long the I/O takes, let's measure some work.

2. Read the file, replace some string, output to stdout (redirected to /dev/null):

a) "Professor" -> "Teacher"
   stringsearch (Lazy):   0.02s,  2MB total
   text-0.7.2.1 (Lazy):   0.92s, 28MB total
   text-0.7.2.1 (Strict): 0.90s, 65MB total
   text-0.8.0.0 (Lazy):   0.44s, 12MB total
   text-0.8.0.0 (Strict): 0.36s, 55MB total
   String (naive):        0.51s,  1MB total
   String (KMP):          0.46s,  1MB total

b) "deteriorate" -> "worsen"
   stringsearch (Lazy):   0.02s,  2MB total
   text-0.7.2.1 (Lazy):   1.01s, 39MB total
   text-0.7.2.1 (Strict): 0.91s, 65MB total
   text-0.8.0.0 (Lazy):   0.45s, 17MB total
   text-0.8.0.0 (Strict): 0.36s, 55MB total
   String (naive):        0.52s,  1MB total
   String (KMP):          0.50s,  1MB total

c) "hen" -> "here"
   stringsearch (Lazy):   0.04s,  2MB total
   text-0.7.2.1 (Lazy):   1.10s,  2MB total
   text-0.7.2.1 (Strict): 0.95s, 65MB total
   text-0.8.0.0 (Lazy):   0.66s,  2MB total
   text-0.8.0.0 (Strict): 0.41s, 55MB total
   String (naive):        0.53s,  1MB total
   String (KMP):          0.51s,  1MB total

d) "kre" -> "here"
   stringsearch (Lazy):   0.03s,  2MB total
   text-0.7.2.1 (Lazy):   1.23s, 39MB total
   text-0.7.2.1 (Strict): 0.96s, 65MB total
   text-0.8.0.0 (Lazy):   0.65s, 17MB total
   text-0.8.0.0 (Strict): 0.40s, 55MB total
   String (naive):        0.51s,  1MB total
   String (KMP):          0.47s,  1MB total

The performance of text-0.8.0.0 has improved significantly over that of text-0.7.2.1 (for the tested features); the improvement of the replacing algorithm is, however, not as impressive as that of I/O. Replacing isn't much faster than for Strings, and for (some?) very short patterns it's even slower. That's not so good.

What's *really* bad is the space behaviour. My tests suggest that the space usage (for Text.Lazy) depends on the index of the pattern to replace: the later it appears (if at all), the more memory is used by text. Apparently it doesn't deliver the beginning before a match has been found (or the whole string has been scanned). That space leak ought to be fixed. The space behaviour of the strict variant doesn't show that problem, it seems to only depend on the file, but it uses disproportionally much memory.

Hi, Daniel - Thanks for taking the new code for a test drive!
The interesting part is the comparison between text and vanilla String I/O, the difference is smaller than I expected for text-0.8.0.0.
Yes. Much of this is due to the new encoding stuff on Handles in GHC 6.12, which is slow. Its performance wasn't so noticeable when it was only shipping String around, but it's much more visible with Text. It's far slower on a Mac than on Linux, in case that's relevant.
The performance of text-0.8.0.0 has improved significantly over that of text-0.7.2.1 (for the tested features), the improvement of the replacing algorithm is however not as impressive as that of I/O.
I'd bet you that's mostly because the program is I/O bound, in the sense that it's spending time going through the layers of buffering and translation that now make up a Handle. Any improvement in other code is going to be hard to see because of that.

The other major consideration, both for this case and the first one you note, is that the inliner in 6.12 chokes on code that uses stream fusion: it boxes and unboxes vast quantities of state. That kills performance due to both the boxing and unboxing overhead and the increased number of nursery GCs. The marvelous new 6.13 inliner does a *much* better job here - that's where I see those 3x performance improvements for free.

What's *really* bad is the space behaviour.
What are you using to measure that? Also, please don't forget to post your benchmark code when you make observations like this, as that way I can reproduce your measurements and fix problems. I appreciate your help! Regards, Bryan.

On Wednesday 01 September 2010 18:15:19, Bryan O'Sullivan wrote:
Hi, Daniel -
Thanks for taking the new code for a test drive!
The interesting part is the comparison between text and vanilla String I/O, the difference is smaller than I expected for text-0.8.0.0.
Yes. Much of this is due to the new encoding stuff on Handles in GHC 6.12, which is slow. Its performance wasn't so noticeable when it was only shipping String around, but it's much more visible with Text. It's far slower on a Mac than on Linux, in case that's relevant.
I'm on Linux. I guess that's another point in favour of it :) Do you happen to know why it's slower on a Mac?
The performance of text-0.8.0.0 has improved significantly over that of text-0.7.2.1 (for the tested features), the improvement of the replacing algorithm is however not as impressive as that of I/O.
I'd bet you that's mostly because the program is I/O bound, in the sense that it's spending time going through the layers of buffering and translation that now make up a Handle. Any improvement in other code is going to be hard to see because of that.
Maybe. As a rough measure, I took (total time - time for just reading and outputting) for an approximation of the processing (replacing) time. For the first example (Lazy), that yields 0.7.2.1: 0.92s - 0.68s = 0.24s 0.8.0.0: 0.44s - 0.26s = 0.18s, for the fourth 0.7.2.1: 1.23s - 0.68s = 0.55s 0.8.0.0: 0.65s - 0.26s = 0.39s Yes, I know it's very crude, but TIO.readFile file >>= TIO.putStrLn does no less translation than with a replacing in between, or does it? So I tentatively believe most of the difference is spent doing the replacements.
The other major consideration, both for this case and the first one you note, is that the inliner in 6.12 chokes on code that uses stream fusion: it boxes and unboxes vast quantities of state. That kills performance due to both the boxing and unboxing overhead and the increased number of nursery GCs.
I haven't looked at the core, so I take your word for it. I know the behaviour from other situations.
The marvelous new 6.13 inliner does a *much* better job here - that's where I see those 3x performance improvements for free.
That's nice. Maybe I should go get the HEAD before 6.14.1 will be released.
What's *really* bad is the space behaviour.
What are you using to measure that?
+RTS -s and top. The figures I gave were the "total memory in use" of +RTS -s, maximum residency was 20MB (vs. 39 total) for 0.7 and 12MB (vs. 17 total) for 0.8 in the bad tests.
Also, please don't forget to post your benchmark code when you make observations like this,
It's not very interesting,

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO

main :: IO ()
main = do
    (file : pat : sub : _) <- getArgs
    let !spat = T.pack pat
        !ssub = T.pack sub
        work  = T.replace spat ssub
    TIO.readFile file >>= TIO.putStrLn . work

with the obvious changes of imports for the strict Text and the almost obvious changes for the ByteString code.
as that way I can reproduce your measurements and fix problems. I appreciate your help!
I can now say more. Looking at Data.Text.Lazy.replace,

replace s d = intercalate d . split s

I also got a space leak with that for BS.Lazy's intercalate and stringsearch's split. Looking at intercalate,

intercalate t = concat . (Data.List.intersperse t)

that's where you definitely get a space leak, because

intersperse :: a -> [a] -> [a]
intersperse _   []     = []
intersperse _   [x]    = [x]
intersperse sep (x:xs) = x : sep : intersperse sep xs

isn't lazy enough. Given the context, it's not hard to see what's wrong here. Before intersperse produces any output, it checks whether the list contains at least two elements (if any). If a is a list-like type, like String, lazy ByteStrings or lazy Text, where the elements of the list are lazily produced one after the other, as is the case for split, each element must be complete before it can be delivered and then consumed. So indeed, replace needs O(index pat) space :(

I don't think that fixing Data.List.intersperse will fix your space leak, though.

split pat src
    | null pat        = emptyError "split"
    | isSingleton pat = splitBy (== head pat) src
    | otherwise       = go 0 (indices pat src) src
  where
    go _ [] cs = [cs]
    go !i (x:xs) cs =
        let h :*: t = splitAtWord (x-i) cs
        in  h : go (x+l) xs (dropWords l t)
    l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat

Using the list of indices of the pattern, split can't deliver anything before the first (next) occurrence of the pattern has been located (or the end of the string reached). Again, that forces O(index pat) space consumption. I don't see how to avoid that without duplicating a large part of indices' logic in a dedicated breaking/splitting function.

On a related note,

break :: Text -> Text -> (Text, Text)
break pat src
    | null pat  = emptyError "break"
    | otherwise = case indices pat src of
                    []    -> (src, empty)
                    (x:_) -> let h :*: t = splitAtWord x src
                             in  (h, t)

has the same problem.

Cheers, Daniel

On Wednesday 01 September 2010 21:29:47, Daniel Fischer wrote:
that's where you definitely get a space leak, because
intersperse :: a -> [a] -> [a]
intersperse _   []     = []
intersperse _   [x]    = [x]
intersperse sep (x:xs) = x : sep : intersperse sep xs
isn't lazy enough.
Ticket created, http://hackage.haskell.org/trac/ghc/ticket/4282 I'm not keen on subscribing to libraries@ to follow the official proposal process, any takers?
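A lazier formulation of intersperse would look something like the following (this exact version is my sketch, not necessarily the one adopted for the ticket); the point is that it yields the first element before inspecting the rest of the list, so output can stream as the input is produced:

```haskell
-- Lazier intersperse: emit x immediately, then interleave the separator
-- with the remaining elements, never looking ahead for a second element.
lazyIntersperse :: a -> [a] -> [a]
lazyIntersperse _   []     = []
lazyIntersperse sep (x:xs) = x : prependToAll xs
  where
    prependToAll []     = []
    prependToAll (y:ys) = sep : y : prependToAll ys

main :: IO ()
main = do
  putStrLn (lazyIntersperse ',' "abc")                      -- prints a,b,c
  -- Crucially, the first element is available even if the tail diverges:
  putStrLn (take 1 (lazyIntersperse ',' ('a' : undefined)))  -- prints a
```

With the original definition, the second call would hit undefined, because the `[x]` clause forces the tail's constructor before delivering anything.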

On Wed, Sep 1, 2010 at 1:00 PM, Daniel Fischer
I'm not keen on subscribing to libraries@ to follow the official proposal process, any takers?
I'll take it up.

On Wed, Sep 1, 2010 at 12:29 PM, Daniel Fischer
I'm on Linux. I guess that's another point in favour of it:) Do you happen to know why it's slower on a Mac?
I'd guess because of something to do with the system iconv.
So I tentatively believe most of the difference is spent doing the replacements.
I have a Replace.hs benchmark in the main text repo, just to be sure we're talking about the same thing. Factoring out the time spent on I/O, with GHC HEAD, my replace code takes twice the space and time of that in the stringsearch package. Given that the space involved is just 121KB maximum residency while processing a 124MB file, I'm not concerned about it. And the time required isn't a bad place to start from, I think. By the way, as this implies, I can't reproduce your space behaviour at all.

I can now say more. Looking at Data.Text.Lazy.replace,
replace s d = intercalate d . split s
, I also got a space leak with that for BS.Lazy's intercalate and
stringsearch's split.
How did you observe the space leak? Looking at -hT charted with hp2ps shows me nothing, and the program executes in constant space regardless of input size.

On Wednesday 08 September 2010 06:44:32, Bryan O'Sullivan wrote:
I have a Replace.hs benchmark in the main text repo, just to be sure we're talking about the same thing.
Grabbed the darcs repo, compiled with the darcs version and also with the installed text package, both exhibit the same behaviour - what I reported before.
Factoring out the time spent on I/O, with GHC HEAD, my replace code takes twice the space and time of that in the stringsearch package.
That's very good. Twice the space is a given for mostly ASCII text, twice the time is okay, I think. My timings are quite different, but that's probably because 6.12.3's inliner doesn't give the full fusion benefit, so it'll improve automatically with the next GHC release.
Given that the space involved is just 121KB maximum residency while processing a 124MB file, I'm not concerned about it.
I wouldn't either.
And the time required isn't a bad place to start from, I think.
By the way, as this implies, I can't reproduce your space behaviour at all.
That's surprising. Have you made sure to replace a pattern which does not occur in the text? Can you reproduce the behaviour with
a) Data.List.intersperse instead of the lazier version now used,
b) ghc-6.12.* instead of HEAD?

Anyway, I would've thought that with

split pat src
    | null pat        = emptyError "split"
    | isSingleton pat = splitBy (== head pat) src
    | otherwise       = go 0 (indices pat src) src
  where
    go _ [] cs = [cs]
    go !i (x:xs) cs =
        let h :*: t = splitAtWord (x-i) cs
        in  h : go (x+l) xs (dropWords l t)
    l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat

you can't start returning chunks before it's known whether the list of indices is empty, so split would have O(index of pattern) space behaviour. If HEAD manages to make the chunks available before they are complete (before it's known how long they will be), it's even awesomer than I'd have dared to hope. Okay, so I'll have to try HEAD.
I can now say more. Looking at Data.Text.Lazy.replace,
replace s d = intercalate d . split s
, I also got a space leak with that for BS.Lazy's intercalate and
stringsearch's split.
How did you observe the space leak? Looking at -hT charted with hp2ps shows me nothing, and the program executes in constant space regardless of input size.
top and +RTS -s. I've attached the -hT graph of a run on a 74.3MB file with a pattern that does not occur. It shows exactly the behaviour I expected from the code of split, pretty constant growth of the heap until about twice the file size is reached, then fast and pretty constant shrinking as the text is output. The graphs are much more interesting if you do replacements of patterns without large gaps between occurrences.

On Wednesday 08 September 2010 13:46:13, Daniel Fischer wrote:
My timings are quite different, but that's probably because 6.12.3's inliner doesn't give the full fusion benefit, so it'll improve automatically with the next GHC release.
Or maybe not so much. Just built the latest source bundle from the HEAD branch and compared with 6.12.3:

6.12.3:
./nbench lazyText bigfile krkx rabi +RTS -s
  1,796,245,884 bytes allocated in the heap
      1,125,596 bytes copied during GC
    110,398,048 bytes maximum residency (8 sample(s))
     38,897,164 bytes maximum slop
            191 MB total memory in use (4 MB lost due to fragmentation)
  Generation 0:  3043 collections,     0 parallel,  3.06s,  3.17s elapsed
  Generation 1:     8 collections,     0 parallel,  0.00s,  0.01s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.03s  (  6.48s elapsed)
  GC    time    3.07s  (  3.18s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    9.10s  (  9.66s elapsed)
  %GC time      33.7%  (33.0% elapsed)
  Alloc rate    297,965,335 bytes per MUT second
  Productivity  66.3% of total user, 62.4% of total elapsed

6.13.20100831:
./hdbench lazyText bigfile krkx rabi +RTS -s
    543,409,296 bytes allocated in the heap
        699,364 bytes copied during GC
    110,956,008 bytes maximum residency (8 sample(s))
     38,893,040 bytes maximum slop
            191 MB total memory in use (4 MB lost due to fragmentation)
  Generation 0:   652 collections,     0 parallel,  0.44s,  0.43s elapsed
  Generation 1:     8 collections,     0 parallel,  0.00s,  0.01s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.42s  (  5.77s elapsed)
  GC    time    0.44s  (  0.44s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.86s  (  6.21s elapsed)
  %GC time       7.5%  ( 7.1% elapsed)
  Alloc rate    100,327,729 bytes per MUT second
  Productivity  92.5% of total user, 87.2% of total elapsed

Sure, that's a significant improvement, but that's mostly the GC time; with -A64M, 6.12.3 is much closer.

However, for ByteStrings, performance got worse:

6.12.3:
./nbench lazyBS bigfile krkx rabi +RTS -s
     90,127,112 bytes allocated in the heap
         31,116 bytes copied during GC
        103,396 bytes maximum residency (1 sample(s))
         39,964 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)
  Generation 0:   158 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.10s  (  0.20s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.11s  (  0.20s elapsed)
  %GC time       3.6%  ( 1.8% elapsed)
  Alloc rate    834,456,211 bytes per MUT second
  Productivity  92.9% of total user, 50.9% of total elapsed

6.13.20100831:
./hdbench lazyBS bigfile krkx rabi +RTS -s
    478,710,672 bytes allocated in the heap
        164,904 bytes copied during GC
         86,992 bytes maximum residency (1 sample(s))
         44,080 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)
  Generation 0:   864 collections,     0 parallel,  0.00s,  0.01s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.17s  (  0.28s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.18s  (  0.29s elapsed)
  %GC time       2.3%  ( 4.1% elapsed)
  Alloc rate    2,783,039,776 bytes per MUT second
  Productivity  95.5% of total user, 57.3% of total elapsed

Not only did it get slower, it also allocates more than five times as much as before.
Given that the space involved is just 121KB maximum residency while processing a 124MB file, I'm not concerned about it.
I wouldn't either.
But it needs more space here, so I am concerned.
And the time required isn't a bad place to start from, I think.
By the way, as this implies, I can't reproduce your space behaviour at all.
That's surprising. Have you made sure to replace a pattern which does not occur in the text? Can you reproduce the behaviour with a) Data.List.intersperse instead of the lazier version now used, b) ghc-6.12.* instead of HEAD?
Anyway, I would've thought that with
split pat src
    | null pat        = emptyError "split"
    | isSingleton pat = splitBy (== head pat) src
    | otherwise       = go 0 (indices pat src) src
  where
    go _ [] cs = [cs]
    go !i (x:xs) cs =
        let h :*: t = splitAtWord (x-i) cs
        in  h : go (x+l) xs (dropWords l t)
    l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat
you can't start returning chunks before it's known whether the list of indices is empty, so split would have O(index of pattern) space behaviour.
If HEAD manages to make the chunks available before they are complete (before it's known how long they will be), it's even awesomer than I'd have dared to hope. Okay, so I'll have to try HEAD.
Doesn't do much here. Still leaking, still > 20 times slower than ByteString, even for the now far worse ByteString times. What's going on?
participants (8)
- Brandon S Allbery KF8NH
- Bryan O'Sullivan
- Daniel Fischer
- David Virebayre
- Ivan Lazar Miljenovic
- John Millikin
- Kevin Jardine
- Tako Schotanus