ByteStrings and the ram barrier

Hey libraries@,

Data.ByteStrings are strict, unboxed arrays of Word8s. They're great for fast processing of strings that fit in memory, but no good for lazy processing, or for dealing with data larger than your RAM (i.e. they're infeasible above 1G on a 2G RAM box).

Last week Duncan Coutts had the bright idea to write Data.ByteString.Lazy, a newtyped [ByteString], which would allow constant-space processing of chunks of strict ByteString arrays. The theory is that we'd be able to efficiently process large data, where both ByteString and [Char] fail, and open a new range of applications that could be handled successfully with Haskell.

By using list operations over the spine of the structure, and fast ByteString operations over the leaves, we should be able to get some of the benefits of both lazy lists and strict bytestrings. For example, to map over a lazy bytestring you do a list map along the spine of the list, and a bytestring map over the nodes:

    lazy bytestring map = L.map (B.map f) xs

So we hacked away at it for a few days, and here are some preliminary results. Simple task: filter the letter 'e' from a file.

------------------------------------------------------------------------
4G file, 1G RAM, Linux/x86, 3.20GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (128k chunks)   constant heap, total time 2m 20s, 25% cpu
    Data.List                            constant heap, total time 8 minutes (quite good!), 98% cpu
    sed 's/e//g'                         constant heap, I gave up after 10 minutes

------------------------------------------------------------------------
10G file, 1G RAM, OpenBSD/x86, 2.4GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (5M chunks)     constant heap, total time 11.44 minutes, 34% cpu
    Data.List                            constant heap, total time 18.21 minutes, 90% cpu

------------------------------------------------------------------------

A nice improvement for Haskell on multi-gigabyte files, I think!

Data.ByteString.Lazy is in fps head, here:

    http://www.cse.unsw.edu.au/~dons/code/fps/Data/ByteString/Lazy.hs

Cheers,
  Don
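To make the spine/leaves split concrete, here's a minimal sketch of the idea (not the actual fps code; the LPS constructor and the lmap/lfilter names are just for illustration):

    import qualified Data.ByteString as B
    import Data.Word (Word8)

    -- A lazy ByteString is just a lazy list of strict chunks.
    newtype LazyByteString = LPS [B.ByteString]

    -- List map along the spine, fast strict B.map over each leaf chunk.
    lmap :: (Word8 -> Word8) -> LazyByteString -> LazyByteString
    lmap f (LPS chunks) = LPS (map (B.map f) chunks)

    -- Same shape for filter, as in the 'e'-filtering benchmark above.
    lfilter :: (Word8 -> Bool) -> LazyByteString -> LazyByteString
    lfilter p (LPS chunks) = LPS (map (B.filter p) chunks)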

Hello Donald,

Friday, May 12, 2006, 10:07:47 AM, you wrote:

> Last week Duncan Coutts had the bright idea to write Data.ByteString.Lazy, a newtyped [ByteString], which would allow constant space processing of chunks of strict ByteString arrays.

I have a somewhat similar idea: implement a getContents variant that returns the list of LINES in a file, represented as [ByteString].

> Data.ByteString.Lazy (5M chunks)     constant heap, total time 11.44 minutes, 34% cpu

Try smaller chunks, and don't forget that the P4 has a 512-1024k cache.

--
Best regards,
 Bulat                          mailto:Bulat.Ziganshin@gmail.com
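Bulat's lines-as-[ByteString] getContents could look something like this sketch (hGetByteStringLines is a hypothetical name, not part of fps):

    import qualified Data.ByteString.Char8 as B
    import System.IO (Handle, hIsEOF)
    import System.IO.Unsafe (unsafeInterleaveIO)

    -- Sketch of the idea: lazily read a handle as a list of lines,
    -- each line held as a strict ByteString.
    hGetByteStringLines :: Handle -> IO [B.ByteString]
    hGetByteStringLines h = go
      where
        go = unsafeInterleaveIO $ do
          eof <- hIsEOF h
          if eof
            then return []
            else do
              line <- B.hGetLine h
              rest <- go
              return (line : rest)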

bulat.ziganshin:

> Hello Donald,
>
> Friday, May 12, 2006, 10:07:47 AM, you wrote:
>
> > Last week Duncan Coutts had the bright idea to write Data.ByteString.Lazy, a newtyped [ByteString], which would allow constant space processing of chunks of strict ByteString arrays.
>
> I have a somewhat similar idea: implement a getContents variant that returns the list of LINES in a file, represented as [ByteString].
>
> > Data.ByteString.Lazy (5M chunks)     constant heap, total time 11.44 minutes, 34% cpu
>
> Try smaller chunks, and don't forget that the P4 has a 512-1024k cache.

In fact, here's a graph of time versus chunk size, for my Pentium M laptop with a 256k L2 cache:

    http://www.cse.unsw.edu.au/~dons/tmp/chunksize_v_cache.png

Lesson: the chunk size should be between 0.5x and 1x the L2 cache, and things get much worse as soon as you go beyond your cache size.

-- Don
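For anyone who wants to repeat the tuning against their own cache, here's a rough sketch of a chunk-wise driver with a configurable chunk size (filterFileWith is a made-up name, not the benchmark code used above):

    import qualified Data.ByteString as B
    import System.IO
    import Control.Monad (unless)

    -- Hypothetical driver: stream a file through a chunk-wise transform,
    -- reading in chunks of a configurable size so the chunk size can be
    -- tuned against the L2 cache.
    filterFileWith :: Int -> (B.ByteString -> B.ByteString) -> FilePath -> FilePath -> IO ()
    filterFileWith chunkSize f inFile outFile =
      withBinaryFile inFile ReadMode $ \hin ->
        withBinaryFile outFile WriteMode $ \hout -> do
          let loop = do
                chunk <- B.hGet hin chunkSize
                unless (B.null chunk) $ do
                  B.hPut hout (f chunk)
                  loop
          loop

    -- e.g. 128k chunks, dropping every 'e' (ASCII 101):
    -- filterFileWith (128 * 1024) (B.filter (/= 101)) "in" "out"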

bulat.ziganshin:

> I have a somewhat similar idea: implement a getContents variant that returns the list of LINES in a file, represented as [ByteString].
>
> > Data.ByteString.Lazy (5M chunks)     constant heap, total time 11.44 minutes, 34% cpu
>
> Try smaller chunks, and don't forget that the P4 has a 512-1024k cache.

After tuning the chunk size to the cache size, the filter 'e' test now runs in 4.52 minutes at 53% cpu, which I think is pretty spectacular for 10 gigabytes of data.

-- Don

On Fri, May 12, 2006 at 04:07:47PM +1000, Donald Bruce Stewart wrote:
> The theory is that we'd be able to efficiently process large data, where both ByteString and [Char] fail, and open a new range of applications that could be handled successfully with Haskell.

My large data files are already divided into reasonably sized chunks, and I think this approach is quite widespread - at least Google also processes much of its data in chunks.

To process my data with Haskell, I would have to be able to decode it into records with efficiency close to what I achieve in C++ (say, at least 80% as fast). So far I've managed to reach 33% of it in reasonably pure Haskell, which is not that bad IMO. However, I feel that I've hit a barrier now and will have to use raw Ptr's or the FFI. Maybe I could try pushing it through the haskell-cafe optimisation ;-)

Anyway, the point is that large data tends to be divided into smaller chunks not only because it's impossible to load the whole file into memory, but also to allow random access, to help distribute the computation over many computers, etc. So I am not sure Haskell would gain that much by being able to process terabytes of data in one go. On the other hand, this is quite cool, and I am probably wrong, being focused on my own needs.

Best regards,
Tomasz

On Tue, May 16, 2006 at 11:10:04AM +0200, Tomasz Zielonka wrote:
> To process my data with Haskell, I would have to be able to decode it into records with efficiency close to what I achieve in C++ (say, at least 80% as fast). So far I've managed to reach 33% of it in reasonably pure Haskell, which is not that bad IMO.

I forgot to add that I use Data.ByteString and mostly exercised the head/tail/drop/take operations. I am already amazed that those high-level operations give me such good performance. I mean, I know what tricks you use to make it possible, but it's remarkable anyway...

Like I said before, I will probably have to juggle Ptr's to get past this 33% barrier. But I will surely still use ByteString, at least for reading data from files and for keeping the decoded strings.

BTW, would you recommend ByteString for keeping thousands of small decoded strings? Their size will be around 50-100 characters (they are mostly URIs) and their expected lifetime will be quite short (I expect to hold around 10-100 such strings in memory 97% of the time, and up to 1000000 strings in some rare cases).

Thanks for working on ByteString!

Best regards,
Tomasz
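The record decoding Tomasz describes might look something like this sketch (the 2-byte length-prefixed record format and the decodeRecords name are made up for illustration; only index/take/drop-style ByteString operations are used):

    import qualified Data.ByteString as B
    import Data.Bits (shiftL, (.|.))

    -- Hypothetical record format: a 2-byte big-endian length followed by
    -- that many payload bytes (e.g. a URI). Decoding sticks to the
    -- take/drop/index operations mentioned above.
    decodeRecords :: B.ByteString -> [B.ByteString]
    decodeRecords bs
      | B.length bs < 2 = []
      | otherwise =
          let hi      = fromIntegral (B.index bs 0) :: Int
              lo      = fromIntegral (B.index bs 1) :: Int
              len     = (hi `shiftL` 8) .|. lo
              payload = B.take len (B.drop 2 bs)
              rest    = B.drop (2 + len) bs
          in payload : decodeRecords rest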
participants (3)

- Bulat Ziganshin
- dons@cse.unsw.edu.au
- Tomasz Zielonka