Hey libraries@,

Data.ByteStrings are strict, unboxed arrays of Word8s. They're great for fast processing of strings that fit in memory, but no good for lazy processing, or for dealing with data larger than your RAM (i.e. they're infeasible above 1G on a 2G RAM box).

Last week Duncan Coutts had the bright idea to write Data.ByteString.Lazy, a newtyped [ByteString], which allows constant-space processing of chunks of strict ByteString arrays. The theory is that we'd be able to efficiently process large data, where both ByteString and [Char] fail, and open up a new range of applications that could be handled successfully with Haskell. By using list operations over the spine of the structure, and fast ByteString operations over the leaves, we should be able to get some of the benefits of both lazy lists and strict bytestrings. For example, to map over a lazy bytestring you do a list map along the spine of the list, and a bytestring map over the nodes:

    lazy bytestring map = L.map (B.map f) xs

So we hacked away at it for a few days, and here are some preliminary results.

Simple task: filter the letter 'e' from a file.

------------------------------------------------------------------------
4G file, 1G RAM, Linux/x86, 3.20GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (128k chunks)
        constant heap, total time 2m 20s, 25% cpu

    Data.List
        constant heap, total time 8 minutes (quite good!), 98% cpu

    sed 's/e//g'
        constant heap, I gave up after 10 minutes

------------------------------------------------------------------------
10G file, 1G RAM, OpenBSD/x86, 2.4GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (5M chunks)
        constant heap, total time 11.44 minutes, 34% cpu

    Data.List
        constant heap, total time 18.21 minutes, 90% cpu

------------------------------------------------------------------------

A nice improvement for Haskell on multi-gigabyte files, I think!

Data.ByteString.Lazy is in fps head, here:

    http://www.cse.unsw.edu.au/~dons/code/fps/Data/ByteString/Lazy.hs

Cheers,
  Don
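P.S. For anyone who wants to play with the idea without pulling fps head, here's a rough sketch of the spine/leaf trick described above: a lazy spine of strict chunks, with list operations over the spine and strict ByteString operations over each chunk. The LPS constructor and the lmap/lfilter names here are just illustrative, not the real module's API:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Word (Word8)

-- A lazy bytestring is (morally) a lazy list of strict chunks.
newtype LazyByteString = LPS [B.ByteString]

-- map: a list map along the spine, a strict B.map over each chunk.
lmap :: (Word8 -> Word8) -> LazyByteString -> LazyByteString
lmap f (LPS chunks) = LPS (map (B.map f) chunks)

-- filter works the same way; chunks shrink but the spine is untouched,
-- so only one chunk needs to be in memory at a time when consumed lazily.
lfilter :: (Word8 -> Bool) -> LazyByteString -> LazyByteString
lfilter p (LPS chunks) = LPS (map (B.filter p) chunks)

-- e.g. the 'filter the letter e' benchmark, per chunk:
--   lfilter (/= 101) (LPS [C.pack "hello there"])
```

Since both lmap and lfilter are just a lazy list map over the spine, a consumer that writes chunks out as it goes runs in space proportional to the chunk size, not the file size.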