Hey libraries@,

Data.ByteStrings are strict, unboxed arrays of Word8s. They're great for fast processing of strings that fit in memory, but no good for lazy processing, or for dealing with data larger than your RAM (i.e. they're infeasible above 1G on a 2G RAM box).

Last week Duncan Coutts had the bright idea to write Data.ByteString.Lazy, a newtyped [ByteString], which allows constant-space processing of chunks of strict ByteString arrays. The theory is that we'd be able to efficiently process large data, where both ByteString and [Char] fail, and open up a new range of applications that could be handled successfully with Haskell. By using list operations over the spine of the structure, and fast ByteString operations over the leaves, we should be able to get some of the benefits of both lazy lists and strict bytestrings. For example, to map over a lazy bytestring you do a list map along the spine of the list, and a bytestring map over the nodes:

    lazy bytestring map = L.map (B.map f) xs

So we hacked away at it for a few days, and here are some preliminary results.

Simple task: filter the letter 'e' from a file.

------------------------------------------------------------------------
4G file, 1G RAM, Linux/x86, 3.20GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (128k chunks)
        constant heap, total time 2m 20s, 25% cpu

    Data.List
        constant heap, total time 8 minutes (quite good!), 98% cpu

    sed 's/e//g'
        constant heap, I gave up after 10 minutes

------------------------------------------------------------------------
10G file, 1G RAM, OpenBSD/x86, 2.4GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (5M chunks)
        constant heap, total time 11.44 minutes, 34% cpu

    Data.List
        constant heap, total time 18.21 minutes, 90% cpu

------------------------------------------------------------------------

A nice improvement for Haskell on multi-gigabyte files, I think!

Data.ByteString.Lazy is in fps head, here:

    http://www.cse.unsw.edu.au/~dons/code/fps/Data/ByteString/Lazy.hs

Cheers,
  Don
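P.S. For anyone who wants to play with the idea without pulling fps head, here's a rough sketch of the spine/leaf trick described above: a lazy spine of strict chunks, with list operations over the spine and strict ByteString operations over each chunk. The LPS constructor and the lmap/lfilter names here are just illustrative, not the real module's API:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Word (Word8)

-- A lazy bytestring is (morally) a lazy list of strict chunks.
newtype LazyByteString = LPS [B.ByteString]

-- map: a list map along the spine, a strict B.map over each chunk.
lmap :: (Word8 -> Word8) -> LazyByteString -> LazyByteString
lmap f (LPS chunks) = LPS (map (B.map f) chunks)

-- filter works the same way; chunks shrink but the spine is untouched,
-- so only one chunk needs to be in memory at a time when consumed lazily.
lfilter :: (Word8 -> Bool) -> LazyByteString -> LazyByteString
lfilter p (LPS chunks) = LPS (map (B.filter p) chunks)

-- e.g. the 'filter the letter e' benchmark, per chunk:
--   lfilter (/= 101) (LPS [C.pack "hello there"])
```

Since both lmap and lfilter are just a lazy list map over the spine, a consumer that writes chunks out as it goes runs in space proportional to the chunk size, not the file size.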