
dons:
Jefferson Heard wrote:
I'm using the Data.AltBinary package to read in a list of 4.8 million floats and 1.6 million ints. Doing so caused the memory footprint to blow up to more than 2gb, which on my laptop simply causes the program to crash. I can do it on my workstation, but I'd really rather not, because I want my program to be fairly portable.
The file that I wrote out in packing the data structure was only 28MB, so I assume I'm just using the wrong data structure, or I'm using full laziness somewhere I shouldn't be.
Here's a quick example of how to efficiently read and write such a structure to disk, compressing and decompressing on the fly.
    $ time ./A
    Wrote 4800000 floats, and 1600000 ints
    Read 4800000 floats, and 1600000 ints
    ./A  0.93s user 0.06s system 89% cpu 1.106 total
It uses Data.Binary to provide quick serialisation, and the zlib library to compress the resulting stream. It builds the tables in memory, compresses and writes the result to disk, reads them back in, and checks we read back the right number of CFloats and CInts. You'd then pass the CFloats over to your C library that needs them.
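The original A.hs isn't reproduced here, but a minimal sketch along those lines might look like the following. The Table type, the use of plain Float/Int lists rather than CFloat/CInt (which have no stock Binary instances), and the file path are assumptions made for illustration:

    import Data.Binary (encode, decode)
    import Codec.Compression.GZip (compress, decompress)
    import qualified Data.ByteString.Lazy as L

    -- Plain Float/Int lists stand in for the CFloat/CInt tables here,
    -- since CFloat/CInt would need hand-written Binary instances.
    type Table = ([Float], [Int])

    main :: IO ()
    main = do
        let floats = replicate 4800000 0 :: [Float]
            ints   = replicate 1600000 0 :: [Int]
            table  = (floats, ints) :: Table

        -- serialise the table, gzip the stream, and write it to disk
        L.writeFile "/tmp/table.gz" (compress (encode table))
        putStrLn $ "Wrote " ++ show (length floats) ++ " floats, and "
                            ++ show (length ints)   ++ " ints"

        -- read it back lazily, decompress, and deserialise
        bytes <- L.readFile "/tmp/table.gz"
        let (floats', ints') = decode (decompress bytes) :: Table
        putStrLn $ "Read " ++ show (length floats') ++ " floats, and "
                           ++ show (length ints')   ++ " ints"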
Compressing with zlib is a flourish, but cheap and simple, so we may as well do it. With zlib and Data.Binary, the core code just becomes:
encodeFile "/tmp/table.gz" table table' <- decodeFile "/tmp/table.gz"
This transparently streams the data through zlib, onto the disk, and back.
Simple and efficient.
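Note that the encodeFile and decodeFile used here are not the ones Data.Binary exports (those read and write the uncompressed stream); they are thin wrappers that splice in the gzip step. A minimal sketch of such wrappers, assuming the zlib package's Codec.Compression.GZip interface:

    import Data.Binary (Binary, encode, decode)
    import Codec.Compression.GZip (compress, decompress)
    import qualified Data.ByteString.Lazy as L

    -- Serialise, compress, and write the result in one lazy pipeline.
    encodeFile :: Binary a => FilePath -> a -> IO ()
    encodeFile f = L.writeFile f . compress . encode

    -- Read the file lazily, decompress the stream, and deserialise it.
    decodeFile :: Binary a => FilePath -> IO a
    decodeFile f = fmap (decode . decompress) (L.readFile f)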
Oh, and profiling this code:

    $ ghc -prof -auto-all -O2 --make A.hs
    $ ./A +RTS -p
    Wrote 4800000 floats, and 1600000 ints
    Read 4800000 floats, and 1600000 ints
    $ cat A.prof
        Mon Jul 9 12:44 2007 Time and Allocation Profiling Report (Final)

        total time  =        0.90 secs   (18 ticks @ 50 ms)
        total alloc =  26,087,140 bytes  (excludes profiling overheads)

    COST CENTRE    MODULE    %time  %alloc
    main           Main      100.0   100.0

Looks fine. We'd expect at least 25,600,000 bytes, and a little overhead for the runtime system. I note that the compressed file on disk is 26k too (yay for gzip on zeros ;)

-- Don