
dons:
Jefferson Heard wrote:
I'm using the Data.AltBinary package to read in a list of 4.8 million floats and 1.6 million ints. Doing so caused the memory footprint to blow up to more than 2gb, which on my laptop simply causes the program to crash. I can do it on my workstation, but I'd really rather not, because I want my program to be fairly portable.
The file that I wrote out in packing the data structure was only 28MB, so I assume I'm just using the wrong data structure, or I'm using full laziness somewhere I shouldn't be.
Here's a quick example of how to efficiently read and write such a structure to disk, compressing and decompressing on the fly.
    $ time ./A
    Wrote 4800000 floats, and 1600000 ints
    Read 4800000 floats, and 1600000 ints
    ./A  0.93s user 0.06s system 89% cpu 1.106 total
It uses Data.Binary to provide quick serialisation, and the zlib library to compress the resulting stream. It builds the tables in memory, compresses and writes the result to disk, reads them back in, and checks we read back the right number of CFloats and CInts. You'd then pass the CFloats over to your C library that needs them.
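The original A.hs isn't reproduced here, but a minimal sketch along those lines might look like the following. The Table type, the use of plain Float/Int lists rather than CFloat/CInt (which have no stock Binary instances), and the file path are assumptions made for illustration:

    import Data.Binary (encode, decode)
    import Codec.Compression.GZip (compress, decompress)
    import qualified Data.ByteString.Lazy as L

    -- Plain Float/Int lists stand in for the CFloat/CInt tables here,
    -- since CFloat/CInt would need hand-written Binary instances.
    type Table = ([Float], [Int])

    main :: IO ()
    main = do
        let floats = replicate 4800000 0 :: [Float]
            ints   = replicate 1600000 0 :: [Int]
            table  = (floats, ints) :: Table

        -- serialise the table, gzip the stream, and write it to disk
        L.writeFile "/tmp/table.gz" (compress (encode table))
        putStrLn $ "Wrote " ++ show (length floats) ++ " floats, and "
                            ++ show (length ints)   ++ " ints"

        -- read it back lazily, decompress, and deserialise
        bytes <- L.readFile "/tmp/table.gz"
        let (floats', ints') = decode (decompress bytes) :: Table
        putStrLn $ "Read " ++ show (length floats') ++ " floats, and "
                           ++ show (length ints')   ++ " ints"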
Compressing with zlib is a flourish, but cheap and simple, so we may as well do it. With zlib and Data.Binary, the core code just becomes:
encodeFile "/tmp/table.gz" table table' <- decodeFile "/tmp/table.gz"
This transparently streams the data through zlib, onto the disk, and back.
Simple and efficient.
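Note that the encodeFile and decodeFile used here are not the ones Data.Binary exports (those read and write the uncompressed stream); they are thin wrappers that splice in the gzip step. A minimal sketch of such wrappers, assuming the zlib package's Codec.Compression.GZip interface:

    import Data.Binary (Binary, encode, decode)
    import Codec.Compression.GZip (compress, decompress)
    import qualified Data.ByteString.Lazy as L

    -- Serialise, compress, and write the result in one lazy pipeline.
    encodeFile :: Binary a => FilePath -> a -> IO ()
    encodeFile f = L.writeFile f . compress . encode

    -- Read the file lazily, decompress the stream, and deserialise it.
    decodeFile :: Binary a => FilePath -> IO a
    decodeFile f = fmap (decode . decompress) (L.readFile f)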
Oh, and profiling this code:

    $ ghc -prof -auto-all -O2 --make A.hs
    $ ./A +RTS -p
    Wrote 4800000 floats, and 1600000 ints
    Read 4800000 floats, and 1600000 ints
    $ cat A.prof
        Mon Jul 9 12:44 2007 Time and Allocation Profiling Report (Final)

        total time  =        0.90 secs   (18 ticks @ 50 ms)
        total alloc =  26,087,140 bytes  (excludes profiling overheads)

    COST CENTRE    MODULE    %time  %alloc
    main           Main      100.0   100.0

Looks fine. We'd expect at least 25,600,000 bytes, and a little overhead for the runtime system. I note that the compressed file on disk is 26k too (yay for gzip on zeros ;)

-- Don