
"Dave Tapley"
I've upgraded to bytestring-0.9.0.5 from Darcs, with no improvement. This morning I also tried using Data.HashMap with ByteString's readInt and HashMap's hashInt; the result was a stack space overflow :(
That's not so good.
It works as required, loading K/V pairs into a Data.Map; the concern is the amount of memory used. Pausing (using a getLine) after the readFile, one can see (through 'ps v') that the whole file 'f' is loaded into memory.
However, once the Map is created, the memory footprint swells to around five times that size. Surely this can't just be overhead from Map? My reasoning was that, by using ByteString and getting the whole file into memory, only a small and linear increase would be seen for Map overhead.
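
For reference, a minimal sketch of the kind of loading code described above -- the space-separated key/value line format and the function name are assumptions, not taken from the original post:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map as M

    -- Read the whole file strictly, split each line at the first space,
    -- and build a Map from key to value.  Each ByteString here is only a
    -- (buffer, offset, length) slice into the original file contents.
    loadPairs :: FilePath -> IO (M.Map B.ByteString B.ByteString)
    loadPairs path = do
      f <- B.readFile path
      let pair l = let (k, v) = B.break (== ' ') l in (k, B.drop 1 v)
      return (M.fromList (map pair (B.lines f)))
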
What's the average line length? Roughly, the Data.Map will contain 2*#lines nodes at some bytes each, and the #lines leaf nodes will have pointers to two ByteStrings, which carry an overhead of three pointers each (IIRC: char array, offset, length). I'm not sure about the size of the Map's internal nodes, but I assume they hold at least two pointers to children, and possibly the size as well.

If we presume three words per node, we get about 6*#lines*wordsize of overhead, so 24*#lines bytes on a 32-bit architecture and 48*#lines bytes on a 64-bit one. Copying GC means that in reality you need -- or at least, the program will consume, if available¹ -- a good bit more. (And yes, since each string is just a pointer into the original file contents, you need to retain that as well.)

Short answer: Maps are expensive in terms of memory. One option could be to pack the keys into Ints (if they're short enough) and use Data.IntMap (sketched below). Another could be to just sort an Array of offsets into the file contents and use binary search.

-k

¹) If your program is thrashing, make sure you limit the heap to 80% of physical memory (use +RTS -Mxxxx).

--
If I haven't seen further, it is by standing in the footprints of giants
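
A rough illustration of the IntMap idea above -- the base-256 key packing and the assumption that keys fit in at most 7 bytes are mine, not part of the original suggestion:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.IntMap as IM
    import Data.Bits (shiftL, (.|.))
    import Data.Char (ord)

    -- Hypothetical packing: treat a short key (at most 7 bytes, so it
    -- fits in an Int on a 64-bit machine) as a base-256 number.
    packKey :: B.ByteString -> Int
    packKey = B.foldl' (\acc c -> (acc `shiftL` 8) .|. ord c) 0

    loadPairs :: FilePath -> IO (IM.IntMap B.ByteString)
    loadPairs path = do
      f <- B.readFile path
      let pair l = let (k, v) = B.break (== ' ') l
                   in (packKey k, B.drop 1 v)
      return (IM.fromList (map pair (B.lines f)))

Each entry then carries a plain Int key instead of a three-word ByteString header; the sorted-array-of-offsets alternative would go further still, keeping only one Int offset per line.
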