Constructing Data.Map from ByteString

Hi all,

I've been plugging away at this all day and some discussion in #haskell has been fruitless. Perhaps you have the inspiration to see what's happening!

Concerning this minimal example: http://hpaste.org/6268

It works as required, loading K/V pairs into a Data.Map; the concern is the amount of memory used. Pausing (using a getLine) after the readFile, one can see (through 'ps v') that the whole file 'f' is loaded into memory. However, once the Map is created the memory footprint swells to around five times the size. Surely this can't be just overhead from Map? My reasoning was that by using ByteString and getting the whole file into memory, only a small and linear increase would be seen for the Map overhead.

I have tried using both Data.ByteString.Char8 and Data.ByteString.Lazy.Char8 with negligible difference. For a hoot I tried it with String and yes, it's ridiculous :)

Cheers,
Dave
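The hpaste link above has since expired. Judging from the details mentioned downthread (readFile, splitting each line on a tab with takeWhile/dropWhile, and getLine pauses), the minimal example was presumably something like this sketch; the file name and the parseLine helper are my guesses, not the original paste:

```haskell
module Main where

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Split one "key\tvalue" line into a pair.
parseLine :: B.ByteString -> (B.ByteString, B.ByteString)
parseLine l =
  (B.takeWhile (/= '\t') l, B.drop 1 (B.dropWhile (/= '\t') l))

main :: IO ()
main = do
  f <- B.readFile "data.txt"   -- whole file now resident in memory
  _ <- getLine                 -- pause here and inspect with 'ps v'
  let m = M.fromList (map parseLine (B.lines f))
  _ <- getLine                 -- pause again: the Map is now built
  print (M.size m)
```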

Just a few updates on this one:
I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement.
Also this morning I tried using Data.HashMap with ByteString's readInt
and HashMap's hashInt.. The result was a Stack space overflow :(
God damn it I don't want to have to go back to C++ but soon will have
no choice with time constraints :(
Dave,
On 10/03/2008, Dave Tapley wrote:
> Hi all,
> I've been plugging away at this all day and some discussion in #haskell has been fruitless. Perhaps you have the inspiration to see what's happening!
> Concerning this minimal example: http://hpaste.org/6268
> It works as required, loading K/V pairs into a Data.Map, the concern is the amount of memory used. Pausing (using a getLine) after the readFile one can see (through 'ps v') that the whole file 'f' is loaded into memory. However once the Map is created the memory footprint swells to around five times the size. Surely this can't be just overhead from Map? My reasoning was that by using ByteString and getting the whole file into memory a small and linear increase would be seen for Map overhead.
> I have tried using both Data.ByteString.Char8 and Data.ByteString.Lazy.Char8 with negligible difference. For a hoot I tried it with String and yes, it's ridiculous :)
> Cheers,
> Dave

On Tue, Mar 11, 2008 at 4:13 AM, Dave Tapley wrote:
> I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement. Also this morning I tried using Data.HashMap with ByteString's readInt and HashMap's hashInt.. The result was a Stack space overflow :(

Map is probably a better choice than HashMap. Exactly how large is the input? Like I said, I got about 2x bloat (measured by RSS); maybe you can send a fuller example?

AGL

--
Adam Langley  agl@imperialviolet.org  http://www.imperialviolet.org

On Tue, 2008-03-11 at 11:13 +0000, Dave Tapley wrote:
> Just a few updates on this one:
> I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement. Also this morning I tried using Data.HashMap with ByteString's readInt and HashMap's hashInt.. The result was a Stack space overflow :(
> God damn it I don't want to have to go back to C++ but soon will have no choice with time constraints :(
Just to keep haskell-cafe readers up-to-date with the discussion of this on #haskell...

So if we take a step back and look at the size and form of the data: the input is a huge file of the form

    465     foo
    1236    bar
    594     baz

and we have hundreds of megabytes of this. More precisely, it's lines with an integer (apparently guaranteed to fit into a 32-bit unsigned word), a tab, and a shortish string. The overall line length is around 10-15 characters in general.

So let's look at some sizes. Let's assume the lines are 13 characters long (including the '\n' terminator), so each line is just 13 bytes.

    data Map k a = Tip
                 | Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a)

So a Data.Map node is 6 words.

    data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
                         {-# UNPACK #-} !Int  -- offset
                         {-# UNPACK #-} !Int  -- length

This is 5 words (because the ForeignPtr unpacks to 2 words). So if we make a Map ByteString ByteString with an entry per line, then we need 5+5+6 words per entry. On a 32-bit machine that's (5+5+6)*4 = 64 bytes: 64 bytes of storage for each 13-byte line. That's approximately a factor of 5, which explains the original observation.

So what is a more memory-efficient representation? Well, we can eliminate the ByteString key by taking advantage of the fact that the key is always guaranteed to fit into a Word, so we can use an IntMap. An IntMap node is 3-5 words (3 for a leaf, 5 for an internal node, for an average of about 4). The ByteString value also has redundancy: we know all the values are substrings of the same big ByteString, so we could factor that out and use a

    data SubStr = SubStr {-# UNPACK #-} !Int  -- offset
                         {-# UNPACK #-} !Int  -- length

That's 3 words. So that's still (4+3)*4 = 28 bytes per 13-byte line.

To get something really compact we could use an index composed of three unboxed Int arrays: one for the keys, one for the offsets and one for the lengths. We'd keep the key array sorted and the other two arrays synchronised, use a binary search on the key array, and then look up the corresponding entries in the offset and length arrays. That representation uses 3 words = 12 bytes per line. We could reduce that to 10 or 9 bytes if we assume the lengths fit in a Word16 or Word8.

So that gets us down to just double the original input file size. We could reduce this further by making a new value ByteString consisting of just all the values concatenated together, without the keys in between. That would temporarily use another copy, but that might be acceptable given that this index is going to be used heavily later with more data.

Duncan
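A sketch (mine, not code from the thread) of the three-unboxed-array index Duncan describes; buildIndex and lookupIx are hypothetical names, and the entries are assumed to arrive as (key, offset, length) triples pointing into the original file:

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Array.Unboxed
import Data.List (sortOn)

-- Three parallel unboxed arrays: sorted keys, plus the offset and
-- length of each value within the original file ByteString.
data Index = Index
  { ixKeys :: UArray Int Int
  , ixOffs :: UArray Int Int
  , ixLens :: UArray Int Int
  }

buildIndex :: [(Int, Int, Int)] -> Index   -- (key, offset, length)
buildIndex entries =
  let sorted = sortOn (\(k, _, _) -> k) entries
      bnds   = (0, length sorted - 1)
  in Index (listArray bnds [k | (k, _, _) <- sorted])
           (listArray bnds [o | (_, o, _) <- sorted])
           (listArray bnds [l | (_, _, l) <- sorted])

-- Binary search on the key array; on a hit, slice the value out of
-- the original file contents using the offset and length arrays.
lookupIx :: B.ByteString -> Index -> Int -> Maybe B.ByteString
lookupIx file (Index ks os ls) key = go lo0 hi0
  where
    (lo0, hi0) = bounds ks
    go lo hi
      | lo > hi   = Nothing
      | otherwise =
          let mid = (lo + hi) `div` 2
          in case compare key (ks ! mid) of
               LT -> go lo (mid - 1)
               GT -> go (mid + 1) hi
               EQ -> Just (B.take (ls ! mid) (B.drop (os ! mid) file))
```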

Duncan Coutts wrote:
> To get something really compact we could use an index composed of three unboxed Int arrays.
To get something *really* compact, we could build a (kind of) suffix array. That is, we do a lexical sort of the lines and store the sorted offsets of the lines in an array. You can then look up any key by binary search using this array. That's 10-15 characters plus one offset (4 bytes) per line, ~1.3 times the original file.

(There are also compressed suffix array structures, but I don't think those will gain you anything if you don't store all the positions.)

-k
--
If I haven't seen further, it is by standing in the footprints of giants
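A sketch of this offsets-only scheme (my code, assuming the tab-separated lines described earlier in the thread): one Int per line, sorted by the key text at that offset, with keys compared in place during the binary search:

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Array.Unboxed
import Data.List (sortBy)
import Data.Ord (comparing)

-- Offsets of every line start in the file.
lineOffsets :: B.ByteString -> [Int]
lineOffsets file =
  0 : [ i + 1 | i <- B.elemIndices '\n' file, i + 1 < B.length file ]

-- The key at a line offset: the text before the tab.
keyAt :: B.ByteString -> Int -> B.ByteString
keyAt file off = B.takeWhile (/= '\t') (B.drop off file)

-- One Int of index per line: the line offsets, lexically sorted by key.
buildIdx :: B.ByteString -> UArray Int Int
buildIdx file =
  let offs = sortBy (comparing (keyAt file)) (lineOffsets file)
  in listArray (0, length offs - 1) offs

-- Binary search that re-reads the key at each probed offset.
lookupKey :: B.ByteString -> UArray Int Int -> B.ByteString
          -> Maybe B.ByteString
lookupKey file idx key = go lo0 hi0
  where
    (lo0, hi0) = bounds idx
    go lo hi
      | lo > hi   = Nothing
      | otherwise =
          let mid = (lo + hi) `div` 2
              off = idx ! mid
          in case compare key (keyAt file off) of
               LT -> go lo (mid - 1)
               GT -> go (mid + 1) hi
               EQ -> Just . B.takeWhile (/= '\n')
                          . B.drop 1 . B.dropWhile (/= '\t')
                          $ B.drop off file
```

Note that the keys are compared as text here, matching the lexical sort; numeric ordering would need the keys parsed at comparison time instead.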

"Dave Tapley"
I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement. Also this morning I tried using Data.HashMap with Bytestring's readInt and HashMap's hashInt.. The result was a Stack space overflow :(
That's not so good.
> It works as required, loading K/V pairs into a Data.Map, the concern is the amount of memory used. Pausing (using a getLine) after the readFile one can see (through 'ps v') that the whole file 'f' is loaded into memory.
> However once the Map is created the memory footprint swells to around five times the size. Surely this can't be just overhead from Map? My reasoning was that by using ByteString and getting the whole file into memory a small and linear increase would be seen for Map overhead.
What's the average line length? Roughly, the Data.Map will contain 2*#lines nodes at some bytes each, and the #lines leaf nodes will have pointers to two ByteStrings, which carry an overhead of three words each (IIRC: char array, offset, length). Not sure about the size of Map internal nodes, but I assume it will be at least two pointers to children, and possibly the size as well. If we presume three words per node, we get about 6*#lines*wordsize overhead, so 24*#lines on a 32-bit architecture and 48*#lines on a 64-bit one. Copying GC means that in reality you need -- or at least, the program will consume, if available¹ -- a good bit more. (And yes, since each string is just a pointer into the original file contents, you need to retain that as well.)

Short answer: Maps are expensive in terms of memory. One option could be to pack the keys into Ints (if they're short enough) and use Data.IntMap. Another could be to just sort an array of offsets into the file contents and use binary search.

-k

¹) If your program is thrashing, make sure you limit the heap to 80% of physical memory (use +RTS -Mxxxx).

--
If I haven't seen further, it is by standing in the footprints of giants

On Mon, Mar 10, 2008 at 4:33 PM, Dave Tapley wrote:
> Concerning this minimal example: http://hpaste.org/6268
Change the Data.ByteString.Char8 to Data.ByteString.Lazy.Char8; also, the takeWhile, dropWhile pair is better expressed using span. With those two changes, and a 150MB test file, a compiled binary uses about 220MB of resident space. That seems pretty reasonable to me.

As a purely subjective aesthetic point: importing some modules qualified will help your sanity. Otherwise you have lots of common function names (readFile etc.) and you have to try and figure out which module they resolve to all the time!

Cheers,
AGL

--
Adam Langley  agl@imperialviolet.org  http://www.imperialviolet.org
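For reference, here is what the span rewrite looks like on a hypothetical parseLine (the paste itself is gone, so the helper name is an assumption):

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- Before: the takeWhile/dropWhile pair walks the key prefix twice.
--   (BL.takeWhile (/= '\t') l, BL.drop 1 (BL.dropWhile (/= '\t') l))
-- After: span makes the same split in a single traversal.
parseLine :: BL.ByteString -> (BL.ByteString, BL.ByteString)
parseLine l =
  let (key, rest) = BL.span (/= '\t') l
  in (key, BL.drop 1 rest)
```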

On Mon, Mar 10, 2008 at 11:33:58PM +0000, Dave Tapley wrote:
> I've been plugging away at this all day and some discussion in #haskell has been fruitless. Perhaps you have the inspiration to see what's happening!
> Concerning this minimal example: http://hpaste.org/6268
> It works as required, loading K/V pairs into a Data.Map, the concern is the amount of memory used. Pausing (using a getLine) after the readFile one can see (through 'ps v') that the whole file 'f' is loaded into memory. However once the Map is created the memory footprint swells to around five times the size. Surely this can't be just overhead from Map? My reasoning was that by using ByteString and getting the whole file into memory a small and linear increase would be seen for Map overhead.
Just looking at the code, I'd guess that it's making at least three copies of the data in total, plus Map overhead, so five times larger before garbage collection isn't completely outrageous. (Unless my intuition of what ByteString is doing is completely off-base of course.) Don't forget that Map construction is lazy in its value argument by default: Lots of potential for thunks to hang around the place.
> I have tried using both Data.ByteString.Char8 and Data.ByteString.Lazy.Char8 with negligible difference. For a hoot I tried it with String and yes, it's ridiculous :)
Strings won't help, since you're modifying the data, so you get to pay the String overhead without getting any sharing advantage from the String list implementation. You might do better feeding the input to a proper parser to build the map -- at least then you'd only be copying the input ByteString data once. I don't think you can do much better than that in an immutable world.

Phil

--
http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt
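One way to act on both of Phil's points (a sketch of mine, not his code): build the map in a single strict pass, copying each key and value out of the input exactly once, so the finished map neither accumulates thunks nor pins the whole input file:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import Data.List (foldl')

-- Build the map in one strict left fold over the lines. B.copy gives
-- each key and value its own storage, so the finished map does not
-- retain the entire input ByteString; forcing both before insertion
-- avoids the thunk build-up that Map's lazy values otherwise allow.
buildMap :: B.ByteString -> M.Map B.ByteString B.ByteString
buildMap = foldl' step M.empty . B.lines
  where
    step m l =
      let (k, rest) = B.span (/= '\t') l
          k'        = B.copy k
          v         = B.copy (B.drop 1 rest)
      in k' `seq` v `seq` M.insert k' v m
```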

dave.a.tapley wrote:
> Hi all,
> I've been plugging away at this all day and some discussion in #haskell has been fruitless. Perhaps you have the inspiration to see what's happening!
> Concerning this minimal example: http://hpaste.org/6268
> It works as required, loading K/V pairs into a Data.Map, the concern is the amount of memory used. Pausing (using a getLine) after the readFile one can see (through 'ps v') that the whole file 'f' is loaded into memory. However once the Map is created the memory footprint swells to around five times the size. Surely this can't be just overhead from Map? My reasoning was that by using ByteString and getting the whole file into memory a small and linear increase would be seen for Map overhead.
> I have tried using both Data.ByteString.Char8 and Data.ByteString.Lazy.Char8 with negligible difference. For a hoot I tried it with String and yes, it's ridiculous :)
Speaking to you on #haskell we worked out that the keys are integers and the elements can be bytestrings, so an IntMap ByteString seems like a good idea. The attached builds the Map directly (avoiding the lines-splitting of the file), and seems to use around half the heap of the generic Map. It also builds much faster. Still, for big data, I'm not sure that a more specialised structure wouldn't be better.

-- Don
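The attachment hasn't survived in the archive. A direct IntMap build along the lines Don describes might look roughly like this sketch (buildIntMap is an assumed name and the real attachment may differ), walking the buffer with ByteString's readInt instead of splitting it into lines first:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.IntMap as IM

-- Walk the buffer directly: readInt parses the key and returns the
-- remainder, then the value is split off at the newline. No
-- intermediate list of lines is ever built.
buildIntMap :: B.ByteString -> IM.IntMap B.ByteString
buildIntMap = go IM.empty
  where
    go m bs
      | B.null bs = m
      | otherwise =
          case B.readInt bs of
            Nothing        -> m   -- malformed input: stop here
            Just (k, rest) ->
              let (v, rest') = B.break (== '\n') (B.drop 1 rest)
              in go (IM.insert k v m) (B.drop 1 rest')
```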
participants (6)
- Adam Langley
- Dave Tapley
- Don Stewart
- Duncan Coutts
- Ketil Malde
- Philip Armstrong