
On Tue, 2008-03-11 at 11:13 +0000, Dave Tapley wrote:
Just a few updates on this one:
I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement. Also this morning I tried using Data.HashMap with Bytestring's readInt and HashMap's hashInt.. The result was a Stack space overflow :(
God damn it I don't want to have to go back to C++ but soon will have no choice with time constraints :(
Just to keep haskel-cafe readers up-to-date with the discussion of this on #haskell... So if we take a step back and look at the size and form of the data. The input is a huge file of the form: 465 foo 1236 bar 594 baz And we have hundreds of megabytes of this. So more precisely it's lines with an integer (apparently guaranteed to fit into a 32bit unsigned word) a tab and a shortish string. The overall line length is around 10-15 characters in general. So let's look at some sizes: Let's assume the lines are 13 characters long (including \n terminator) so that's just 13 bytes. data Map k a = Tip | Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a) So an Data.Map node is 6 words. data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) {-# UNPACK #-} !Int -- offset {-# UNPACK #-} !Int -- length This is also 5 words (because ForeignPtr unpacks to 2 words). So if we make a Map ByteString ByteString and we have an entry per line then we need 5+5+6 words. On a 32bit machine that's (5+5+6)*4 = 64bytes. So 64bytes storage for each 13 byte line. So that's approximately a factor of 5. That explains the original observation. So what is a more memory efficient representation? Well we can eliminate the ByteString key by taking advantage of the fact that the key is always guaranteed to fit into a Word. So we can use an IntMap. An IntMap is 3-5 words per node (3 for leaf and 5 for internal for an average of 4). Then the ByteString map value has redundancy, we know that all of them are substrings of the same big ByteString so we could factor that out and use a: data SubStr = SubStr {-# UNPACK #-} !Int -- offset {-# UNPACK #-} !Int -- length That's 3 words. So that's still (4+3)*4 = 28 bytes per 13 byte line. To get something really compact we could use an index composed of three unboxed Int arrays. One for the key, one for the offset and one for the length. We'd keep the key array sorted and the other two arrays synchronised. We'd use a binary search on the key array and then look up the corresponding entries in the offset and length arrays. That representation uses 3 words = 12 bytes per line. We could reduce that to 10 or 9 bytes if we assume the key lengths fit in a Word16 or Word8. So that gets us down to just double the original input file size. We could further reduce this by making a new value ByteString consisting of just all the values concatenated together without the keys in between. That'd temporarily use another copy but that might be acceptable given that this index is going to be used heavily later with more data. Duncan