Compressed Data.Map for more efficient RAM usage?

Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet. It seems like we could trade off some speed for space savings by compressing the values in RAM.

Lemmih has previously created compact-map:

http://hackage.haskell.org/package/compact-map

which mightybyte used to create:

https://github.com/mightybyte/compact-ixset

compact-map converts the Haskell values to a more compact binary representation. But could we do even better by compressing the bytestrings?

Here is a useless implementation:

http://hpaste.org/63836

That version compresses each value by itself. But that is probably going to make RAM usage worse, not better. I think what we really need to do is store a dictionary in CMap that is used to compress all the ByteString values. That way, if the same string appears in 10000 entries, they can all be compressed using the same compression code.

One question is: how much would this benefit us? As an experiment, I tried compressing one checkpoint file for a real-world acid-state app that I am involved with. A checkpoint file contains binary data that is serialized by the SafeCopy library, and it is data that is currently stored in an IxSet in RAM. The uncompressed checkpoint file is 115MB; the compressed checkpoint file is 26MB.

A checkpoint file probably contains more redundancy than an equivalent compressed IxSet, because in addition to the values it also includes version tags and other easily compressed data. But those numbers do look promising. At the very least, they do not rule out the possibility of some big savings.

I do not have time to explore this any more, but I think it could be a fun project for someone to work on. ;) I imagine a prototype with a shared dictionary could be hacked up in a few days. A bit more work would be required to document it, memory profile it, etc. If done well, we should be able to reuse the incremental compression code in Data.Map, Data.Set, Data.IxSet, HiggsSet, and other places.

But just focusing on Data.Map is a good start. I will create a bug in this bug tracker that links back to this message: http://code.google.com/p/happstack/issues/list

- jeremy
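A minimal sketch of the shared-dictionary idea, not a definitive implementation: `CMap`, `cmDict`, and `cmMap` are hypothetical names, and this assumes the zlib package's preset-dictionary support (the `compressDictionary` / `decompressDictionary` fields, available in zlib >= 0.6) plus the binary package for serialization:

```haskell
module CMap (CMap, empty, insert, lookup) where

import Prelude hiding (lookup)
import qualified Codec.Compression.Zlib as Z
import Codec.Compression.Zlib
  ( defaultCompressParams, defaultDecompressParams
  , CompressParams(..), DecompressParams(..) )
import Data.Binary (Binary, encode, decode)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import qualified Data.Map as Map

-- A Map whose values are serialized and then zlib-compressed against
-- one shared preset dictionary, so a string that occurs in 10000
-- entries is cheap in every one of them.
data CMap k v = CMap
  { cmDict :: S.ByteString            -- shared compression dictionary
  , cmMap  :: Map.Map k L.ByteString  -- compressed, serialized values
  }

empty :: S.ByteString -> CMap k v
empty dict = CMap dict Map.empty

insert :: (Ord k, Binary v) => k -> v -> CMap k v -> CMap k v
insert k v (CMap dict m) = CMap dict (Map.insert k bytes m)
  where
    bytes  = Z.compressWith params (encode v)
    params = defaultCompressParams { compressDictionary = Just dict }

lookup :: (Ord k, Binary v) => k -> CMap k v -> Maybe v
lookup k (CMap dict m) =
    decode . Z.decompressWith params <$> Map.lookup k m
  where
    params = defaultDecompressParams { decompressDictionary = Just dict }
```

A real version would also want to build the dictionary from the data itself (e.g. from the most frequent substrings) rather than taking it as a fixed argument.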

On Thu, Feb 16, 2012 at 3:51 PM, Jeremy Shaw wrote:

> Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet.
> It seems like we could trade off some speed for space savings by compressing the values in RAM.
> Lemmih has previously created compact-map:
> http://hackage.haskell.org/package/compact-map
> which mightybyte used to create:
> https://github.com/mightybyte/compact-ixset
> compact-map converts the Haskell values to a more compact binary representation. But could we do even better by compressing the bytestrings?
> Here is a useless implementation:

You could have a re-implemented HashMap which would un-pack the payload's ByteString constructor into the leaves of the HashMap type itself. Then you would save on both the keys and the values. This assumes you're not using the ordering provided by Data.Map.

Antoine

On Thu, Feb 16, 2012 at 2:03 PM, Antoine Latter wrote:

> You could have a re-implemented HashMap which would un-pack the payload's ByteString constructor into the leaves of the HashMap type itself.
> Then you would save on both the keys and the values.

Note that ByteString has a high per-value overhead due to the internal use of ForeignPtrs and indices that track offset/size: http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.ht...

-- Johan

On Thu, Feb 16, 2012 at 4:29 PM, Johan Tibell wrote:

> On Thu, Feb 16, 2012 at 2:03 PM, Antoine Latter wrote:
>> You could have a re-implemented HashMap which would un-pack the payload's ByteString constructor into the leaves of the HashMap type itself.
>> Then you would save on both the keys and the values.

Hah hah - forget what I said. A HashMap may be smaller in terms of the structure of the tree it builds, but it can't be smaller in the size of its keys, since the hash is lossy :-/

> Note that ByteString has a high per-value overhead due to the internal use of ForeignPtrs and indices that track offset/size: http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.ht...
> -- Johan

You could store just the raw GHC ByteArray, which drops the indices. You could then promote it to a ByteString to call into standard deserialization libraries. I don't know at what point we descend too far into voodoo-magic.

Antoine
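For what it's worth, essentially this idea later landed in the bytestring package itself as Data.ByteString.Short (bytestring >= 0.10.4): a ShortByteString wraps a plain ByteArray# with no ForeignPtr and no offset/length fields, so it has much lower per-value heap overhead, and it can be promoted to an ordinary ByteString (by copying) when a standard library needs one. A small sketch:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Short as BS

-- Store the bytes as a ShortByteString: a raw ByteArray# with no
-- ForeignPtr and no offset/size indices. (These are the bytes of
-- the ASCII string "haskell".)
stored :: BS.ShortByteString
stored = BS.toShort (B.pack [104, 97, 115, 107, 101, 108, 108])

-- Promote back to an ordinary ByteString (this copies) whenever a
-- standard deserialization library needs one.
restored :: B.ByteString
restored = BS.fromShort stored
```

The trade-off is exactly the one discussed above: you lose zero-copy slicing, but for a map holding many small byte values the saved words per entry can dominate.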

On Thu, Feb 16, 2012 at 1:51 PM, Jeremy Shaw wrote:

> Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet.
> It seems like we could trade off some speed for space savings by compressing the values in RAM.

Not knowing the actual data you want to store or the usage pattern, I cannot quite say if these suggestions will be of use:

* If the data is used in a read-only fashion (e.g. as a big lookup table), consider using an SSTable (http://en.wikipedia.org/wiki/SSTable). The wiki page doesn't contain a lot of documentation, but I think you can find one implemented in Cassandra. An SSTable is a sorted table of byte keys and byte values. It's highly memory efficient, as the keys can be prefix encoded (since they are stored sorted) and whole blocks of the table can be compressed using e.g. Zippy.

* If you need mutability, you could consider using LevelDB.

* If you need a pure data structure, consider using a type-specialized version of the HAMT data structure which I use in the experimental hamt branch of unordered-containers (https://github.com/tibbe/unordered-containers). The HAMT has quite low overhead as is, and if you specialize the key and value types (at the cost of lots of code duplication) you can decrease the per key/value overhead by another 4 words per such pair.

-- Johan
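The prefix-encoding trick mentioned above can be sketched in a few lines of Haskell; `frontEncode` and `frontDecode` are hypothetical names, and a real SSTable would additionally restart the encoding at block boundaries and compress whole blocks:

```haskell
import qualified Data.ByteString.Char8 as B

-- Front-code a sorted list of keys: each key is stored as the length
-- of the prefix it shares with its predecessor, plus its own suffix.
frontEncode :: [B.ByteString] -> [(Int, B.ByteString)]
frontEncode keys = zipWith step (B.empty : keys) keys
  where
    step prev k = let n = sharedPrefix prev k in (n, B.drop n k)
    sharedPrefix a b = length (takeWhile id (B.zipWith (==) a b))

-- Recover the original keys from the front-coded form.
frontDecode :: [(Int, B.ByteString)] -> [B.ByteString]
frontDecode = go B.empty
  where
    go _ [] = []
    go prev ((n, suffix) : rest) =
      let k = B.take n prev `B.append` suffix
      in k : go k rest

-- For example:
--   frontEncode ["app", "apple", "apply"]
--     == [(0, "app"), (3, "le"), (4, "y")]
```

Sorted keys with long shared prefixes (URLs, file paths, namespaced identifiers) compress especially well under this scheme.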
participants (3):
- Antoine Latter
- Jeremy Shaw
- Johan Tibell