Compressed Data.Map for more efficient RAM usage?

Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet. It seems like we could trade off some speed for space savings by compressing the values in RAM.

Lemmih has previously created compact-map:

http://hackage.haskell.org/package/compact-map

which mightybyte used to create:

https://github.com/mightybyte/compact-ixset

compact-map converts the Haskell values to a more compact binary representation. But could we do even better by compressing the bytestrings?

Here is a useless implementation:

http://hpaste.org/63836

That version compresses each value by itself. But that is probably going to make RAM usage worse, not better. I think what we really need to do is store a dictionary in CMap that is used to compress all the ByteString values. That way, if the same string appears in 10000 entries, they can all be compressed using the same compression code.

One question is: how much would this benefit us? As an experiment, I tried compressing one checkpoint file for a real-world acid-state app that I am involved with. A checkpoint file contains binary data that is serialized by the SafeCopy library, and it is data that is currently stored in an IxSet in RAM. The uncompressed checkpoint file is 115MB; the compressed checkpoint file is 26MB.

A checkpoint file probably contains more redundancy than an equivalent compressed IxSet, because in addition to the values it also includes version tags and other easily compressed data. But those numbers do look promising. At the very least, they do not rule out the possibility of some big savings.

I do not have time to explore this any more, but I think it could be a fun project for someone to work on. ;) I imagine a prototype with a shared dictionary could be hacked up in a few days. A bit more work would be required to document it, memory profile it, etc. If done well, we should be able to reuse the incremental compression code in Data.Map, Data.Set, Data.IxSet, HiggsSet, and other places.

But just focusing on Data.Map is a good start. I will create a bug in this bug tracker that links back to this message: http://code.google.com/p/happstack/issues/list

- jeremy
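A minimal sketch of the shared-dictionary idea, not a definitive implementation: `CMap`, `cmDict`, and `cmMap` are hypothetical names, and this assumes the zlib package's preset-dictionary support (the `compressDictionary` / `decompressDictionary` fields, available in zlib >= 0.6) plus the binary package for serialization:

```haskell
module CMap (CMap, empty, insert, lookup) where

import Prelude hiding (lookup)
import qualified Codec.Compression.Zlib as Z
import Codec.Compression.Zlib
  ( defaultCompressParams, defaultDecompressParams
  , CompressParams(..), DecompressParams(..) )
import Data.Binary (Binary, encode, decode)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import qualified Data.Map as Map

-- A Map whose values are serialized and then zlib-compressed against
-- one shared preset dictionary, so a string that occurs in 10000
-- entries is cheap in every one of them.
data CMap k v = CMap
  { cmDict :: S.ByteString            -- shared compression dictionary
  , cmMap  :: Map.Map k L.ByteString  -- compressed, serialized values
  }

empty :: S.ByteString -> CMap k v
empty dict = CMap dict Map.empty

insert :: (Ord k, Binary v) => k -> v -> CMap k v -> CMap k v
insert k v (CMap dict m) = CMap dict (Map.insert k bytes m)
  where
    bytes  = Z.compressWith params (encode v)
    params = defaultCompressParams { compressDictionary = Just dict }

lookup :: (Ord k, Binary v) => k -> CMap k v -> Maybe v
lookup k (CMap dict m) =
    decode . Z.decompressWith params <$> Map.lookup k m
  where
    params = defaultDecompressParams { decompressDictionary = Just dict }
```

A real version would also want to build the dictionary from the data itself (e.g. from the most frequent substrings) rather than taking it as a fixed argument.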

On Thu, Feb 16, 2012 at 3:51 PM, Jeremy Shaw wrote:

> Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet.
> It seems like we could trade off some speed for space savings by compressing the values in RAM.
> Lemmih has previously created compact-map:
> http://hackage.haskell.org/package/compact-map
> which mightybyte used to create:
> https://github.com/mightybyte/compact-ixset
> compact-map converts the Haskell values to a more compact binary representation. But could we do even better by compressing the bytestrings?
> Here is a useless implementation:

You could have a re-implemented HashMap which would un-pack the payload's ByteString constructor into the leaves of the HashMap type itself. Then you would save on both the keys and the values. This assumes you're not using the ordering provided by Data.Map.

Antoine

On Thu, Feb 16, 2012 at 2:03 PM, Antoine Latter wrote:

> You could have a re-implemented HashMap which would un-pack the payload's ByteString constructor into the leaves of the HashMap type itself.
> Then you would save on both the keys and the values.

Note that ByteString has a high per-value overhead due to the internal use of ForeignPtrs and indices that track offset/size: http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.ht...

-- Johan

On Thu, Feb 16, 2012 at 4:29 PM, Johan Tibell wrote:

> On Thu, Feb 16, 2012 at 2:03 PM, Antoine Latter wrote:
>> You could have a re-implemented HashMap which would un-pack the payload's ByteString constructor into the leaves of the HashMap type itself.
>> Then you would save on both the keys and the values.

Hah hah - forget what I said. A HashMap may be smaller in terms of the structure of the tree it builds, but it can't be smaller in the size of its keys, since the hash is lossy :-/

> Note that ByteString has a high per-value overhead due to the internal use of ForeignPtrs and indices that track offset/size: http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.ht...
> -- Johan

You could store just the raw GHC ByteArray, which drops the indices. You could then promote it to a ByteString to call into standard deserialization libraries. I don't know at what point we descend too far into voodoo-magic.

Antoine
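For what it's worth, essentially this idea later landed in the bytestring package itself as Data.ByteString.Short (bytestring >= 0.10.4): a ShortByteString wraps a plain ByteArray# with no ForeignPtr and no offset/length fields, so it has much lower per-value heap overhead, and it can be promoted to an ordinary ByteString (by copying) when a standard library needs one. A small sketch:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Short as BS

-- Store the bytes as a ShortByteString: a raw ByteArray# with no
-- ForeignPtr and no offset/size indices. (These are the bytes of
-- the ASCII string "haskell".)
stored :: BS.ShortByteString
stored = BS.toShort (B.pack [104, 97, 115, 107, 101, 108, 108])

-- Promote back to an ordinary ByteString (this copies) whenever a
-- standard deserialization library needs one.
restored :: B.ByteString
restored = BS.fromShort stored
```

The trade-off is exactly the one discussed above: you lose zero-copy slicing, but for a map holding many small byte values the saved words per entry can dominate.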

On Thu, Feb 16, 2012 at 1:51 PM, Jeremy Shaw wrote:

> Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet.
> It seems like we could trade off some speed for space savings by compressing the values in RAM.

Not knowing the actual data you want to store or the usage pattern, I cannot quite say if these suggestions will be of use:

* If the data is used in a read-only fashion (e.g. as a big lookup table), consider using an SSTable (http://en.wikipedia.org/wiki/SSTable). The wiki page doesn't contain a lot of documentation, but I think you can find one implemented in Cassandra. An SSTable is a sorted table of byte keys and byte values. It's highly memory efficient, as the keys can be prefix encoded (since they are stored sorted) and whole blocks of the table can be compressed using e.g. Zippy.

* If you need mutability, you could consider using LevelDB.

* If you need a pure data structure, consider using a type-specialized version of the HAMT data structure which I use in the experimental hamt branch of unordered-containers (https://github.com/tibbe/unordered-containers). The HAMT has quite low overhead as is, and if you specialize the key and value types (at the cost of lots of code duplication) you can decrease the per key/value overhead by another 4 words per such pair.

-- Johan
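The prefix-encoding trick mentioned above can be sketched in a few lines of Haskell; `frontEncode` and `frontDecode` are hypothetical names, and a real SSTable would additionally restart the encoding at block boundaries and compress whole blocks:

```haskell
import qualified Data.ByteString.Char8 as B

-- Front-code a sorted list of keys: each key is stored as the length
-- of the prefix it shares with its predecessor, plus its own suffix.
frontEncode :: [B.ByteString] -> [(Int, B.ByteString)]
frontEncode keys = zipWith step (B.empty : keys) keys
  where
    step prev k = let n = sharedPrefix prev k in (n, B.drop n k)
    sharedPrefix a b = length (takeWhile id (B.zipWith (==) a b))

-- Recover the original keys from the front-coded form.
frontDecode :: [(Int, B.ByteString)] -> [B.ByteString]
frontDecode = go B.empty
  where
    go _ [] = []
    go prev ((n, suffix) : rest) =
      let k = B.take n prev `B.append` suffix
      in k : go k rest

-- For example:
--   frontEncode ["app", "apple", "apply"]
--     == [(0, "app"), (3, "le"), (4, "y")]
```

Sorted keys with long shared prefixes (URLs, file paths, namespaced identifiers) compress especially well under this scheme.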
participants (3):
- Antoine Latter
- Jeremy Shaw
- Johan Tibell