
Andrew Coppin wrote:
Dave Bayer wrote:
I was beginning to accept that I might die before clearing my pipeline of research projects I want to code up. Haskell has given me new hope.
Indeed. ;-)
Today I have implemented encoders and decoders for RLE, MTF, Fibonacci codes, and LZW. Next on my list are BWT, Huffman codes, and arithmetic coding. (That last one is *very* hard, though...)
In case anybody cares...
darcs get http://www.orphi.me.uk/darcs/ToyCompression
ghc -O2 --make Encode
ghc -O2 --make Decode
You can now do
Encode LZW foo
This will look for a file named "foo" and generate a file called "foo-LZW" containing the LZW-compressed version of it, plus "foo-LZW.log.txt", which contains some statistics.
Similarly,
Decode LZW foo-LZW
will make a file "foo-LZW-unLZW", which should *hopefully* be identical to the original one. (!)
It didn't occur to me, but I should probably go add some way to discover a list of available algorithms! LOL. Anyway, currently working:
- RLE (Run Length Encoding). Counts runs of symbols, treating 8 bits as one "symbol". The maximum run length is 255. (Sketch below.)
- MTF (Move To Front). Takes an input file and produces an output file of exactly the same size, but containing mostly *low* numbers. Works well with RLE or Fib. (Sketch below.)
- BWT (Burrows-Wheeler Transform). Like MTF, does no actual compression, but makes the file more "compressible".
- Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead of unsigned binary. This makes low numbers smaller and high numbers bigger. (Use it with MTF...) (Sketch below.)
- LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings and compresses them.
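To give a flavour of the encoders, here is a minimal RLE sketch over [Word8]. It is not the repo's actual code, but the 255 cap matches the description above:

    import Data.Word (Word8)
    import Data.List (group)

    -- Encode runs as (count, symbol) pairs; counts are capped at 255
    -- so that each count fits in a single byte.
    rleEncode :: [Word8] -> [(Word8, Word8)]
    rleEncode = concatMap cap . group
      where
        cap run = [ (fromIntegral (length c), head c) | c <- chunksOf 255 run ]
        chunksOf _ [] = []
        chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

    rleDecode :: [(Word8, Word8)] -> [Word8]
    rleDecode = concatMap (\(n, s) -> replicate (fromIntegral n) s)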
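MTF is similarly tiny. A sketch, assuming the alphabet is all 256 byte values:

    import Data.Word (Word8)
    import Data.List (elemIndex, delete)
    import Data.Maybe (fromJust)

    -- Each symbol is replaced by its position in a table and then moved
    -- to the front, so recently seen symbols get low numbers.
    mtfEncode :: [Word8] -> [Word8]
    mtfEncode = go [0 .. 255]
      where
        go _     []     = []
        go table (x:xs) =
          let i = fromJust (elemIndex x table)
          in fromIntegral i : go (x : delete x table) xs

    mtfDecode :: [Word8] -> [Word8]
    mtfDecode = go [0 .. 255]
      where
        go _     []     = []
        go table (i:xs) =
          let x = table !! fromIntegral i
          in x : go (x : delete x table) xs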
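And Fibonacci codes for positive integers: the Zeckendorf digits, least significant first, terminated by an extra 1 bit (so "11" never occurs inside a codeword and always marks its end). Again, just a sketch of the idea:

    -- Fibonacci numbers used for coding: 1, 2, 3, 5, 8, ...
    fibs :: [Integer]
    fibs = 1 : 2 : zipWith (+) fibs (tail fibs)

    fibEncode :: Integer -> [Bool]
    fibEncode n = reverse (go n (reverse (takeWhile (<= n) fibs))) ++ [True]
      where
        go _ []     = []
        go m (f:fs)
          | f <= m    = True  : go (m - f) fs   -- greedy Zeckendorf digit
          | otherwise = False : go m fs

    -- Decode a single codeword (terminator bit still attached).
    fibDecode :: [Bool] -> Integer
    fibDecode bits = sum [ f | (True, f) <- zip (init bits) fibs ]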
Caution: Encoding or decoding BWT currently requires absurd amounts of time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12 KB of text.) I hope to fix that soon...
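For the curious, the blowup is easy to see in the naive formulation, which materialises and sorts every rotation of the input. (A sketch, not necessarily what the repo does:)

    import Data.List (sort)

    -- Naive BWT: build all n rotations (O(n^2) space with lists!),
    -- sort them, and output the last column plus the index of the
    -- original string among the sorted rotations.
    bwt :: String -> (String, Int)
    bwt s = (map last rows, idx)
      where
        rotate (c:cs) = cs ++ [c]
        rotate []     = []
        rows = sort (take (length s) (iterate rotate s))
        idx  = length (takeWhile (/= s) rows)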
Currently it's unclear whether LZW or BWT+MTF+Fib gives the best compression. (Mainly because BWT is so hard to run!) I hope to implement a few other algorithms soon.
(Note: LZW is typically used with a variable number of bits per output symbol. I haven't done this yet; it simply uses 16 bits per code in all cases. Once I fix this, compression will improve. Also, once the symbol dictionary is full, the encoder just resets itself.)
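Concretely, the fixed-width-plus-reset behaviour described above looks something like this (a sketch using Data.Map, not the repo's actual code):

    import qualified Data.Map as M
    import Data.Word (Word8, Word16)

    -- The dictionary starts with all single bytes, codes 0..255.
    initialDict :: M.Map [Word8] Word16
    initialDict = M.fromList [ ([w], fromIntegral w) | w <- [0 .. 255 :: Word8] ]

    lzwEncode :: [Word8] -> [Word16]
    lzwEncode = go initialDict []
      where
        go _    []     []     = []
        go dict prefix []     = [dict M.! prefix]
        go dict prefix (x:xs)
          | M.member (prefix ++ [x]) dict = go dict (prefix ++ [x]) xs
          | otherwise =
              let code = dict M.! prefix
                  -- every output code is 16 bits; reset when full
                  dict' | M.size dict == 65536 = initialDict
                        | otherwise = M.insert (prefix ++ [x])
                                        (fromIntegral (M.size dict)) dict
              in code : go dict' [x] xs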
Good work. Probably worth benchmarking against the other compression libraries on Hackage, just to get a sense of how well your implementations are optimised (and maybe how best to interact with lazy ByteStrings?). See for example:

zlib: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.3
bzlib: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.3

and also:

lzf: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Codec-Compression...

-- Don
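For a quick size baseline along those lines, the zlib package exposes a one-function lazy ByteString API, Codec.Compression.Zlib.compress. A minimal comparison, with "foo" as a placeholder file name:

    import qualified Data.ByteString.Lazy as L
    import qualified Codec.Compression.Zlib as Z

    -- Print the original vs zlib-compressed size of a file.
    main :: IO ()
    main = do
      input <- L.readFile "foo"
      putStrLn $ "original: " ++ show (L.length input) ++ " bytes"
      putStrLn $ "zlib:     " ++ show (L.length (Z.compress input)) ++ " bytes"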