
Andrew Coppin wrote:
Dave Bayer wrote:
I was beginning to accept that I might die before clearing my pipeline of research projects I want to code up. Haskell has given me new hope.
Indeed. ;-)
Today I have implemented encoders and decoders for RLE, MTF, Fibonacci codes, and LZW. Next on my list are BWT, Huffman codes, and arithmetic coding. (That last one is *very* hard, though...)
In case anybody cares...
darcs get http://www.orphi.me.uk/darcs/ToyCompression
ghc -O2 --make Encode
ghc -O2 --make Decode
You can now do
Encode LZW foo
This will look for a file named "foo" and generate a file called "foo-LZW" containing the LZW-compressed version of it, plus "foo-LZW.log.txt", which contains some statistics.
Similarly,
Decode LZW foo-LZW
will make a file "foo-LZW-unLZW", which should *hopefully* be identical to the original one. (!)
It didn't occur to me, but I should probably go add some way to discover a list of available algorithms! LOL. Anyway, currently working:
- RLE (Run Length Encoding). Counts runs of symbols, treating 8 bits as one "symbol". The maximum run length is 255. (Sketch below.)
- MTF (Move To Front). Takes an input file and produces an output file of exactly the same size, but containing mostly *low* numbers. Works well with RLE or Fib. (Sketch below.)
- BWT (Burrows-Wheeler Transform). Like MTF, does no actual compression, but makes the file more "compressible".
- Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead of unsigned binary. This makes low numbers smaller and high numbers bigger. (Use it with MTF...) (Sketch below.)
- LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings and compresses them.
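To give a flavour of the encoders, here is a minimal RLE sketch over [Word8]. It is not the repo's actual code, but the 255 cap matches the description above:

    import Data.Word (Word8)
    import Data.List (group)

    -- Encode runs as (count, symbol) pairs; counts are capped at 255
    -- so that each count fits in a single byte.
    rleEncode :: [Word8] -> [(Word8, Word8)]
    rleEncode = concatMap cap . group
      where
        cap run = [ (fromIntegral (length c), head c) | c <- chunksOf 255 run ]
        chunksOf _ [] = []
        chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

    rleDecode :: [(Word8, Word8)] -> [Word8]
    rleDecode = concatMap (\(n, s) -> replicate (fromIntegral n) s)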
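MTF is similarly tiny. A sketch, assuming the alphabet is all 256 byte values:

    import Data.Word (Word8)
    import Data.List (elemIndex, delete)
    import Data.Maybe (fromJust)

    -- Each symbol is replaced by its position in a table and then moved
    -- to the front, so recently seen symbols get low numbers.
    mtfEncode :: [Word8] -> [Word8]
    mtfEncode = go [0 .. 255]
      where
        go _     []     = []
        go table (x:xs) =
          let i = fromJust (elemIndex x table)
          in fromIntegral i : go (x : delete x table) xs

    mtfDecode :: [Word8] -> [Word8]
    mtfDecode = go [0 .. 255]
      where
        go _     []     = []
        go table (i:xs) =
          let x = table !! fromIntegral i
          in x : go (x : delete x table) xs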
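And Fibonacci codes for positive integers: the Zeckendorf digits, least significant first, terminated by an extra 1 bit (so "11" never occurs inside a codeword and always marks its end). Again, just a sketch of the idea:

    -- Fibonacci numbers used for coding: 1, 2, 3, 5, 8, ...
    fibs :: [Integer]
    fibs = 1 : 2 : zipWith (+) fibs (tail fibs)

    fibEncode :: Integer -> [Bool]
    fibEncode n = reverse (go n (reverse (takeWhile (<= n) fibs))) ++ [True]
      where
        go _ []     = []
        go m (f:fs)
          | f <= m    = True  : go (m - f) fs   -- greedy Zeckendorf digit
          | otherwise = False : go m fs

    -- Decode a single codeword (terminator bit still attached).
    fibDecode :: [Bool] -> Integer
    fibDecode bits = sum [ f | (True, f) <- zip (init bits) fibs ]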
Caution: Encoding or decoding BWT currently requires absurd amounts of time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12 KB of text.) I hope to fix that soon...
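For the curious, the blowup is easy to see in the naive formulation, which materialises and sorts every rotation of the input. (A sketch, not necessarily what the repo does:)

    import Data.List (sort)

    -- Naive BWT: build all n rotations (O(n^2) space with lists!),
    -- sort them, and output the last column plus the index of the
    -- original string among the sorted rotations.
    bwt :: String -> (String, Int)
    bwt s = (map last rows, idx)
      where
        rotate (c:cs) = cs ++ [c]
        rotate []     = []
        rows = sort (take (length s) (iterate rotate s))
        idx  = length (takeWhile (/= s) rows)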
Currently it's unclear whether LZW or BWT+MTF+Fib gives the best compression. (Mainly because BWT is so hard to run!) I hope to implement a few other algorithms soon.
(Note: LZW is typically used with a variable number of bits per output symbol. I haven't done this yet; it simply uses 16 bits per code in all cases. Once I fix this, compression will improve. Also, once the symbol dictionary is full, the encoder just resets itself.)
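Concretely, the fixed-width-plus-reset behaviour described above looks something like this (a sketch using Data.Map, not the repo's actual code):

    import qualified Data.Map as M
    import Data.Word (Word8, Word16)

    -- The dictionary starts with all single bytes, codes 0..255.
    initialDict :: M.Map [Word8] Word16
    initialDict = M.fromList [ ([w], fromIntegral w) | w <- [0 .. 255 :: Word8] ]

    lzwEncode :: [Word8] -> [Word16]
    lzwEncode = go initialDict []
      where
        go _    []     []     = []
        go dict prefix []     = [dict M.! prefix]
        go dict prefix (x:xs)
          | M.member (prefix ++ [x]) dict = go dict (prefix ++ [x]) xs
          | otherwise =
              let code = dict M.! prefix
                  -- every output code is 16 bits; reset when full
                  dict' | M.size dict == 65536 = initialDict
                        | otherwise = M.insert (prefix ++ [x])
                                        (fromIntegral (M.size dict)) dict
              in code : go dict' [x] xs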
Good work. Probably worth benchmarking against the other compression libraries on Hackage, just to get a sense of how well your implementations are optimised (and maybe how best to interact with lazy ByteStrings?). See for example:

zlib: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.3
bzlib: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.3

and also:

lzf: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Codec-Compression...

-- Don
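For a quick size baseline along those lines, the zlib package exposes a one-function lazy ByteString API, Codec.Compression.Zlib.compress. A minimal comparison, with "foo" as a placeholder file name:

    import qualified Data.ByteString.Lazy as L
    import qualified Codec.Compression.Zlib as Z

    -- Print the original vs zlib-compressed size of a file.
    main :: IO ()
    main = do
      input <- L.readFile "foo"
      putStrLn $ "original: " ++ show (L.length input) ++ " bytes"
      putStrLn $ "zlib:     " ++ show (L.length (Z.compress input)) ++ " bytes"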