
Bulat Ziganshin wrote:
Hello Andrew,
Sunday, July 8, 2007, 3:10:04 PM, you wrote:
Indeed. I'm more interested in which algorithms produce the best compression ratios than how to implement them fast.
the algorithms you have tried so far (except for bwt+mtf) are the standard student's set :) of those, lzw is the most advanced, but even that algorithm is 20 years old
Yeah, well, naturally I'm implementing the simple ones first. ;-) Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
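Just to show how little machinery it takes, here's a from-scratch sketch (purely illustrative - not the code from my actual compressor, and the names are made up): a naive LZW encoder over byte lists, with the dictionary held in a Data.Map keyed on the phrase seen so far.

import qualified Data.Map as M
import Data.Word (Word8)

-- Naive LZW over a byte list: the dictionary maps phrases (byte strings) to
-- code numbers, pre-seeded with every single byte, and it grows by one entry
-- each time a code is emitted.
lzwEncode :: [Word8] -> [Int]
lzwEncode = go initial 256 []
  where
    initial = M.fromList [ ([b], fromIntegral b) | b <- [0 .. 255] ]

    go _    _    []     []   = []
    go dict _    phrase []   = [dict M.! phrase]
    go dict next phrase (x : xs)
      | candidate `M.member` dict = go dict next candidate xs
      | otherwise = (dict M.! phrase)
                  : go (M.insert candidate next dict) (next + 1) [x] xs
      where
        candidate = phrase ++ [x]

Feed it the bytes of a file and you get back the code stream that the bit-packing stage then has to serialise.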
Currently the code uses very general, flexible, polymorphic types all over the place, everything is normal lists, I'm using monadic parsing rather than bit-twiddling, etc etc etc. It goes alright with 100 KB files on my
testing today's algorithms on such small datasets doesn't make much sense, because they need much more input to gather enough statistics about the data being compressed. i prefer tests of hundreds of megabytes and don't know anyone who uses test files smaller than 1 mb
OTOH, I'd like it to finish this month... (LZW is faster now that I'm using Data.Map. But even so, 15 minutes to compress 640 KB of zeros? That's pretty slow!)
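Incidentally, I suspect a fair chunk of those 15 minutes goes on the dictionary keys themselves: if the table never resets, 640 KB of zeros pushes the phrases past a thousand bytes each, and - assuming the map is keyed on phrases-as-lists, like the naive sketch above - every lookup ends up comparing those long lists. The usual dodge (again just a sketch with made-up names, not what I've actually written) is to key the map on (code of the current phrase, next byte), so a key is never more than an Int and a Word8:

import qualified Data.Map as M
import Data.Word (Word8)

-- Same algorithm, but the dictionary is keyed on (code of the phrase matched
-- so far, next byte), so a key stays small no matter how long the phrase
-- gets. Single bytes are implicitly codes 0..255.
lzwEncode' :: [Word8] -> [Int]
lzwEncode' []       = []
lzwEncode' (b : bs) = go M.empty 256 (fromIntegral b) bs
  where
    go _    _    cur []       = [cur]
    go dict next cur (x : xs) =
      case M.lookup (cur, x) dict of
        Just code -> go dict next code xs
        Nothing   -> cur : go (M.insert (cur, x) next dict)
                              (next + 1) (fromIntegral x) xs

The emitted codes should come out the same either way; only the key representation changes.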
btw, in most cases a C algorithm compiled with today's C compilers will run faster than an asm one - because you need very good knowledge of CPU internals to write an efficient program by hand. i dream about the day when the same becomes true for Haskell
We all do... *sigh* (Or rather, most of us dream. A tiny few people seem to be trying to actually do stuff about it...)
i've never heard of lzw+huffman compressors :) there are two lz families - lz78 (lzw), which was popular in the 80's and used variable-size words, and lz77 (lzss), which together with huffman was the most popular in the 90's. compress (.Z) and pkzip 1 used lzw, while zip is lzh (i.e. lz77+huffman)
there is not much point in using huffman with lzw, because many words will have only one or two occurrences
Hmm... that's interesting. I was *sure* it said that LHA used LZW+Huffman... oh well!

You say there would be "no meaning" in using a Huffman code to encode the output from an LZW stage. But consider this: the standard approach is for the LZW encoder to spit out 9 bits/symbol initially, then 10 bits/symbol once the table fills up a bit more, then 11 bits/symbol after that, and so on. (Usually until some hard-coded limit is reached, at which point the encoder typically just resets itself.) So we're allocating more and more bits per symbol based only on how full the table is, *regardless* of whether any of these new symbols are ever actually *used*. ;-)

Now, by using a Huffman code (scanning the actual produced output rather than working adaptively) we can avoid assigning bit combinations to unused symbols. (The downside, of course, is that we now need a Huffman table in the output - and any rare symbols might end up with rather long codewords. But remember: Huffman *guarantees* 0% compression or better on the payload itself. A Huffman-compressed payload can *never* be bigger than the fixed-width encoding, only smaller or the same size. So as long as you can encode the Huffman table efficiently, you should be fine...)
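To put a number on that, here's the sort of back-of-the-envelope comparison I have in mind (a throwaway sketch of my own, invented names and all): total up the payload bits the growing-width scheme spends, then total up what an optimal prefix code built from the codes that actually occur would spend.

import qualified Data.Map as M
import Data.List (insert, sort)

-- Bits spent by the usual "widen the code when the table grows" scheme.
-- Approximate: real implementations differ on exactly when the width bumps.
growingWidthBits :: [Int] -> Int
growingWidthBits codes = sum [ bitsFor size | (size, _) <- zip [257 ..] codes ]
  where
    bitsFor n = length (takeWhile (< n) (iterate (* 2) 1))   -- ceil(log2 n)

-- Payload bits under an optimal prefix (Huffman) code built only from the
-- codes that actually occur: sum the merged weights while repeatedly
-- combining the two lightest subtrees. (The degenerate one-symbol case comes
-- out as 0 here; a real coder would spend 1 bit per symbol.)
huffmanBits :: [Int] -> Int
huffmanBits codes = merge (sort (M.elems freq))
  where
    freq = M.fromListWith (+) [ (c, 1 :: Int) | c <- codes ]
    merge (w1 : w2 : rest) = (w1 + w2) + merge (insert (w1 + w2) rest)
    merge _                = 0

Run both over the code stream from a real file; the gap, minus whatever it costs to store the Huffman table, is the saving I'm talking about.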
btw, i invite you to http://encode.ru/forums where you will find many compressor authors. you will be the second person there who uses haskell, and the first who writes compression algorithms in it :)
.ru = Russia?
OTOH, if the Gods behind GHC want to take a look and figure out whether there's any magic that can be added to the optimiser, I wouldn't complain... ;-)
(Realistically though. My program takes a [Word8] and turns it into a [Bool] before running a parser over it. The GHC optimiser doesn't really stand a hope in hell of optimising that into a program that reads a machine word into a CPU register and starts playing with bit flips on it...)
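Concretely, the conversion I mean is essentially this (an illustration, not my actual parser code):

import Data.Bits (testBit)
import Data.Word (Word8)

-- Explode each byte into eight booleans, most significant bit first.
-- Every one of those Bools is a separate heap cell for the parser to
-- pattern-match its way through.
toBits :: [Word8] -> [Bool]
toBits = concatMap (\w -> [ testBit w i | i <- [7, 6 .. 0] ])

That's eight heap-allocated Bool cells per byte before the parser even starts.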
as time goes by, those terrible compilers become smarter and smarter :)
Oh hey, I think GHC is already pretty smart. But no optimiser can ever hope to cover *every* possible case. And transforming [Bool] -> [Bool] into UArray Word8 -> UArray Word8 just seems a liiiiittle bit optimistic, methinks. ;-)
PPS. Does GHC make use of MMX, SSE, et al?
it can do that with via-c compilation, and there are plans for the native code generator too.
The way I heard it is that it *does* native code generation, but it's "not as good" as the via-C version. (Can't imagine why... You would have thought directly implementing functional primitives would be much easier than trying to mangle them to fit C's multitude of limitations. Still, what do I know?)
but it seems that you don't know asm and believe the advertising hype. i'm pretty sure that mmx can't help in the algorithms you are implementing, and you are very far from the level of performance where using mmx would give any improvement
MMX is probably useless. Some of the streaming stuff (with all the CPU cache twiddling and so on) might be useful for list processing. Frankly, it *all* seems so low-level that I can't begin to imagine how you could ever take advantage of this stuff without writing raw assembly by hand... (I read the IA32 manual one time. It was a long time ago though, and I don't remember lots of detail. As for whether I "know assembly"... I've programmed the 6502 and the M68000, although not extensively. My point is just that I know my way round the inside of a von Neumann computer, that's all... ;-)