
Bulat Ziganshin wrote:
Hello Andrew,
Sunday, July 8, 2007, 3:10:04 PM, you wrote:
Indeed. I'm more interested in which algorithms produce the best compression ratios than how to implement them fast.
the algorithms you have tried so far (except for bwt+mtf) are the standard student's set :) of those, lzw is the most advanced, but even that algorithm is 20 years old
Yeah, well, naturally I'm implementing the simple ones first. ;-) Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
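Just to show how little machinery it takes, here's a from-scratch sketch (purely illustrative - not the code from my actual compressor, and the names are made up): a naive LZW encoder over byte lists, with the dictionary held in a Data.Map keyed on the phrase seen so far.

import qualified Data.Map as M
import Data.Word (Word8)

-- Naive LZW over a byte list: the dictionary maps phrases (byte strings) to
-- code numbers, pre-seeded with every single byte, and it grows by one entry
-- each time a code is emitted.
lzwEncode :: [Word8] -> [Int]
lzwEncode = go initial 256 []
  where
    initial = M.fromList [ ([b], fromIntegral b) | b <- [0 .. 255] ]

    go _    _    []     []   = []
    go dict _    phrase []   = [dict M.! phrase]
    go dict next phrase (x : xs)
      | candidate `M.member` dict = go dict next candidate xs
      | otherwise = (dict M.! phrase)
                  : go (M.insert candidate next dict) (next + 1) [x] xs
      where
        candidate = phrase ++ [x]

Feed it the bytes of a file and you get back the code stream that the bit-packing stage then has to serialise.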
Currently the code uses very general, flexible, polymorphic types all over the place, everything is normal lists, I'm using monadic parsing rather than bit-twiddling, etc etc etc. It goes alright with 100 KB files on my
testing today's algorithms on such small datasets doesn't make much sense, because they need much more input to gather enough statistics about the data being compressed. i prefer tests of hundreds of megabytes and don't know anyone who uses test files smaller than 1 mb
OTOH, I'd like it to finish this month... (LZW is faster now that I'm using Data.Map. But even so, 15 minutes to compress 640 KB of zeros? That's pretty slow!)
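Incidentally, I suspect a fair chunk of those 15 minutes goes on the dictionary keys themselves: if the table never resets, 640 KB of zeros pushes the phrases past a thousand bytes each, and - assuming the map is keyed on phrases-as-lists, like the naive sketch above - every lookup ends up comparing those long lists. The usual dodge (again just a sketch with made-up names, not what I've actually written) is to key the map on (code of the current phrase, next byte), so a key is never more than an Int and a Word8:

import qualified Data.Map as M
import Data.Word (Word8)

-- Same algorithm, but the dictionary is keyed on (code of the phrase matched
-- so far, next byte), so a key stays small no matter how long the phrase
-- gets. Single bytes are implicitly codes 0..255.
lzwEncode' :: [Word8] -> [Int]
lzwEncode' []       = []
lzwEncode' (b : bs) = go M.empty 256 (fromIntegral b) bs
  where
    go _    _    cur []       = [cur]
    go dict next cur (x : xs) =
      case M.lookup (cur, x) dict of
        Just code -> go dict next code xs
        Nothing   -> cur : go (M.insert (cur, x) next dict)
                              (next + 1) (fromIntegral x) xs

The emitted codes should come out the same either way; only the key representation changes.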
btw, in most cases a C algorithm compiled with today's C compilers will run faster than an asm one - because you need very good knowledge of CPU internals to write an efficient program by hand. i dream about the day when the same becomes true for Haskell
We all do... *sigh* (Or rather, most of us dream. A tiny few people seem to be trying to actually do stuff about it...)
i've never heard of lzw+huffman compressors :) there are two lz families - lz78 (lzw), which was popular in the 80's and used variable-size words, and lz77 (lzss), which together with huffman was the most popular in the 90's. compress (.Z) and pkzip 1 used lzw, while zip is lzh (i.e. lz77+huffman)
there is not much point in using huffman with lzw, because many words will have only one or two occurrences
Hmm... that's interesting. I was *sure* it said that LHA used LZW+Huffman... oh well!

You say there would be "no meaning" in using a Huffman code to encode the output from an LZW stage. But consider this: the standard approach is for the LZW encoder to spit out 9 bits/symbol initially, then 10 bits/symbol once the table fills up a bit more, then 11 bits/symbol after that, and so on. (Usually until some hard-coded limit is reached, at which point the encoder typically just resets itself.) So we're allocating more and more bits per symbol based only on how full the table is, *regardless* of whether any of these new symbols are ever actually *used*. ;-)

Now, by using a Huffman code (scanning the actual produced output rather than working adaptively) we can avoid assigning bit combinations to unused symbols. (The downside, of course, is that we now need a Huffman table in the output - and any rare symbols might end up with rather long codewords. But remember: Huffman *guarantees* 0% compression or better on the payload itself. A Huffman-compressed payload can *never* be bigger than the fixed-width encoding, only smaller or the same size. So as long as you can encode the Huffman table efficiently, you should be fine...)
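To put a number on that, here's the sort of back-of-the-envelope comparison I have in mind (a throwaway sketch of my own, invented names and all): total up the payload bits the growing-width scheme spends, then total up what an optimal prefix code built from the codes that actually occur would spend.

import qualified Data.Map as M
import Data.List (insert, sort)

-- Bits spent by the usual "widen the code when the table grows" scheme.
-- Approximate: real implementations differ on exactly when the width bumps.
growingWidthBits :: [Int] -> Int
growingWidthBits codes = sum [ bitsFor size | (size, _) <- zip [257 ..] codes ]
  where
    bitsFor n = length (takeWhile (< n) (iterate (* 2) 1))   -- ceil(log2 n)

-- Payload bits under an optimal prefix (Huffman) code built only from the
-- codes that actually occur: sum the merged weights while repeatedly
-- combining the two lightest subtrees. (The degenerate one-symbol case comes
-- out as 0 here; a real coder would spend 1 bit per symbol.)
huffmanBits :: [Int] -> Int
huffmanBits codes = merge (sort (M.elems freq))
  where
    freq = M.fromListWith (+) [ (c, 1 :: Int) | c <- codes ]
    merge (w1 : w2 : rest) = (w1 + w2) + merge (insert (w1 + w2) rest)
    merge _                = 0

Run both over the code stream from a real file; the gap, minus whatever it costs to store the Huffman table, is the saving I'm talking about.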
btw, i invite you to http://encode.ru/forums where you will find many compressor authors. you will be the second person there who uses haskell, and the first who writes compression algorithms in it :)
.ru = Russia?
OTOH, if the Gods behind GHC want to take a look and figure out whether there's any magic that can be added to the optimiser, I wouldn't complain... ;-)
(Realistically though. My program takes a [Word8] and turns it into a [Bool] before running a parser over it. The GHC optimiser doesn't really stand a hope in hell of optimising that into a program that reads a machine word into a CPU register and starts playing with bit flips on it...)
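Concretely, the conversion I mean is essentially this (an illustration, not my actual parser code):

import Data.Bits (testBit)
import Data.Word (Word8)

-- Explode each byte into eight booleans, most significant bit first.
-- Every one of those Bools is a separate heap cell for the parser to
-- pattern-match its way through.
toBits :: [Word8] -> [Bool]
toBits = concatMap (\w -> [ testBit w i | i <- [7, 6 .. 0] ])

That's eight heap-allocated Bool cells per byte before the parser even starts.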
as time goes by, those terrible compilers become smarter and smarter :)
Oh hey, I think GHC is already pretty smart. But no optimiser can ever hope to cover *every* possible case. And transforming [Bool] -> [Bool] into UArray Word8 -> UArray Word8 just seems a liiiiittle bit optimistic, methinks. ;-)
PPS. Does GHC make use of MMX, SSE, et al?
it can do that with via-c compilation, and there are plans for the native code generator too.
The way I heard it is that it *does* native code generation, but it's "not as good" as the via-C version. (Can't imagine why... You would have thought directly implementing functional primitives would be much easier than trying to mangle them to fit C's multitude of limitations. Still, what do I know?)
but it seems that you don't know asm and believe the advertising hype. i'm pretty sure that mmx can't help in the algorithms you are implementing, and you are very far from the level of performance where using mmx would give any improvement
MMX is probably useless. Some of the streaming stuff (with all the CPU cache twiddling and so on) might be useful for list processing. Frankly, it *all* seems so low-level that I can't begin to imagine how you could ever take advantage of this stuff without writing raw assembly by hand... (I read the IA32 manual one time. It was a long time ago though, and I don't remember lots of detail. As for whether I "know assembly"... I've programmed the 6502 and the M68000, although not extensively. My point is just that I know my way round the inside of a von Neumann computer, that's all... ;-)