
Hello Andrew,

Sunday, July 8, 2007, 7:07:59 PM, you wrote:
> Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
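(To make "trivial little algorithm" concrete: a complete LZW encoder fits in about a dozen lines of Haskell. This is only an illustrative sketch - a string-keyed Data.Map dictionary with codes starting at 256, and no attention paid to speed:)

import qualified Data.Map as M

-- Minimal LZW encoder: the dictionary starts with every single symbol
-- (here modelled as Chars 0..255); each step emits the code of the
-- longest known prefix of the remaining input and adds that prefix
-- extended by one more symbol as a fresh dictionary entry.
lzwEncode :: String -> [Int]
lzwEncode = go initDict 256
  where
    initDict = M.fromList [([c], fromEnum c) | c <- map toEnum [0 .. 255]]
    go _    _    []        = []
    go dict next (x:rest0) =
      let (w, rest) = longest dict [x] rest0
          code      = dict M.! w
      in case rest of
           []    -> [code]
           (c:_) -> code : go (M.insert (w ++ [c]) next dict) (next + 1) rest
    -- keep extending the current match while the extension is still known
    longest dict w (c:cs)
      | M.member (w ++ [c]) dict = longest dict (w ++ [c]) cs
    longest _    w rest          = (w, rest)

For example, lzwEncode "abababab" emits five codes ([97,98,256,258,98]) for eight input symbols.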
I don't think PPM is so complex - it's just the probability of a symbol in some context. It's just too slow in a naive implementation.
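(As an illustration of "probability of a symbol in some context", here is a toy order-k frequency model - only a sketch of the core idea, not real PPM, which also updates its counts adaptively and escapes to shorter contexts for unseen symbols. The linear scan inside probability is exactly the kind of naive implementation that is too slow:)

import qualified Data.Map as M

type Context = String                     -- the k symbols preceding the current one
type Model   = M.Map (Context, Char) Int  -- occurrence counts

-- Count how often each symbol follows each length-k context in the text.
buildModel :: Int -> String -> Model
buildModel k text =
  M.fromListWith (+)
    [ ((take k (drop (i - k) text), text !! i), 1)
    | i <- [k .. length text - 1] ]

-- P(symbol | context): the count for this (context, symbol) pair divided
-- by the total count seen for the context.
probability :: Model -> Context -> Char -> Double
probability m ctx c =
  let num   = fromIntegral (M.findWithDefault 0 (ctx, c) m)
      total = fromIntegral (sum [ n | ((ctx', _), n) <- M.toList m, ctx' == ctx ])
  in if total == 0 then 0 else num / total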
There isn't much point in combining Huffman with LZW, because many words will have only one or two occurrences.
> (The downside of course is that now we need a Huffman table in the output - and any rare symbols might end up with rather long codewords. But remember: Huffman *guarantees* 0% compression or higher on the payload itself. A Huffman-compressed payload can *never* be bigger, only smaller or same size. So long as you can encode the Huffman table efficiently, you should be fine...)
The devil is in the details. Just imagine the size of a Huffman table with 64k entries :)  Huffman encoding is inappropriate for LZW output simply because most words will have only a few occurrences, and the savings from encoding them optimally don't justify the price of their entries in the table.
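(A back-of-envelope sketch, with assumed rather than measured numbers: even a compact canonical table that stores only a code length per LZW code costs on the order of 40 KB, while a code used once or twice can save a few bits at best:)

-- All figures below are assumptions for illustration, not measurements.
tableCostBits :: Int
tableCostBits = 65536 * 5            -- ~327680 bits (~40 KB) just for the table

maxSavingPerRareCode :: Int
maxSavingPerRareCode = 2 * 8         -- generously: 8 bits saved on each of 2 uses

breakEvenCodes :: Int
breakEvenCodes = tableCostBits `div` maxSavingPerRareCode   -- = 20480

So you would need the savings of more than twenty thousand such rare codes just to pay for the table.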
> .ru = Russia?
Of course.
> (Realistically though. My program takes a [Word8] and turns it into a [Bool] before running a parser over it. The GHC optimiser doesn't really stand a hope in hell of optimising that into a program that reads a machine word into a CPU register and starts playing with bit flips on it...)
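(The conversion being described presumably looks something like this - a hypothetical sketch, not the actual code:)

import Data.Word (Word8)
import Data.Bits (testBit)

-- Expand every byte into its eight bits, most significant bit first,
-- and hand the resulting [Bool] to a bit-level parser.
toBits :: [Word8] -> [Bool]
toBits = concatMap byteToBits
  where
    byteToBits b = [ testBit b i | i <- [7,6..0] ]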
As time goes by, those terrible compilers become smarter and smarter :)
> Oh hey, I think GHC is already pretty smart. But no optimiser can ever hope to cover *every* possible case. And transforming [Bool] -> [Bool] into UArray Word8 -> UArray Word8 just seems a liiiiittle bit optimistic, methinks. ;-)
15 years ago I wrote a very smart asm program (btw, it was an ARJ unpacker) with hand-made function inlining, loop unrolling, register allocation, CPU recognition and so on. Now most of these tricks are standard for C compilers. Times change, and it's hard to imagine which optimizations will be available 10 years from now.
> The way I heard it is that it *does* native code generation, but it's "not as good" as the via-C version. (Can't imagine why... You would have thought directly implementing functional primitives would be much easier than trying to mangle them to fit C's multitude of limitations. Still, what do I know?)
GHC's native and via-C modes are a choice between the blind and the lame. In native mode, its code generator is comparable to 20-year-old C code generators - see above how much modern C compilers have changed in those years. In via-C mode, it generates unnatural C code which is hard for any C compiler to optimize.

JHC is a very different story - it generates natural C code, so for simple, "low-level" programs its speed is the same as C's. But it lacks GHC's huge base of high-level optimizations. As you may guess, in order to reach C-like speed, a compiler should implement both good high-level and good low-level optimizations. There are (low-priority) plans to provide an LLVM backend for GHC, which may solve this problem. But actually, speed of generated code isn't the number one priority for GHC users.

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com