Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Bulat Ziganshin wrote:
Hello Andrew,
Sunday, July 8, 2007, 3:10:04 PM, you wrote:
Indeed. I'm more interested in which algorithms produce the best compression ratios than how to implement them fast.
the algorithms you have tried so far (except for bwt+mtf) are the standard student's set :) of those, lzw is the most advanced, but even that algorithm is 20 years old
Yeah, well, naturally I'm implementing the simple ones first. ;-) Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
Currently the code uses very general, flexible, polymorphic types all over the place, everything is normal lists, I'm using monadic parsing rather than bit-twiddling, etc etc etc. It goes alright with 100 KB files on my machine.
testing today's algorithms on such small datasets doesn't make much sense because they need much more input to gather enough statistics about the data being compressed. i prefer hundreds-of-megabytes tests and don't know anyone who uses test files smaller than 1 mb
OTOH, I'd like it to finish this month... (LZW is faster now I'm using Data.Map. But even so, 15 minutes to compress 640 KB of zeros? That's pretty slow!)
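For concreteness, here is a minimal sketch of an LZW encoder over [Word8], using Data.Map for the dictionary as mentioned above - an illustrative toy written for this digest, not the program under discussion, and it ignores the bit-packing of the output codes:
    import qualified Data.Map as M
    import Data.Word (Word8)
    type Dict = M.Map [Word8] Int
    -- One dictionary entry per possible input byte to start with.
    initialDict :: Dict
    initialDict = M.fromList [ ([b], fromIntegral b) | b <- [minBound .. maxBound] ]
    -- Emit the code of the longest phrase already in the dictionary,
    -- insert that phrase extended by one byte as a new entry, repeat.
    encode :: [Word8] -> [Int]
    encode = go initialDict []
      where
        go dict phrase []       = maybe [] (: []) (M.lookup phrase dict)
        go dict phrase (b : bs) =
          let phrase' = phrase ++ [b]
          in case M.lookup phrase' dict of
               Just _  -> go dict phrase' bs
               Nothing ->
                 let Just code = M.lookup phrase dict   -- phrase is always present here
                     dict'     = M.insert phrase' (M.size dict) dict
                 in code : go dict' [b] bs
Decoding rebuilds the same dictionary from the codes alone, which is the part that makes LZW such a neat trick.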
btw, in most cases, a C algorithm with today's C compilers will run faster than an asm one - because you need very good knowledge of CPU internals to write an efficient program. i dream about the day when the same becomes true for Haskell
We all do... *sigh* (Or rather, most of us dream. A tiny few people seem to be trying to actually do stuff about it...)
i never heard of lzw+huf compressors :) there are two lz families - lz78 (lzw), which was popular in the 80's and used variable-size words, and lz77 (lzss), which together with huffman was the most popular in the 90's. compress (.Z) and pkzip 1 were lzw, while zip is lzh (i.e. lz77+huffman)
there's not much point in using huffman with lzw because many words will have only one or two occurrences
Hmm... that's interesting. I was *sure* it said that LHA used LZW+Huffman... oh well! You say that there would be "not much point" in using a Huffman code to encode the output from an LZW stage. But consider this: The standard approach is for the LZW encoder to spit out 9 bits/symbol initially, and then 10 bits/symbol once the table fills up a bit more, and then 11 bits/symbol after that, and so on. (Usually until some hard-coded limit is reached. Typically the encoder just resets itself at that point or something.) So we're allocating more and more bits per symbol, but this is based only on how full the table is, *regardless* of whether any of these new symbols are even in fact *used*. ;-) Now, by using a Huffman code (scanning the actual produced output rather than working adaptively) we can avoid assigning bit combinations to unused symbols. (The downside of course is that now we need a Huffman table in the output - and any rare symbols might end up with rather long codewords. But remember: Huffman *guarantees* 0% compression or higher on the payload itself. A Huffman-compressed payload can *never* be bigger, only smaller or same size. So long as you can encode the Huffman table efficiently, you should be fine...)
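To make the "widening output" point above concrete, here is a small sketch (assuming the usual 9-bit starting width) of how the cost of each emitted code depends only on the current dictionary size, not on which codes actually get used:
    -- Bits needed for one LZW output code when the dictionary currently
    -- holds dictSize entries (codes 0 .. dictSize-1), never fewer than 9.
    codeWidth :: Int -> Int
    codeWidth dictSize = max 9 (head [ w | w <- [1 ..], 2 ^ w >= dictSize ])
Once the dictionary passes 512 entries, every code costs 10 bits, even if most of the newly added codes are never emitted - which is exactly the slack a post-hoc Huffman pass over the output would try to reclaim.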
btw, i invite you to the http://encode.ru/forums where you will find many compressor authors. you will be the second one there who uses haskell and the first one who writes compression algorithms in it :)
.ru = Russia?
OTOH, if the Gods behind GHC want to take a look and figure out whether there's any magic that can be added to the optimiser, I wouldn't complain... ;-)
(Realistically though. My program takes a [Word8] and turns it into a [Bool] before running a parser over it. The GHC optimiser doesn't really stand a hope in hell of optimising that into a program that reads a machine word into a CPU register and starts playing with bit flips on it...)
as time goes by, those terrible compilers become smarter and smarter :)
Oh hey, I think GHC is already pretty smart. But no optimiser can ever hope to cover *every* possible case. And transforming [Bool] -> [Bool] into UArray Word8 -> UArray Word8 just seems a liiiiittle bit optimistic, methinks. ;-)
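For reference, the [Word8] <-> [Bool] conversion being talked about looks roughly like this (hypothetical helpers, most-significant bit first, not the actual program):
    import Data.Bits (testBit)
    import Data.Word (Word8)
    -- Each byte becomes its 8 bits, most significant first.
    toBits :: [Word8] -> [Bool]
    toBits = concatMap (\w -> [ testBit w i | i <- [7, 6 .. 0] ])
    -- And back, padding the final partial byte with zero bits.
    fromBits :: [Bool] -> [Word8]
    fromBits [] = []
    fromBits bs = pack (take 8 (bs ++ replicate 7 False)) : fromBits (drop 8 bs)
      where
        pack = foldl (\acc b -> acc * 2 + (if b then 1 else 0)) 0
Every input byte becomes eight boxed list cells, which is exactly why it's unreasonable to expect the optimiser to turn this back into word-at-a-time bit twiddling.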
PPS. Does GHC make use of MMX, SSE, et al?
it can do that with via-C compilation, and there are plans for the native code generator too.
The way I heard it is that it *does* native code generation, but it's "not as good" as the via-C version. (Can't imagine why... You would have thought directly implementing functional primitives would be much easier than trying to mangle them to fit C's multitude of limitations. Still, what do I know?)
but it seems that you don't know asm and believe the advertising hype. i'm pretty sure that mmx can't help with the algorithms you are implementing, and you are very far from the level of performance where using mmx would give any improvement
MMX is probably useless. Some of the streaming stuff (with all the CPU cache twiddling and so on) might be useful for list processing. Frankly, it *all* seems so low-level that I can't begin to imagine how you could ever take advantage of this stuff without writing raw assembly by hand... (I read the IA32 manual one time. It was a long time ago though. I don't remember lots of detail. As for whether I "know assembly"... I've programmed the 6502 and the M68000, although not extensively. My point is that I know my way round the inside of a von Neumann computer, that's all... ;-)

On Sun, Jul 08, 2007 at 04:07:59PM +0100, Andrew Coppin wrote:
The way I heard it is that it *does* native code generation, but it's "not as good" as the via-C version. (Can't imagine why... You would have thought directly implementing functional primitives would be much easier than trying to mangle them to fit C's multitude of limitations. Still, what do I know?)
This is very true - but -fvia-C isn't really C. First it generates GCC C (with global register variables, among other things), then it feeds the output through gcc - and then it runs a GHC-specific Perl script called the Evil Mangler over the assembly output, which promptly regexes your code to death. If you want to see what a difference exists between true C and the NCG, try compiling your program with -fvia-C -unreg. -unreg tells GHC to generate (something very close to) ANSI C; also, the mangler is not used. -unreg is mainly used for porting... Stefan

Hello Stefan, Sunday, July 8, 2007, 7:22:03 PM, you wrote:
This is very true - but -fvia-C isn't really C.
If you want to see what a difference exists between true C and the NCG, try compiling your program with -fvia-C -unreg
even better, try to use jhc. it compiles via gcc and, unlike ghc, it generates natural code. as a result, it's the only haskell compiler that is able to generate C-quality code: http://thread.gmane.org/gmane.comp.lang.haskell.glasgow.user/8827 -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Andrew, Sunday, July 8, 2007, 7:07:59 PM, you wrote:
Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
i don't think that ppm is so complex - it's just the probability of a symbol in some context. it's just too slow in a naive implementation
there's not much point in using huffman with lzw because many words will have only one or two occurrences
(The downside of course is that now we need a Huffman table in the output - and any rare symbols might end up with rather long codewords. But remember: Huffman *guarantees* 0% compression or higher on the payload itself. A Huffman-compressed payload can *never* be bigger, only smaller or same size. So long as you can encode the Huffman table efficiently, you should be fine...)
the devil is in the details. just imagine the size of a huffman table with 64k entries :) huffman encoding is inappropriate for lzw output simply because most words will have only a few occurrences, and the economy of encoding them optimally doesn't justify the price of their entries in the table
.ru = Russia?
of course
(Realistically though. My program takes a [Word8] and turns it into a [Bool] before running a parser over it. The GHC optimiser doesn't really stand a hope in hell of optimising that into a program that reads a machine word into a CPU register and starts playing with bit flips on it...)
as time goes by, those terrible compilers become smarter and smarter :)
Oh hey, I think GHC is already pretty smart. But no optimiser can ever hope to cover *every* possible case. And transforming [Bool] -> [Bool] into UArray Word8 -> UArray Word8 just seems a liiiiittle bit optimistic, methinks. ;-)
15 years ago i wrote a very smart asm program (btw, it was an ARJ unpacker) with handmade function inlining, loop unrolling, register allocation, cpu recognition and so on. now, most of these tricks are standard for C compilers. times change, and now it's hard to imagine which optimizations will be available 10 years from now
The way I heard it is that it *does* native code generation, but it's "not as good" as the via-C version. (Can't imagine why... You would have thought directly implementing functional primitives would be much easier than trying to mangle them to fit C's multitude of limitations. Still, what do I know?)
ghc's native and via-C modes are a case of the blind versus the lame. in native mode, its code generator is comparable with 20-year-old C code generators - see above how much modern C compilers have changed in those years. in via-C mode it generates unnatural C code which is hard for any C compiler to optimize. jhc is a very different story - it generates natural C code, so for simple, "low-level" programs its speed is the same as C's. but it lacks GHC's huge base of high-level optimizations. as you may guess, in order to have C-like speed, a compiler should implement both good high-level and good low-level optimizations. there are (low-priority) plans to provide an LLVM backend for ghc which may solve this problem. but actually, the speed of generated code isn't the number 1 priority for ghc users -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Sunday, July 8, 2007, 7:07:59 PM, you wrote:
i don't think that ppm is so complex - it's just the probability of a symbol in some context. it's just too slow in a naive implementation
Oh, sure, the *idea* is simple enough. Trying to actually *implement* it correctly is something else... ;-) (Same statements go for arithmetic coding, really.)
(The downside of course is that now we need a Huffman table in the output - and any rare symbols might end up with rather long codewords. But remember: Huffman *guarantees* 0% compression or higher on the payload itself. A Huffman-compressed payload can *never* be bigger, only smaller or same size. So long as you can encode the Huffman table efficiently, you should be fine...)
the devil is in the details. just imagine the size of a huffman table with 64k entries :) huffman encoding is inappropriate for lzw output simply because most words will have only a few occurrences, and the economy of encoding them optimally doesn't justify the price of their entries in the table
...which is why you need to "encode the Huffman table efficiently", to quote myself. ;-) Using canonical Huffman, you only actually need to know how many bits were assigned to each symbol. This information is probably very amenable to RLE. (Which, incidentally, is why I started this whole "parser on top of a phaser" craziness in the first place.) So, yeah, there may be 64k symbols - but if only 1k of them are ever *used*... ;-)
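A sketch of the canonical assignment being described (my own illustrative version, not any particular implementation): given just the code length chosen for each symbol that actually occurs, both ends can rebuild identical codewords, so the "table" shrinks to a list of lengths.
    import Data.List (sortBy)
    -- Assign canonical codes from (symbol, code length) pairs: sort by
    -- (length, symbol), then each codeword is the previous one plus 1,
    -- left-shifted whenever the length increases.
    canonicalCodes :: Ord sym => [(sym, Int)] -> [(sym, [Bool])]
    canonicalCodes lens = go 0 0 (sortBy (\(s1, l1) (s2, l2) -> compare (l1, s1) (l2, s2)) lens)
      where
        go _    _       []                = []
        go code prevLen ((s, len) : rest) =
          let code' = code * 2 ^ (len - prevLen)
          in (s, bitsOf len code') : go (code' + 1) len rest
        bitsOf n c = [ odd (c `div` 2 ^ i) | i <- [n - 1, n - 2 .. 0] ]
The list of lengths is the part that would then get run-length encoded.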
.ru = Russia?
of course
My Russian is very rusty. ;-)
Oh hey, I think GHC is already pretty smart. But no optimiser can ever hope to cover *every* possible case. And transforming [Bool] -> [Bool] into UArray Word8 -> UArray Word8 just seems a liiiiittle bit optimistic, methinks. ;-)
15 years ago i wrote a very smart asm program (btw, it was an ARJ unpacker) with handmade function inlining, loop unrolling, register allocation, cpu recognition and so on. now, most of these tricks are standard for C compilers. times change, and now it's hard to imagine which optimizations will be available 10 years from now
Yes, but there are limits to what an optimiser can hope to accomplish. For example, you wouldn't implement a bubble sort and seriously expect the compiler to be able to "optimise" that into a merge sort, would you? ;-)
ghc's native and via-C modes are a case of the blind versus the lame. in native mode, its code generator is comparable with 20-year-old C code generators - see above how much modern C compilers have changed in those years. in via-C mode it generates unnatural C code which is hard for any C compiler to optimize.
I'll take your word for it. ;-) (I have made cursory attempts to comprehend the inner workings of GHC - but this is apparently drastically beyond my powers of comprehension.)
jhc is a very different story
Yes - last I heard, it's an experimental research project rather than a production-ready compiler...

On Monday 09 July 2007, Andrew Coppin wrote: <snip>
Yes, but there are limits to what an optimiser can hope to accomplish.
For example, you wouldn't implement a bubble sort and seriously expect the compiler to be able to "optimise" that into a merge sort, would you? ;-)
Something like it's been done; compilers can take in a quadratic time list reverse and substitute a linear time version: http://homepages.inf.ed.ac.uk/wadler/papers/vanish/vanish.ps.gz (pg. 2-3). One of the better all-time papers by the major Haskell researchers, I'd say. Jonathan Cast http://sourceforge.net/projects/fid-core http://sourceforge.net/projects/fid-emacs
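For anyone who hasn't seen the paper, this is the flavour of rewrite it describes - the quadratic reverse and the accumulator version it effectively becomes once the concatenations "vanish" (a sketch, not the paper's notation):
    -- Quadratic: every (++) re-walks the front of the list.
    slowReverse :: [a] -> [a]
    slowReverse []       = []
    slowReverse (x : xs) = slowReverse xs ++ [x]
    -- Linear: the accumulator version the transformation produces.
    fastReverse :: [a] -> [a]
    fastReverse xs = go xs []
      where
        go []       acc = acc
        go (y : ys) acc = go ys (y : acc)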

On Mon, Jul 09, 2007 at 09:15:07PM +0100, Andrew Coppin wrote:
Bulat Ziganshin wrote:
Hello Andrew,
Sunday, July 8, 2007, 7:07:59 PM, you wrote:
i don't think that ppm is so complex - it's just the probability of a symbol in some context. it's just too slow in a naive implementation
Oh, sure, the *idea* is simple enough. Trying to actually *implement* it correctly is something else... ;-)
Took me about an hour and 50 lines of code (about a year ago - this was one of my first Haskell programs) to implement a PPM (de)compressor that didn't crash, always generated the same output as input, and achieved 50% ratios on its own source code (not quite as good as gzip, but what do you expect from a completely untuned compressor?). Peak throughput: 2 bits / sec. :)
jhc is a very different story
Yes - last I heard, it's an experimental research project rather than a production-ready compiler...
Correct. It requires 5 minutes and 600MB of RAM to compile Hello, World, and fails with internal pattern match errors on anything significantly larger. Stefan

Stefan O'Rear wrote:
On Mon, Jul 09, 2007 at 09:15:07PM +0100, Andrew Coppin wrote:
Oh, sure, the *idea* is simple enough. Trying to actually *implement* it correctly is something else... ;-)
Took me about an hour and 50 lines of code (about a year ago - this was one of my first Haskell programs) to implement a PPM (de)compressor that didn't crash, always generated the same output as input, and achieved 50% ratios on its own source code (not quite as good as gzip, but what do you expect from a completely untuned compressor?).
...aren't you one of those insanely intelligent people working on GHC? :-P Actually, I wrote a working compressor in JavaScript. (You know, as an interactive web page. No, I don't still have it...) At least, I presume it works... I never had occasion to try to decompress the output! (I remember it had no underflow handling though...) Now, writing one that works in binary rather than decimal, and isn't absurdly inefficient? (I.e., it *doesn't* work by converting everything to decimal strings, jiggling the characters around, and then parsing back to machine integers.) That is something I haven't managed yet...
Peak throughput: 2 bits / sec. :)
How fast can you click buttons? That was mine's peak output. ;-)
jhc is a very different story
Yes - last I heard, it's an experimental research project rather than a production-ready compiler...
Correct. It requires 5 minutes and 600MB of RAM to compile Hello, World, and fails with internal pattern match errors on anything significantly larger.
Ouch. That's pretty advanced... :-D

G'day all. Andrew Coppin wrote:
Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
Not really. LZW is basically PPM with a static (and flat!) frequency prediction model. The contexts are built in a very similar way.
Quoting Bulat Ziganshin
the devil is in the details. just imagine the size of a huffman table with 64k entries :)
If you're allowed to pick your code values, you can do this extremely compactly. Some compression schemes optimised for English separate words from the intermediate space and assign a Huffman code to each word. The tables aren't that big (though the space to hold the words might be). Cheers, Andrew Bromage
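A tiny sketch of the word/space split such a word-based model starts from (illustrative only; real schemes are rather more careful about what counts as a word):
    import Data.Char (isAlpha)
    import qualified Data.Map as M
    -- Split text into alternating word / non-word tokens.
    tokens :: String -> [String]
    tokens []        = []
    tokens s@(c : _) = tok : tokens rest
      where
        (tok, rest) = span (\x -> isAlpha x == isAlpha c) s
    -- Count them; these counts are what would drive the Huffman codes
    -- for the words and for the separators.
    tokenCounts :: String -> M.Map String Int
    tokenCounts = M.fromListWith (+) . map (\t -> (t, 1)) . tokens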

Hello ajb, Wednesday, July 11, 2007, 7:55:22 AM, you wrote:
Not really. LZW is basically PPM with a static (and flat!) frequency prediction model. The contexts are built in a very similar way.
what you mean by "flat" and "static" applied to PPM? static PPM models exist - they carry probabilities as separate table very like static Huffman encoding. is "flat" the same as order-0?
the devil is in the details. just imagine the size of a huffman table with 64k entries :)
If you're allowed to pick your code values, you can do this extremely compactly. Some compression schemes optimised for English separate words from the intermediate space and assign a Huffman code to each word. The tables aren't that big (though the space to hold the words might be).
can you give a link? i've never heard of such an algorithm -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello ajb,
Wednesday, July 11, 2007, 7:55:22 AM, you wrote:
Not really. LZW is basically PPM with a static (and flat!) frequency prediction model. The contexts are built in a very similar way.
what you mean by "flat" and "static" applied to PPM? static PPM models exist - they carry probabilities as separate table very like static Huffman encoding. is "flat" the same as order-0?
I think he meant "flat" as in "pr(z) = 0 | 1"... As for "static", I'm not so sure that's actually correct.

G'day.
Quoting Bulat Ziganshin
what you mean by "flat" and "static" applied to PPM? static PPM models exist - they carry probabilities as separate table very like static Huffman encoding. is "flat" the same as order-0?
"Static" means that the frequency distribution is fixed. "Flat" means that the frequency histogram is flat. (Every code word is predicted to occur with the same frequency, resulting in a binary code.)
can you give a link? i've never heard of such an algorithm
http://en.wikipedia.org/wiki/Canonical_Huffman_code Cheers, Andrew Bromage

Hello ajb, Thursday, July 12, 2007, 9:25:35 AM, you wrote:
what you mean by "flat" and "static" applied to PPM? static PPM models exist - they carry probabilities as separate table very like static Huffman encoding. is "flat" the same as order-0?
"Static" means that the frequency distribution is fixed. "Flat" means that the frequency histogram is flat. (Every code word is predicted to occur with the same frequency, resulting in a binary code.)
well, ppm differs from a simple order-0 coder in that it uses previous symbols to calculate probabilities. lzw differs from an order-0 coder with flat static probabilities in that it encodes a whole word each time. there is nothing in common between ppm and lzw except for this basic order-0 model, so i don't see any reason to call lzw a sort of ppm. it would be fairer to say that lzw is an order-0 coder with static flat probabilities, don't you agree?
can you give a link? i've never heard of such an algorithm
you mentioned an algorithm that efficiently encodes english words using huffman codes. does such an algorithm really exist? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

G'day.
Quoting Bulat Ziganshin
well, ppm differs from a simple order-0 coder in that it uses previous symbols to calculate probabilities. lzw differs from an order-0 coder with flat static probabilities in that it encodes a whole word each time.
LZW doesn't have the concept of a "word" at all. Both PPM and LZW build up contexts as they go by starting with one context for each symbol in the input alphabet, and by extending contexts with single symbols as they are found. (Conceptually; the algorithms are slightly different, but they do the same job.) The difference has to do, almost completely, with coding. LZW uses a degenerate model to decide what the next input symbol is. It assumes that all "seen" contexts are equally likely, and hence uses a binary code to encode them. (A "binary" code represents the numbers 0 to k-1 using k codewords each of which is (log k) bits in length.) PPM uses a different coding scheme, using the observed frequency of each context to drive an arithmetic coder.
there is nothing in common between ppm and lzw except for this basic order-0 model, so i don't see any reason to call lzw a sort of ppm.
There are a lot of differences between PPM and LZW, but they're based around the same principle: Build contexts based on what you've seen so far, and use that to predict what the next symbol should be.
it would be fairer to say that lzw is an order-0 coder with static flat probabilities, don't you agree?
That's the _coding_ phase, sure. All known text compression algorithms basically work by building a model of the text, to try to predict the next symbol, and then using that to drive a coder which represents what was actually found using some bit representation. PPM and LZW use very different coders, but their models are quite similar.
you mentioned an algorithm that efficiently encodes english words using huffman codes. does such an algorithm really exist?
Ask Google about "word-based huffman". Cheers, Andrew Bromage

Hello ajb, Thursday, July 12, 2007, 12:16:22 PM, you wrote:
well, ppm differs from a simple order-0 coder in that it uses previous symbols to calculate probabilities. lzw differs from an order-0 coder with flat static probabilities in that it encodes a whole word each time.
LZW doesn't have the concept of a "word" at all.
in that case, why do you propose to encode lzw words with huffman codes? :)
Both PPM and LZW build up contexts as they go by starting with one context for each symbol in the input alphabet, and by extending contexts with single symbols as they are found. (Conceptually; the algorithms are slightly different, but they do the same job.)
The difference has to do, almost completely, with coding. LZW uses a degenerate model to decide what the next input symbol is. It assumes that all "seen" contexts are equally likely, and hence uses a binary code to encode them. (A "binary" code represents the numbers 0 to k-1 using k codewords each of which is (log k) bits in length.)
oops. ppm builds a tree of contexts and uses a context to encode one char. lzw builds a dictionary of words and encodes a word in the empty context. encoding in the empty context is equivalent to an order-0 coder, while "prediction by partial matching" by definition is order-N, N>0 encoding. so lzw can't be considered a degenerate case of ppm, although the programming techniques (building a tree of words/contexts) are pretty close
it would be fairer to say that lzw is an order-0 coder with static flat probabilities, don't you agree?
That's the _coding_ phase, sure.
you are right. so lzw is just a dictionary-based transformation which replaces some words with special codes, while ppm replaces chars with their probabilities. i hope you will agree that ppm with flat codes would be totally useless
you mentioned an algorithm that efficiently encodes english words using huffman codes. does such an algorithm really exist?
Ask Google about "word-based huffman".
which particular algorithm did you mean? Moffat? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 7/12/07, Bulat Ziganshin
which particular algorithm did you mean? Moffat?
Well, both Andrew and I had Alistair Moffat as a lecturer in our time. So, surely. :-) If my memory serves me, you'll find Alistair has published work on quite a number of algorithms, some of which are symbol based (0, 1 or higher order), and others which are word based. cheers, T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

G'day all.
Quoting Bulat Ziganshin
in that case, why do you propose to encode lzw words with huffman codes? :)
I don't. :-)
oops. ppm builds a tree of contexts and uses a context to encode one char. lzw builds a dictionary of words and encodes a word in the empty context.
As you noted later, the "dictionary of words" in LZW is also tree-structured. "Words", as you call them, are built by extending words with single characters, in pretty much exactly the same way that PPM contexts are built by extending contexts with single characters. The main difference is this: To encode a "word" in PPM, you encode the conditional probability that it takes to get to the end of the word from the start of the word. It looks like you're encoding single characters (and, programmatically, you are), but because of the way that arithmetic coding works, it's mathematically equivalent to encoding the probability of finding the word as a whole. You're just decomposing the probability of finding the word into the probabilities of finding the individual input symbols along the way. Does that make sense?
you are right. so lzw is just a dictionary-based transformation which replaces some words with special codes, while ppm replaces chars with their probabilities. i hope you will agree that ppm with flat codes would be totally useless
Absolutely. But augmenting LZW with probabilities to allow for arithmetic coding wouldn't be so dumb, if it weren't for the fact that a) we already have more efficient compression algorithms, and b) it'd destroy the main benefit of LZW (which is that it's effective on slow CPUs with small memories; perfect for 80s-era micros and 8-bit embedded systems).
which particular algorithm did you mean? Moffat?
Yes, though if your local university library has a copy, "Managing Gigabytes" might be a better introduction to the topic. Cheers, Andrew Bromage

Hello ajb, Friday, July 13, 2007, 4:50:15 AM, you wrote:
To encode a "word" in PPM, you encode the conditional probability that it takes to get to the end of the word from the start of the word. It looks like you're encoding single characters (and, programmatically, you are), but because of the way that arithmetic coding works, it's mathematically equivalent to encoding the probability of finding the word as a whole. You're just decomposing the probability of finding the word into the probabilities of finding the individual input symbols along the way.
Does that make sense?
it seems that you either misunderstood PPM or make too wide assumptions. in classical fixed-order ppm, each char's probability is found in the context of the previous N chars. i.e. when encoding abcdef.. with order 3, 'd' is encoded in the context of 'abc', 'e' in the context of 'bcd', and so on
when you consider lzw as a sort of ppm, the context order grows from 0 up to the size of the phrase, as shown in the lzrw5 paper (http://www.cbloom.com/src/lzrw.zip). so you may consider lzw either as "flat static", i.e. order -1, encoding of *words*, or as floating-order encoding of chars. you just mixed up two ideas. don't forget that an order -1 coder isn't ppm - it's just trivial copying. as i already said, using tries for the model representation means similarity in implementation techniques but not in mathematical properties
regarding lzw+huffman - i think the key to efficient huffman encoding of English words was sorting the dictionary by word frequency, which is inappropriate for lzw. also, unlike an English dictionary, the lzw dictionary grows dynamically, which means that static huffman techniques will lose part of the redundancy -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

G'day all.
Quoting Bulat Ziganshin
it seems that you either misunderstood PPM or make too wide assumptions.
More likely I'm not explaining myself well.
in classical fixed-order ppm, each char's probability is found in the context of the previous N chars. i.e. when encoding abcdef.. with order 3, 'd' is encoded in the context of 'abc', 'e' in the context of 'bcd', and so on
Let's suppose that "abcdef" is a context in PPM, and we're starting from the "start" state (which usually isn't the case, of course). Then to encode it, you encode the probability of finding an "a" in the start state, followed by the probability of finding a "b" in the "a" state, followed by the probability of finding a "c" in the "ab" state... The effect of encoding these multiple symbols is equivalent to the effect of encoding a single symbol, whose probability is the individual probabilities multiplied together. Or, if you like, it's the conditional probability of encoding a string starting with "abcdef" from the start state, taken as a proportion of _all_ possible encoded strings. Pr("abcdef"|"") = Pr("a"|"") * Pr("b"|"a") * Pr("c"|"ab" ) * ... LZW uses a simpler model, estimating Pr("abcdef"|S) as 1/N, where N is the number of "states" seen so far. Cheers, Andrew Bromage

Hello ajb, Friday, July 13, 2007, 12:10:54 PM, you wrote:
it seems that you either misunderstood PPM or make too wide assumptions.
More likely I'm not explaining myself well.
in classical fixed-order ppm, each char's probability is found in the context of the previous N chars. i.e. when encoding abcdef.. with order 3, 'd' is encoded in the context of 'abc', 'e' in the context of 'bcd', and so on
Let's suppose that "abcdef" is a context in PPM, and we're starting from the "start" state (which usually isn't the case, of course). Then to encode it, you encode the probability of finding an "a" in the start state, followed by the probability of finding a "b" in the "a" state, followed by the probability of finding a "c" in the "ab" state...
sorry, but you really misunderstand how PPM works and why it's so efficient. it looks up the probability of each symbol in the context of the previous N chars, where N is fixed. lzw can be directly compared with the ppm modification you described here, but this modification isn't ppm itself. also, because ppm encodes each symbol separately, making its probabilities flat effectively disables compression - the reason flat probabilities still work in lzw is that it encodes a whole word with this flat model -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

G'day.
Quoting Bulat Ziganshin
sorry, but you really misunderstand how PPM works and why it's so efficient.
I understand PPM. I really do. I think what you don't understand is why LZW does as well as it does considering how simple it is. It's because it's a poor but cheap approximation to what PPM does. Part of the misunderstanding is that you're looking at what PPM does algorithmically, whereas I'm talking about what it's equivalent to mathematically. It turns out that pretty much all known text compression algorithms are "equivalent" in the sense that they give the same compression ratio on the same text to within a (possibly large) constant factor. BZip, for example, gives a compression ratio that's almost always within 5% or so of the best PPM variants given an input text. How could that be, since they're so different? The answer is that they're actually _not_ that different, mathematically speaking. Cheers, Andrew Bromage

Hello ajb, Thursday, July 12, 2007, 12:16:22 PM, you wrote:
well, ppm differs from a simple order-0 coder in that it uses previous symbols to calculate probabilities. lzw differs from an order-0 coder with flat static probabilities in that it encodes a whole word each time.
oh, i just recalled that a flat distribution is conventionally called an order -1 model. so, starting from the non-compressible order -1 model, lzw and ppm go in opposite directions:
* ppm uses several previous chars to predict the next character
* lz77/78 predicts (in the order -1 model) several next chars at once
if you're interested, there are many algorithms combining both approaches: lzrw5/rolz, lzp, lzcb, lzma -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

ajb@spamcop.net wrote:
G'day all.
Andrew Coppin wrote:
Actually, LZW works surprisingly well for such a trivial little algorithm... When you compare it to the complexity of an arithmetic coder driven by a high-order adaptive PPM model (theoretically the best general-purpose algorithm that exists), it's amazing that anything so simple as LZW should work at all!
Not really. LZW is basically PPM with a static (and flat!) frequency prediction model. The contexts are built in a very similar way.
Yes - but making it use a non-flat model opens a whole Pandora's Box of fiddly programming. ;-) (It's not "hard" as such - just frustratingly fiddly to get correct...)

On 7/12/07, Andrew Coppin
Yes - but making it use a non-flat model opens a whole Pandora's Box of fiddly programming. ;-)
This could just about be Rule No 1 of haskell programming: if it's fiddly, then you haven't thought about the problem hard enough. Corollary No 1 is Any Expression requiring more than 80 columns is fiddly. :-) I say this in jest, but it is "ha ha, only serious". cheers, T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

Thomas Conway wrote:
On 7/12/07, Andrew Coppin wrote:
Yes - but making it use a non-flat model opens a whole Pandora's Box of fiddly programming. ;-)
This could just about be Rule No 1 of haskell programming: if it's fiddly, then you haven't thought about the problem hard enough.
Corollary No 1 is Any Expression requiring more than 80 columns is fiddly.
:-)
I say this in jest, but it is "ha ha, only serious".
LOL! Probably... The idea behind PPM is very simple and intuitive. But working out all the probabilities such that they always sum to 100% is surprisingly hard. :-(

Hello Andrew, Thursday, July 12, 2007, 10:06:52 PM, you wrote:
The idea behind PPM is very simple and intuitive. But working out all the probabilities such that they always sum to 100% is surprisingly hard. :-(
why? you just add up the probabilities of all symbols in this context and add 1 for the escape symbol -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
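In code, the estimate Bulat describes comes out roughly like this (a sketch of a PPMA-style escape, not anyone's actual implementation): each seen symbol in the context gets count/(total+1), the escape gets 1/(total+1), and everything sums to 1 by construction.
    import qualified Data.Map as M
    type Counts sym = M.Map sym Int
    -- Probabilities for one context: the seen symbols plus the escape.
    contextProbs :: Counts sym -> (M.Map sym Double, Double)
    contextProbs counts = (M.map seen counts, escape)
      where
        total  = fromIntegral (sum (M.elems counts))
        seen c = fromIntegral c / (total + 1)
        escape = 1 / (total + 1)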
participants (6): ajb@spamcop.net, Andrew Coppin, Bulat Ziganshin, Jonathan Cast, Stefan O'Rear, Thomas Conway