
Hi,

I'd like to be able to use Data.Binary (or similar) for compression. Say I have an abstract type Symbol, and for each value of Symbol I have a representation in terms of some number of bits. For compression to be efficient, commonly-used Symbols should have very short representations, while less common ones can be longer.

Since an encoding like [Bool] would be really inefficient for this (at least I think it would, though some fancy fusion tricks might be able to help), I was thinking a reasonable approach might be to use Word8 (for example), and then specify a number of bits n, indicating that only the first n bits are to be written to the compressed representation. I was looking at the internals of Data.Binary, and it seems that PutM could be used for this purpose (was something like this its original purpose?). Today, I put this together:

------------------------------------
import Data.Bits ((.|.), shiftL, shiftR)
import Data.Word (Word8)
import Data.Binary.Put (PutM, putWord8)

-- a partial byte: the first NBits bits (left-aligned) of the Word8 are valid
type BitRep = Word8
type NBits  = Int
type MyBits = (BitRep, NBits)

-- append one partial byte to another, emitting a full byte whenever
-- 8 bits have accumulated and carrying the remainder
(#) :: MyBits -> MyBits -> PutM MyBits
(a, m) # (b, n) =
  case (a .|. (b `shiftR` m), m + n) of
    ans@(ab, s) ->
      if s < 8
        then return ans
        else putWord8 ab >> return (b `shiftL` (8 - m), s - 8)
------------------------------------

Then it would be easy enough to map [Symbol] -> [MyBits], and then use something like foldM (#) to get into the PutM monad.

A couple of questions:

(1) Am I reinventing the wheel? I haven't seen anything like this, but it would be nice to be a bit more certain.

(2) This seems like it will work ok, but the feel is not as clean as the current Data.Binary interface. Is there something I'm missing that might make it easier to integrate this?

(3) Right now this is just proof of concept, but eventually I'd like to do some performance tuning, and it would be nice to have a representation that's amenable to this. Any thoughts on speeding this up while keeping the interface reasonably clean would be much appreciated.

Thanks!

-Chad
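For concreteness, a minimal sketch of how (#) might be driven over a whole symbol stream, reusing MyBits and (#) from the snippet above. The codeFor argument is a hypothetical lookup from a symbol to its (left-aligned bits, bit count) pair, and the final partial byte still has to be flushed:

------------------------------------
import Control.Monad (foldM, when)
import Data.Binary.Put (Put, putWord8)

-- sketch only: fold (#) over the per-symbol codes, then flush
putSymbols :: (sym -> MyBits) -> [sym] -> Put
putSymbols codeFor syms = do
  (leftover, n) <- foldM (#) (0, 0) (map codeFor syms)
  when (n > 0) (putWord8 leftover)   -- pad the trailing partial byte
------------------------------------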

On Wed, Nov 14, 2007 at 10:03:52PM -0800, Chad Scherrer wrote:
Hi,
I'd like to be able to use Data.Binary (or similar) for compression. Say I have an abstract type Symbol, and for each value of Symbol I have a representation in terms of some number of bits. For compression to be efficient, commonly-used Symbols should have very short representations, while less common ones can be longer. ... (1) Am I reinventing the wheel? I haven't seen anything like this, but it would be nice to be a bit more certain.
(2) This seems like it will work ok, but the feel is not as clean as the current Data.Binary interface. Is there something I'm missing that might make it easier to integrate this?
(3) Right now this is just proof of concept, but eventually I'd like to do some performance tuning, and it would be nice to have a representation that's amenable to this. Any thoughts on speeding this up while keeping the interface reasonably clean would be much appreciated.
Almost all 'real users' just use Codec.Compression.GZip. It's very fast, very compositional, and (perhaps surprisingly) almost as effective as application-specific schemes.

Stefan

stefanor:
Almost all 'real users' just use Codec.Compression.GZip. It's very fast, very compositional, and (perhaps surprisingly) almost as effective as application-specific schemes.
I was about to say the same thing. So so much simpler to use Duncan's carefully written zlib binding,

import Data.Binary
import Codec.Compression.GZip
import qualified Data.ByteString.Lazy as L

main = L.writeFile "log.gz" . compress . encode $ [1..10::Int]

Simple, purely functional, fast.

-- Don
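Reading the file back is just as short. A quick round-trip sketch, assuming the same "log.gz" file and element type as above:

------------------------------------
import Data.Binary (decode)
import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as L

-- sketch only: decompress, then decode back to the original list
main :: IO ()
main = do
  bytes <- L.readFile "log.gz"
  print (decode (decompress bytes) :: [Int])
------------------------------------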

Don Stewart
I was about to say the same thing. So so much simpler to use Duncan's carefully written zlib binding,
import Data.Binary
import Codec.Compression.GZip
import qualified Data.ByteString.Lazy as L

main = L.writeFile "log.gz" . compress . encode $ [1..10::Int]
Simple, purely functional, fast.
Don,

But something like this is needed (or at least I would like it). I'd like functions to get and put bits as in NewBinary. Is this not a way of doing it?

Dominic

Almost all 'real users' just use Codec.Compression.GZip. It's very fast, very compositional, and (perhaps surprisingly) almost as effective as application-specific schemes.
I have several types of symbols, and for each type the probabilities are very predictable - to the point where they could even be hard-coded. And upon completion I can be sure the first two questions will be "Can we make it smaller?" and "Can we make it faster?". GZip (while very cool) is adaptive and general-purpose, so it's building frequency tables as it goes and ignoring the structure of the data I should be able to take advantage of.

With an awful lot of trouble, it must be possible to write something in C to go faster and yield better compression than gzip for this particular data. With the probability structure known in advance, there are just a lot of steps taken by gzip that are no longer needed. Besides this, gzip only assumes an arbitrary sequence of bytes, but my data are much more structured than this.

Considering the high performance achieved using idiomatic Haskell and the ByteString and Binary libraries, I would think a similar approach could be used for writing individual bits. Then it would be (relatively) easy to write custom compression routines in Haskell with reasonable performance - I don't think this can be said of any other language.

Chad
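To illustrate what "hard-coded" could look like here, a sketch of a hypothetical symbol type with a fixed prefix code, expressed as the (left-aligned bits, bit count) pairs of the MyBits representation from the first message; the symbols and code words are made up, and a real table would be chosen from the actual frequencies:

------------------------------------
-- hypothetical symbols with a hard-coded prefix code (sketch only)
data Sym = A | B | C | D

codeFor :: Sym -> MyBits
codeFor A = (0x00, 1)  -- 0    (the most common symbol gets the shortest code)
codeFor B = (0x80, 2)  -- 10
codeFor C = (0xC0, 3)  -- 110
codeFor D = (0xE0, 3)  -- 111
------------------------------------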

On Thu, Nov 15, 2007 at 11:10:01AM -0800, Chad Scherrer wrote:
I have several types of symbols, and for each type the probabilities are very predictable - to the point where they could even be hard-coded. And upon completion I can be sure the first two questions will be "Can we make it smaller?" and "Can we make it faster?". GZip (while very cool) is adaptive and general-purpose, so it's building frequency tables as it goes and ignoring the structure of the data I should be able to take advantage of.
With an awful lot of trouble, it must be possible to write something in C to go faster and yield better compression than gzip for this particular data. With the probability structure known in advance, there are just a lot of steps taken by gzip that are no longer needed. Besides this, gzip only assumes an arbitrary sequence of bytes, but my data are much more structured than this.
The catch is that gzip can beat even ideal arithmetic compression, if there happen to be correlations between symbols. So what you claim is correct only if there are no correlations other than those taken into account in your known probability structure.

Any chance you can tell us what this mysterious data is? But bit stream operations (and data compression) are seriously cool in any case, so I hope you'll go ahead with this!

-- David Roundy
Department of Physics
Oregon State University

Hello Chad,

Thursday, November 15, 2007, 9:03:52 AM, you wrote:
I'd like to be able to use Data.Binary (or similar) for compression. Say I have an abstract type Symbol, and for each value of Symbol I have a representation in terms of some number of bits. For compression to be efficient, commonly-used Symbols should have very short representations, while less common ones can be longer.
An alternative may be to use a naive representation for serializing and then run a zip/bzip2 compression library over it. It should be both faster and provide better compression.

--
Best regards,
Bulat                          mailto:Bulat.Ziganshin@gmail.com

On Wed, Nov 14, 2007 at 10:03:52PM -0800, Chad Scherrer wrote:
I'd like to be able to use Data.Binary (or similar) for compression. Say I have an abstract type Symbol, and for each value of Symbol I have a representation in terms of some number of bits. For compression to be efficient, commonly-used Symbols should have very short representations, while less common ones can be longer.
I agree with others that it's probably not worth your effort to do compression yourself except for fun, but it *is* fun, and interpret my advice below in that light. (Also, bitwise operations could be useful for other things, like interacting with standard formats, e.g. writing IEEE doubles portably, something that Data.Binary doesn't do.) ...
(2) This seems like it will work ok, but the feel is not as clean as the current Data.Binary interface. Is there something I'm missing that might make it easier to integrate this?
I would write this as a monad, analogous to PutM. Make bit monads GetBits and PutBitsM (with PutBits = PutBitsM ()), and then write functions like

writeBits :: PutBits -> Put
readBits :: GetBits a -> Get a

where writeBits and readBits would pad their reading/writing to the next byte boundary (or perhaps Word32 boundary, for better performance?) as they must. So now this would have two main uses: users could use it to serialize data (e.g. writing an IEEE Double serialization), or could use it to write their own data compression, by putting *all* the serialization into the Bits level. This approach, of course, also would allow you to copy much of the infrastructure of Binary into your new bit-level interface.
(3) Right now this is just proof of concept, but eventually I'd like to do some performance tuning, and it would be nice to have a representation that's amenable to this. Any thoughts on speeding this up while keeping the interface reasonably clean would be much appreciated.
I think a monad as above would have the advantage of separating the implementation from the interface, which should make it tuneable.

-- David Roundy
Department of Physics
Oregon State University
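One possible shape for the PutBitsM suggestion above, as a sketch only: the names follow the previous message, but the implementation is just one guess that threads the current partial byte through PutM, and (like the original (#)) it assumes each write is at most 8 bits. The Functor/Applicative/Monad instances, which would simply thread the state, are omitted:

------------------------------------
import Control.Monad (when)
import Data.Bits ((.|.), shiftL, shiftR)
import Data.Word (Word8)
import Data.Binary.Put (PutM, Put, putWord8)

-- state: the current partial byte (bits left-aligned) and its bit count
newtype PutBitsM a =
  PutBitsM { runPutBitsM :: (Word8, Int) -> PutM (a, (Word8, Int)) }

type PutBits = PutBitsM ()

-- append n bits (left-aligned in b), emitting a byte when 8 have accumulated
putBits :: Word8 -> Int -> PutBits
putBits b n = PutBitsM $ \(a, m) ->
  let ab = a .|. (b `shiftR` m)
      s  = m + n
  in if s < 8
       then return ((), (ab, s))
       else putWord8 ab >> return ((), (b `shiftL` (8 - m), s - 8))

-- run a bit writer, padding the final partial byte to a byte boundary
writeBits :: PutBits -> Put
writeBits p = do
  (_, (a, m)) <- runPutBitsM p (0, 0)
  when (m > 0) (putWord8 a)
------------------------------------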
participants (6)
- Bulat Ziganshin
- Chad Scherrer
- David Roundy
- Dominic Steinitz
- Don Stewart
- Stefan O'Rear