Abstraction leak

Andrew Coppin

29 Jun 2007 29 Jun '07

6:39 p.m.

...and again today I found myself trying to do something that would be very easy in an imperative language, but I cannot think of a single good way of doing it in Haskell. Hopfully somebody can give me some hints. I'm writing a whole bunch of data compression programs. In particular, one of them was a Huffman encoder. The encoder scans the input file, computes symbol frequencies, and builds a canonical Huffman table, which it then dumps to file. The actual compressed data goes after that. The encoder consists of two functions. One takes the Huffman table and serializes it. The other does the actual encoding. The (++) operator joins the two together before going to file. Getting the data back is mildly more tricky. There's a decoder function that takes the complete data stream, parses out the Huffman table, and returns a pair containing the Huffman table and the remaining data. Then a second decoder function actually does the Huffman decoding on this data. So far, it's all pretty easy. The size of the Huffman table determins how many bytes the decoder needs to read, so it's easy to figure out where the table ends. And then I had a brainwave. I altered the encoder so that the output from the Huffman table serialize function codes through a simple RLE encoder before going to file. Then a went and altered the decoder to... uh... oops. Now I have a problem. It's easy enough to pass the entire data stream through an RLE decoder and feed that to the Huffman table deserialize function, and it will give be back the table. But I now have *no clue* where the table ends in the original stream! If this was Java, you would write all these compression and decompression stages as stream wrappers. So you wrap the raw input stream with an RLE decoder, have the function read the Huffman table, and then take off the RLE decoder and process the rest of the stream. However, in Haskell, the reading and deserializing of the Huffman table doesn't so anything detectable to the original data stream, so you can't tell how much of it the RLE decoder actually processed. Um... help! (What I did in the end was to write the size of the RLE encoded Huffman table to the file and read it back in the decoder. But this shouldn't actually be necessary...)

Show replies by date

Miles Sabin

29 Jun 29 Jun

6:49 p.m.

Andrew Coppin wrote,

...

If this was Java, you would write all these compression and decompression stages as stream wrappers. So you wrap the raw input stream with an RLE decoder, have the function read the Huffman table, and then take off the RLE decoder and process the rest of the stream.

Except that if the RLE decoding stream wrapper contains any internal buffering, then stripping it off would very likely result in data loss. What you actually have to do is have the RLE decoding stream wrapper build and return you a stream wrapper which delivers the remainder of the stream. Which I think shows that the abstraction isn't leaky ... where "the remainder" starts is very much dependent on the precise encoding of the prefix of the stream. Cheers, Miles

Andrew Coppin

6:55 p.m.

Miles Sabin wrote:

...

Andrew Coppin wrote,

...
If this was Java, you would write all these compression and decompression stages as stream wrappers. So you wrap the raw input stream with an RLE decoder, have the function read the Huffman table, and then take off the RLE decoder and process the rest of the stream.

Except that if the RLE decoding stream wrapper contains any internal buffering, then stripping it off would very likely result in data loss.

...which is why you design your RLE decoding wrapper to not buffer anything. ;-) You'd have more fun with bit-oriented things. But then, you'd design an object representing a stream of bits instead of a stream of bytes/characters/whatever, and it'd be fine.

...

What you actually have to do is have the RLE decoding stream wrapper build and return you a stream wrapper which delivers the remainder of the stream.

Which I think shows that the abstraction isn't leaky ... where "the remainder" starts is very much dependent on the precise encoding of the prefix of the stream.

It's just that at present, both the Huffman table load and save functions have no idea that they're operating through an RLE layer. Unfortunately, to be able to get the rest of the data, I have to expose to those functions the fact that there's this extra layer happening. (And then what if I want to use something else instead of RLE? What if I want to put *several* layers in? It all gets very messy...)

David Roundy

7:55 p.m.

On Fri, Jun 29, 2007 at 07:39:28PM +0100, Andrew Coppin wrote:

...

Now I have a problem. It's easy enough to pass the entire data stream through an RLE decoder and feed that to the Huffman table deserialize function, and it will give be back the table. But I now have *no clue* where the table ends in the original stream!

Sounds to me like you want a parsing monad. Generally, when you want state, you want a monad, and the field of parsing monads is pretty mature. You can either write up a monad of your own, or use one of the existing ones (parsec, frisby, read). -- David Roundy Department of Physics Oregon State University

Andrew Coppin

10:14 p.m.

David Roundy wrote:

...

On Fri, Jun 29, 2007 at 07:39:28PM +0100, Andrew Coppin wrote:

...
Now I have a problem. It's easy enough to pass the entire data stream through an RLE decoder and feed that to the Huffman table deserialize function, and it will give be back the table. But I now have *no clue* where the table ends in the original stream!

Sounds to me like you want a parsing monad. Generally, when you want state, you want a monad, and the field of parsing monads is pretty mature. You can either write up a monad of your own, or use one of the existing ones (parsec, frisby, read).

Perhaps. But how do you run a parser on top of another parser? More importantly, how do you stack several parsers one on top of the other, get the top-most one to return the thing it parsed, and then make a completely different stack of parsers process the remainder? I'm sure it can be done, but... I'm having trouble wrapping my mind around that at this time of night... :-S

apfelmus

30 Jun 30 Jun

8:42 a.m.

David Roundy wrote:

...

On Fri, Jun 29, 2007 at 07:39:28PM +0100, Andrew Coppin wrote:

...
Now I have a problem. It's easy enough to pass the entire data stream through an RLE decoder and feed that to the Huffman table deserialize function, and it will give be back the table. But I now have *no clue* where the table ends in the original stream!

Sounds to me like you want a parsing monad. Generally, when you want state, you want a monad, and the field of parsing monads is pretty mature. You can either write up a monad of your own, or use one of the existing ones (parsec, frisby, read).

Am I missing something or why wouldn't encode, decode :: String -> String encode = encodeRLE . encodeHuffman decode = decodeHuffman . decodeRLE do the job? This is probably what Andrew intends to do in his Java version. Note that this not only RLE-encodes the Huffman table but also (needlessly) the data stream. In case you only want to RLE the table, a simple Word32 field tracking the size of the Huffman table should be enough. Regards, apfelmus

Andrew Coppin

1 Jul 1 Jul

9:27 a.m.

apfelmus wrote:

...

Am I missing something or why wouldn't

encode, decode :: String -> String encode = encodeRLE . encodeHuffman decode = decodeHuffman . decodeRLE

do the job? This is probably what Andrew intends to do in his Java version. Note that this not only RLE-encodes the Huffman table but also (needlessly) the data stream. In case you only want to RLE the table, a simple Word32 field tracking the size of the Huffman table should be enough.

It is enough. But given that the whole purpose of compression algorithms is to squeeze data into the tiniest possible space, I wanted to avoid having a size field. And mathematically it's perfectly possible to do... I just can't find a convinient way to do it in Haskell. :-(

apfelmus

9:50 a.m.

Andrew Coppin wrote:

...

apfelmus wrote:

...
Am I missing something or why wouldn't

encode, decode :: String -> String encode = encodeRLE . encodeHuffman decode = decodeHuffman . decodeRLE

do the job? This is probably what Andrew intends to do in his Java version. Note that this not only RLE-encodes the Huffman table but also (needlessly) the data stream. In case you only want to RLE the table, a simple Word32 field tracking the size of the Huffman table should be enough.

It is enough. But given that the whole purpose of compression algorithms is to squeeze data into the tiniest possible space, I wanted to avoid having a size field. And mathematically it's perfectly possible to do... I just can't find a convinient way to do it in Haskell. :-(

Well, those 4 bytes won't kill you. But you can of course stop RLE-decoding as soon as this has read as many bytes as there are in the Huffman table. A systematic way to do this are parser combinators. Regards, apfelmus

Andrew Coppin

10:04 a.m.

apfelmus wrote:

...

Andrew Coppin wrote:

...
It is enough. But given that the whole purpose of compression algorithms is to squeeze data into the tiniest possible space, I wanted to avoid having a size field. And mathematically it's perfectly possible to do... I just can't find a convinient way to do it in Haskell. :-(

Well, those 4 bytes won't kill you. But you can of course stop RLE-decoding as soon as this has read as many bytes as there are in the Huffman table. A systematic way to do this are parser combinators.

Yeah... I'm fuzzy on how to do this. I can write parsers to do the various stages, and I can run one parser on top of another. But how to you swap whole "stacks" of parsers when the top-most one reaches a given stage?

Andrew Coppin

12:33 p.m.

OK, well I don't know the Parsec types and function names off the top of my head, but suppose I have the following: runParser :: Parser a b -> [a] -> Either ParseError b parseHuffmanTable :: [x] -> Parser Word8 (HuffmanTable x) parseHuffmanPayload :: HuffmanTable x -> Parser Word8 [x] parseRLE :: Parser Word8 [Word8] Now, if I want to just do without the RLE bit, I can do parseHuffman :: [x] -> Parser Word8 x parseHuffman xs = do table <- parseHuffmanTable xs parseHuffmanPayload table But if I want to add an RLE layer to just the Huffman table... erm... OK, I'm stuck now. :-S 1. How do I run the input through parseRLE and *then* through parseHuffmanTable? 2. How do I get parseHuffmanPayload to continue from where parseRLE left off? (How do I get parseRLE to not parse the entire input, for that matter...)

apfelmus

1:12 p.m.

Andrew Coppin wrote:

...

OK, well I don't know the Parsec types and function names off the top of my head, but suppose I have the following:

runParser :: Parser a b -> [a] -> Either ParseError b

parseHuffmanTable :: [x] -> Parser Word8 (HuffmanTable x)

parseHuffmanPayload :: HuffmanTable x -> Parser Word8 [x]

parseRLE :: Parser Word8 [Word8]

Now, if I want to just do without the RLE bit, I can do

parseHuffman :: [x] -> Parser Word8 x parseHuffman xs = do table <- parseHuffmanTable xs parseHuffmanPayload table

But if I want to add an RLE layer to just the Huffman table... erm... OK, I'm stuck now. :-S

1. How do I run the input through parseRLE and *then* through parseHuffmanTable?

2. How do I get parseHuffmanPayload to continue from where parseRLE left off? (How do I get parseRLE to not parse the entire input, for that matter...)

Well, there's no way to do that with a monolithic parseRLE since it will parse the input to the bitter end. But assuming that parseRLE = concat `liftM` many parseRLEBlock parseRLEBlock :: Parser Word8 [Word8] you can write an RLE parser that repeatedly parses RLE blocks until it has read equal or more than n Word8s parseRLEAmount n | n <= 0 = return [] | otherwise = do xs <- parseRLEBlock xss <- parseRLEAmount (n - length xs) return (xs ++ xss) To parse a huffman table, run the actual parser on the result of parseRLEAmount: parseHuffmanHeader = runParser parseHuffmanTable `liftM` parseRLEAmount (2^8) Regards, apfelmus

Andrew Coppin

1:52 p.m.

apfelmus wrote:

...

Andrew Coppin wrote:

...
OK, I'm stuck now. :-S

1. How do I run the input through parseRLE and *then* through parseHuffmanTable?

2. How do I get parseHuffmanPayload to continue from where parseRLE left off? (How do I get parseRLE to not parse the entire input, for that matter...)

Well, there's no way to do that with a monolithic parseRLE since it will parse the input to the bitter end. But assuming that

parseRLE = concat `liftM` many parseRLEBlock

parseRLEBlock :: Parser Word8 [Word8]

you can write an RLE parser that repeatedly parses RLE blocks until it has read equal or more than n Word8s

parseRLEAmount n | n <= 0 = return [] | otherwise = do xs <- parseRLEBlock xss <- parseRLEAmount (n - length xs) return (xs ++ xss)

To parse a huffman table, run the actual parser on the result of parseRLEAmount:

parseHuffmanHeader = runParser parseHuffmanTable `liftM` parseRLEAmount (2^8)

The use of lifeM is *inspired* - I would never have thought of writing it infix! Still, while this does solve the issue at hand, I'm still left wondering whether there's any way to make this work for some arbitrary parser. parseHuffmanTable always uses exactly N bytes of input, but what if it didn't? What if it was impossible to tell how many bytes it should read without actually running it? How would you implement that? Nice a since general solution exist?

Jon Cast

29 Jun 29 Jun

8:28 p.m.

On Friday 29 June 2007, Andrew Coppin wrote:

...

...and again today I found myself trying to do something that would be very easy in an imperative language, but I cannot think of a single good way of doing it in Haskell. Hopfully somebody can give me some hints.

<snip long and helpful explanation> Here's my solution (drawn from a library I'll be posting Real Soon Now): import Control.Monad import Control.Monad.Trans data SPMT iota omicron m alpha = ReturnSP alpha | LiftSP (m (SPMT iota omicron m alpha)) | GetSP (iota -> SPMT iota omicron m alpha)) | PutSP omicron (SPMT iota omicron m alpha) instance Monad m => Monad (SPMT iota omicron m) where return x = ReturnSP x ReturnSP x >>= f = f x LiftSP a >>= f = LiftSP (liftM (>>= f) a) GetSP a >>= f = GetSP (\ x -> a x >>= f) PutSP x a >>= f = PutSP x (a >>= f) instance MonadTrans (SPMT iota omicron) where lift a = LiftSP (liftM ReturnSP a) getSP :: SPMT iota omicron m iota getSP = GetSP ReturnSP putSP :: omicron -> SPMT iota omicron m () putSP x = PutSP x (ReturnSP ()) (^>^) :: Monad m => SPMT iota omicron m alpha -> SPMT omicron omicron' m beta -> SPMT iota omicron' m beta a ^>^ ReturnSP x = ReturnSP x a ^>^ LiftSP b = LiftSP (liftM (a ^>^) b) a ^>^ PutSP x b = PutSP x (a ^>^ b) LiftSP a ^>^ GetSP b = LiftSP (liftM (^>^ GetSP b) a) GetSP a ^>^ GetSP b = GetSP (\ x -> a x ^>^ GetSP b) PutSP x a ^>^ GetSP b = a ^>^ b x If the signature of SPMT suffices to write decodeRLE and decodeHeader, the task of applying RLE decoding just to the header can be implemented by using decodeRLE ^>^ decodeHeader in place of just decodeHeader. Extension to situations left un-implemented above I leave for your ingenuity and/or release of my library. HTH. Jonathan Cast http://sourceforge.net/projects/fid-core http://sourceforge.net/projects/fid-emacs

Jon Cast

10:07 p.m.

On Friday 29 June 2007, Jon Cast wrote:

...

Here's my solution (drawn from a library I'll be posting Real Soon Now): <snip solution>

I forgot to point out that this is 75-90% drawn from a library called Fudgets[1], which is probably the most extended practical meditation to date on programming with lazy streams in Haskell. Embedding that approach in a monadic interface seems to be my own idea, though. Jonathan Cast http://sourceforge.net/projects/fid-core http://sourceforge.net/projects/fid-emacs [1] http://www.md.chalmers.se/Cs/Research/Functional/Fudgets/

Josef Svenningsson

1 Jul 1 Jul

4:51 p.m.

On 6/30/07, Jon Cast wrote:

...

On Friday 29 June 2007, Jon Cast wrote:

...
Here's my solution (drawn from a library I'll be posting Real Soon Now): <snip solution>

I forgot to point out that this is 75-90% drawn from a library called Fudgets[1], which is probably the most extended practical meditation to date on programming with lazy streams in Haskell. Embedding that approach in a monadic interface seems to be my own idea, though.

Koen Claessen had the same idea. He used it for designing parsers. See: http://www.cs.chalmers.se/~koen/pubs/entry-jfp04-parser.html

...

Jonathan Cast http://sourceforge.net/projects/fid-core http://sourceforge.net/projects/fid-emacs

[1] http://www.md.chalmers.se/Cs/Research/Functional/Fudgets/

Cheers, Josef

Bulat Ziganshin

30 Jun 30 Jun

7:05 a.m.

Hello Andrew, Friday, June 29, 2007, 10:39:28 PM, you wrote:

...

I'm writing a whole bunch of data compression programs.

me too :) but i never used Haskell for compression itself, only for managing archives. fast compression routines are written in C++ http://www.haskell.org/bz -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin

7:48 a.m.

Bulat Ziganshin wrote:

...

Hello Andrew,

Friday, June 29, 2007, 10:39:28 PM, you wrote:

...
I'm writing a whole bunch of data compression programs.

me too :) but i never used Haskell for compression itself, only for managing archives. fast compression routines are written in C++

What, you're telling me that "fast" software cannot be written in Haskell? :-P Well anyway, speed is not my aim. My aim is to play with various textbook algorithms an examine how well they work for various types of data. As long as the code isn't *absurdly* slow that'll be just fine. (I forget who it was, but a while back somebody pointed out the wisdom of using Data.Map. It certainly makes the LZW implementation about 400% faster! To say nothing of Huffman...) BTW, how do the pros do Huffman coding? Presumably not by traversing trees of pointer-linked nodes to process things 1 bit at a time...

Bulat Ziganshin

6:04 p.m.

New subject: Re[2]: Abstraction leak

Hello Andrew, Saturday, June 30, 2007, 11:48:19 AM, you wrote:

...

...
me too :) but i never used Haskell for compression itself, only for managing archives. fast compression routines are written in C++

...

What, you're telling me that "fast" software cannot be written in Haskell? :-P

in my program, managing archives isn't time-critical part. compression requires 90-99% of total execution time

...

Well anyway, speed is not my aim. My aim is to play with various textbook algorithms an examine how well they work for various types of data. As long as the code isn't *absurdly* slow that'll be just fine.

yes, for studying algorithms Haskell is perfect tool

...

(I forget who it was, but a while back somebody pointed out the wisdom of using Data.Map. It certainly makes the LZW implementation about 400% faster! To say nothing of Huffman...)

it seems that you've done first implementation with lists or immutable arrays? :)

...

BTW, how do the pros do Huffman coding? Presumably not by traversing trees of pointer-linked nodes to process things 1 bit at a time...

encoding is simple - make this traversal one time and store bit sequence for each symbol. for decoding, you can find length of longest symbol MaxBits and build table of 2^MaxBits elements which allows to find next symbol by direct lookup with next input MaxBits. faster algorithms use two-level lookup, first lookup with 9-10 bits is best when your encoded symbols are bytes -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin

1 Jul 1 Jul

9:18 a.m.

Bulat Ziganshin wrote:

...

Hello Andrew,

Saturday, June 30, 2007, 11:48:19 AM, you wrote:

...
Well anyway, speed is not my aim. My aim is to play with various textbook algorithms an examine how well they work for various types of data. As long as the code isn't *absurdly* slow that'll be just fine.

yes, for studying algorithms Haskell is perfect tool

...
(I forget who it was, but a while back somebody pointed out the wisdom of using Data.Map. It certainly makes the LZW implementation about 400% faster! To say nothing of Huffman...)

it seems that you've done first implementation with lists or immutable arrays? :)

Lists. In fact, I *believe* the code is still on the Wiki somewhere...

...

...
BTW, how do the pros do Huffman coding? Presumably not by traversing trees of pointer-linked nodes to process things 1 bit at a time...

encoding is simple - make this traversal one time and store bit sequence for each symbol. for decoding, you can find length of longest symbol MaxBits and build table of 2^MaxBits elements which allows to find next symbol by direct lookup with next input MaxBits. faster algorithms use two-level lookup, first lookup with 9-10 bits is best when your encoded symbols are bytes

I see. So build a table of codes and bitmasks and test against that...

Bulat Ziganshin

2:18 p.m.

New subject: Re[2]: Abstraction leak

Hello Andrew, Sunday, July 1, 2007, 1:18:16 PM, you wrote:

...

...
encoding is simple - make this traversal one time and store bit sequence for each symbol. for decoding, you can find length of longest symbol MaxBits and build table of 2^MaxBits elements which allows to find next symbol by direct lookup with next input MaxBits. faster algorithms use two-level lookup, first lookup with 9-10 bits is best when your encoded symbols are bytes

...

I see. So build a table of codes and bitmasks and test against that...

decodeSymbol = do n <- returnNextNBits MaxBits -- this operation doesn't forward input pointer! symbol <- table1 ! n bits <- table2 ! symbol skipNBits bits return symbol -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin

2:48 p.m.

Bulat Ziganshin wrote:

...

Hello Andrew,

...
I see. So build a table of codes and bitmasks and test against that...

decodeSymbol = do n <- returnNextNBits MaxBits -- this operation doesn't forward input pointer! symbol <- table1 ! n bits <- table2 ! symbol skipNBits bits return symbol

I see. While we're on the subject... am I the first person to notice that Haskell doesn't appear to have much support for fiddling with streams of bits?

Jon Fairbairn

4:21 p.m.

Andrew Coppin writes:

...

While we're on the subject... am I the first person to notice that Haskell doesn't appear to have much support for fiddling with streams of bits?

No. Presumably the author of Data.Bits noticed some lack. (Note that Integer is an instance of Num and hence Bits) -- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk

Andrew Coppin

5:07 p.m.

Jon Fairbairn wrote:

...

Andrew Coppin writes:

...
While we're on the subject... am I the first person to notice that Haskell doesn't appear to have much support for fiddling with streams of bits?

No. Presumably the author of Data.Bits noticed some lack. (Note that Integer is an instance of Num and hence Bits)

Right. But (for instance) there is no library function anywhere to convert a Word16 into a [Bool] (never mind handling endian issues), or back again. There is no standard construct for reading from a [Word8] as if it's a [Bool], or for writing to a [Bool] and ending up with a [Word8]... ...Then again, there is no library function for reading from a file and getting a [Word8]. The only way I have found to do this is to open a file, put the handle into "binary mode", use hGetContents to read a [Char] from it, and then do a map (fromIntegral . fromEnum) on that. Writing binary data back to a file requires opening a handle, putting it into binary mode, taking your [Word8] and doing a map (toEnum . fromIntegral) over it, and then using putStr. All of which works because, apparently, in binary mode the file is interpreted as 8-bit ASCII. God forbit that Haskell ever decides to transparently support other encodings... I haven't actually tried, but presumably a TCP connection is represented in the same way as a file, and so has the same problems. Basically doing binary I/O seems to be one of those things that in Haskell falls into the class of "it's possibly but annoyingly messy"...

Philip Armstrong

4 Jul 4 Jul

10:57 a.m.

On Sun, Jul 01, 2007 at 06:07:13PM +0100, Andrew Coppin wrote:

...

I haven't actually tried, but presumably a TCP connection is represented in the same way as a file, and so has the same problems.

Basically doing binary I/O seems to be one of those things that in Haskell falls into the class of "it's possibly but annoyingly messy"...

In an ideal world there would be a 'deriving Serializable[1]' you could do on datatypes which would get this right. In a really ideal world, you could specify the data layout somehow[2][2a], which would make integrating Haskell code into a wider distributed network of processes exchanging binary data a cinch. In a super really ideal world, you could operate on the packets in place in Haskell where possible and save the deserialization overhead... Anyone trying to do any of this? Phil [1] deriving picklable? [2] DFDL does this for XML / binary data translation. [2a] Or even dump to arbitrary formats: XML, JSON for concrete datatypes[3], mabe use the approach from http://www.ps.uni-sb.de/Papers/abstracts/hotPickles2007.html (link stolen shamelessly from Lambda the Ultimate) for higher order data? [3] Maybe UBL from http://www.erlang.se/workshop/2002/Armstrong.pdf ? -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

dons＠cse.unsw.edu.au

11:02 a.m.

phil:

...

On Sun, Jul 01, 2007 at 06:07:13PM +0100, Andrew Coppin wrote:

...
I haven't actually tried, but presumably a TCP connection is represented in the same way as a file, and so has the same problems.

Basically doing binary I/O seems to be one of those things that in Haskell falls into the class of "it's possibly but annoyingly messy"...

In an ideal world there would be a 'deriving Serializable[1]' you

derive Binary (use an external tool for this)

...

could do on datatypes which would get this right. In a really ideal world, you could specify the data layout somehow[2][2a], which would

Directly in Haskell data type decls -- see the ICFP 05 paper on the House OS, which added packing annotations, and bit syntax. In current Haskell, you specify the layout in instances Storable or Binary.

...

make integrating Haskell code into a wider distributed network of processes exchanging binary data a cinch. In a super really ideal

Definitely. See things like the zlib or iconv Data.Binary/Data.ByteString bindings, for prototypes. The 'tar' reader/writer on hackage.haskell.org is also a good example.

...

world, you could operate on the packets in place in Haskell where possible and save the deserialization overhead...

Data.ByteString.* for this.

...

Anyone trying to do any of this?

Yeah, its a bit of a hot topic currently. :) Gotta chase Erlang for hacking network data. -- Don

Philip Armstrong

11:39 a.m.

On Wed, Jul 04, 2007 at 09:02:15PM +1000, Donald Bruce Stewart wrote:

...

phil:

...
On Sun, Jul 01, 2007 at 06:07:13PM +0100, Andrew Coppin wrote:

...
I haven't actually tried, but presumably a TCP connection is represented in the same way as a file, and so has the same problems.

Basically doing binary I/O seems to be one of those things that in Haskell falls into the class of "it's possibly but annoyingly messy"...

In an ideal world there would be a 'deriving Serializable[1]' you

derive Binary (use an external tool for this)

such as?

...

...
could do on datatypes which would get this right. In a really ideal world, you could specify the data layout somehow[2][2a], which would

Directly in Haskell data type decls -- see the ICFP 05 paper on the House OS, which added packing annotations, and bit syntax. In current Haskell, you specify the layout in instances Storable or Binary.

I'll have a look.

...

...
make integrating Haskell code into a wider distributed network of processes exchanging binary data a cinch. In a super really ideal

Definitely. See things like the zlib or iconv Data.Binary/Data.ByteString bindings, for prototypes. The 'tar' reader/writer on hackage.haskell.org is also a good example.

OK. Maybe this is the sort of stuff which ought to go into the new Haskell book? 'Integrating Haskell with external data sources' or something...

...

...
world, you could operate on the packets in place in Haskell where possible and save the deserialization overhead...

Data.ByteString.* for this.

...
Anyone trying to do any of this?

Yeah, its a bit of a hot topic currently. :)

...

Gotta chase Erlang for hacking network data.

Absolutely. Ta for the pointers anyway. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

dons＠cse.unsw.edu.au

11:44 a.m.

phil:

...

On Wed, Jul 04, 2007 at 09:02:15PM +1000, Donald Bruce Stewart wrote:

...
phil:

...
On Sun, Jul 01, 2007 at 06:07:13PM +0100, Andrew Coppin wrote:

...
I haven't actually tried, but presumably a TCP connection is represented in the same way as a file, and so has the same problems.

Basically doing binary I/O seems to be one of those things that in Haskell >falls into the class of "it's possibly but annoyingly messy"...

In an ideal world there would be a 'deriving Serializable[1]' you

derive Binary (use an external tool for this)

such as?

Binary instances are pretty easy to write. For a simple data type: > instance Binary Exp where > put (IntE i) = do put (0 :: Word8) > put i > put (OpE s e1 e2) = do put (1 :: Word8) > put s > put e1 > put e2 > get = do tag <- getWord8 > case tag of > 0 -> liftM IntE get > 1 -> liftM3 OpE get get get The Data.Binary comes with one tool to derive these. The DrIFT preprocessor also can, as can Stefan O'Rear's SYB deriver. I just write them by hand, or use the tool that comes with the lib. More docs here, http://hackage.haskell.org/packages/archive/binary/0.3/doc/html/Data-Binary....

...

...
...
could do on datatypes which would get this right. In a really ideal world, you could specify the data layout somehow[2][2a], which would

Directly in Haskell data type decls -- see the ICFP 05 paper on the House OS, which added packing annotations, and bit syntax. In current Haskell, you specify the layout in instances Storable or Binary.

I'll have a look.

...
...
make integrating Haskell code into a wider distributed network of processes exchanging binary data a cinch. In a super really ideal

Definitely. See things like the zlib or iconv Data.Binary/Data.ByteString bindings, for prototypes. The 'tar' reader/writer on hackage.haskell.org is also a good example.

OK. Maybe this is the sort of stuff which ought to go into the new Haskell book? 'Integrating Haskell with external data sources' or something...

Indeed :-)

...

...
...
world, you could operate on the packets in place in Haskell where possible and save the deserialization overhead...

Data.ByteString.* for this.

...
Anyone trying to do any of this?

Yeah, its a bit of a hot topic currently. :)

...
Gotta chase Erlang for hacking network data.

Absolutely. Ta for the pointers anyway.

Yeah, so: Data.ByteString Data.Bits Data.Binary Hack those bytes! Quickly! :-) -- Don

Philip Armstrong

1:50 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Wed, Jul 04, 2007 at 09:44:13PM +1000, Donald Bruce Stewart wrote:

...

Binary instances are pretty easy to write. For a simple data type:

> instance Binary Exp where > put (IntE i) = do put (0 :: Word8) > put i > put (OpE s e1 e2) = do put (1 :: Word8) > put s > put e1 > put e2

> get = do tag <- getWord8 > case tag of > 0 -> liftM IntE get > 1 -> liftM3 OpE get get get

That's quite verbose! Plus I'm a bit concerned by the boxing implied by those IntE / OpE constructors in get. If you were using those values in a pattern match on the result of get, would the compiler be able to eliminate them and refer directly to the values in the source data?

...

The Data.Binary comes with one tool to derive these. The DrIFT preprocessor also can, as can Stefan O'Rear's SYB deriver.

I just write them by hand, or use the tool that comes with the lib.

More docs here, http://hackage.haskell.org/packages/archive/binary/0.3/doc/html/Data-Binary....

This doesn't seem to deal with endianness. Am I missing something?

...

...
...
...
world, you could operate on the packets in place in Haskell where possible and save the deserialization overhead...

Data.ByteString.* for this.

Ah, does Data.Binary fuse with ByteString.* then?

...

Hack those bytes! Quickly! :-)

:) It's a shame the layout definition is so verbose. Erlang's is quite compact. I wonder if something could be done with template haskell to translate an Erlang-style data layout definition to the Data.Binary form? (Bonus points for being able to parse ASN.1 and generate appropriate Haskell datatypes & serialization primitives automatically :-) ) Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Bulat Ziganshin

2:52 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

Hello Philip, Wednesday, July 4, 2007, 5:50:42 PM, you wrote:

...

This doesn't seem to deal with endianness. Am I missing something?

alternative: http://haskell.org/haskellwiki/Library/AltBinary http://haskell.org/haskellwiki/Library/Streams -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Philip Armstrong

3:31 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Wed, Jul 04, 2007 at 06:52:08PM +0400, Bulat Ziganshin wrote:

...

Hello Philip,

Wednesday, July 4, 2007, 5:50:42 PM, you wrote:

...
This doesn't seem to deal with endianness. Am I missing something?

alternative: http://haskell.org/haskellwiki/Library/AltBinary http://haskell.org/haskellwiki/Library/Streams

Nice: bit aligning if you want it, little or big endian IO. Intermixed endianness in the same datastream even[1]. However: 3.2.10 Defining Binary instances for custom serialization formats (unwritten) Does that mean that the code is unwritten or that the documentation is unwritten. IAMFI :) There seems to be some overlap between Streams and ByteStrings: Could a Stream built on a ByteString backend benefit from all the fusion work that's been put into ByteStrings recently? Oh wait, I see you list that as 'future work' on the wiki page... Phil [1] Which sick application *needs* intermixed endianness? -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Bulat Ziganshin

5:15 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

Hello Philip, Wednesday, July 4, 2007, 7:31:56 PM, you wrote:

...

On Wed, Jul 04, 2007 at 06:52:08PM +0400, Bulat Ziganshin wrote:

...
Hello Philip,

Wednesday, July 4, 2007, 5:50:42 PM, you wrote:

...
This doesn't seem to deal with endianness. Am I missing something?

alternative: http://haskell.org/haskellwiki/Library/AltBinary http://haskell.org/haskellwiki/Library/Streams

...

Nice: bit aligning if you want it, little or big endian IO. Intermixed endianness in the same datastream even[1]. However:

...

3.2.10 Defining Binary instances for custom serialization formats (unwritten)

...

Does that mean that the code is unwritten or that the documentation is unwritten. IAMFI :)

of course all "unwritten" notes means unfinished docs. library contains more than 100 functions so it was not easy to document them all. you can browse sources, although probably it will not help too much in particular interest for you should be the following: - Template Haskell can used to automatically derive new Binary instances for specifying width of fields you may use int/word with specific width, for example: data Header = H Int32 Word16 Word8 $(deriveBinary ``Header) (see DeriveBinary.hs example) if you need to define instances manually, ask me

...

There seems to be some overlap between Streams and ByteStrings: Could a Stream built on a ByteString backend benefit from all the fusion work that's been put into ByteStrings recently? Oh wait, I see you list that as 'future work' on the wiki page...

if you will write all popular words together, this probably will be just a set of popular words, not something working :) how fusion should work together with serialization?

...

[1] Which sick application *needs* intermixed endianness?

i just tried to implement everything possible :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Philip Armstrong

5:41 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Wed, Jul 04, 2007 at 09:15:59PM +0400, Bulat Ziganshin wrote:

...

...
Does that mean that the code is unwritten or that the documentation is unwritten. IAMFI :)

of course all "unwritten" notes means unfinished docs. library contains more than 100 functions so it was not easy to document them all. you can browse sources, although probably it will not help too much

OK.

...

...
There seems to be some overlap between Streams and ByteStrings: Could a Stream built on a ByteString backend benefit from all the fusion work that's been put into ByteStrings recently? Oh wait, I see you list that as 'future work' on the wiki page...

if you will write all popular words together, this probably will be just a set of popular words, not something working :) how fusion should work together with serialization?

I'm thinking of the elimination of the boxing of values drawn out of the input stream where possible, eg if I was writing a stream processor that folded across the values in the input stream, it would (presumably) be more efficient if the compiler noticed that the function in question was (say) just reading Int values at offsets within the stream, and could pass those as unboxed references in the compiled code rather than freshly constructed values. Fusion might be the wrong term: I was thinking by analogy with loop fusion, with one of the loops was the 'data reading' loop. Does that make sense?

...

...
[1] Which sick application *needs* intermixed endianness?

i just tried to implement everything possible :)

Completeness is always good! Thanks for the pointers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Bulat Ziganshin

7:49 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

Hello Philip, Wednesday, July 4, 2007, 9:41:27 PM, you wrote:

...

I'm thinking of the elimination of the boxing of values drawn out of the input stream where possible, eg if I was writing a stream processor that folded across the values in the input stream, it would (presumably) be more efficient if the compiler noticed that the function in question was (say) just reading Int values at offsets within the stream, and could pass those as unboxed references in the compiled code rather than freshly constructed values.

it will depend on your code. the library doesn't make unnecessary boxing, but (unlike Data.Binary?) it supports only monadic (de)serialization. so there is no room for ByteString-like fusion which pass unboxed data through several transformations. with my lib, you can only read whole unboxed structure and then process it: data T = C !Int32 !Word16 do x <- get h :: IO T processT $! x -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin

6:36 p.m.

New subject: Binary serialization, was Re: Abstraction leak

Philip Armstrong wrote:

...

[1] Which sick application *needs* intermixed endianness?

*Clearly* you've never been to Singapore... ...er, I mean, "Ever tried playing with networking protocol stacks?"

Philip Armstrong

9:41 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Wed, Jul 04, 2007 at 07:36:11PM +0100, Andrew Coppin wrote:

...

Philip Armstrong wrote:

...
[1] Which sick application *needs* intermixed endianness?

*Clearly* you've never been to Singapore...

...er, I mean, "Ever tried playing with networking protocol stacks?"

No (thankfully?). Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Stefan O'Rear

5:25 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Wed, Jul 04, 2007 at 02:50:42PM +0100, Philip Armstrong wrote:

...

...
The Data.Binary comes with one tool to derive these. The DrIFT preprocessor also can, as can Stefan O'Rear's SYB deriver.

I just write them by hand, or use the tool that comes with the lib.

More docs here,

http://hackage.haskell.org/packages/archive/binary/0.3/doc/html/Data-Binary....

This doesn't seem to deal with endianness. Am I missing something?

The Data.Binary high level interface standardizes on 64-bit big endian. The low level interface allows you to choose it yourself. Stefan

dons＠cse.unsw.edu.au

10:50 p.m.

New subject: Binary serialization, was Re: Abstraction leak

phil:

...

On Wed, Jul 04, 2007 at 09:44:13PM +1000, Donald Bruce Stewart wrote:

...
Binary instances are pretty easy to write. For a simple data type:

> instance Binary Exp where > put (IntE i) = do put (0 :: Word8) > put i > put (OpE s e1 e2) = do put (1 :: Word8) > put s > put e1 > put e2

> get = do tag <- getWord8 > case tag of > 0 -> liftM IntE get > 1 -> liftM3 OpE get get get

That's quite verbose! Plus I'm a bit concerned by the boxing implied by those IntE / OpE constructors in get. If you were using those values in a pattern match on the result of get, would the compiler be able to eliminate them and refer directly to the values in the source data?

Well, here's you're flattening a Haskell structure, so it has to get reboxed. If it was bytestring chunks, or Ints, then you can avoid any serious copying. The 'get' just tags a value.

...

...
The Data.Binary comes with one tool to derive these. The DrIFT preprocessor also can, as can Stefan O'Rear's SYB deriver.

I just write them by hand, or use the tool that comes with the lib.

More docs here, http://hackage.haskell.org/packages/archive/binary/0.3/doc/html/Data-Binary....

This doesn't seem to deal with endianness. Am I missing something?

That's the Haskell serialisation layer. Look at Data.Binary.Get/Put for endian-primitives, to be used instead of 'get'. i.e. getWord16be

...

...
...
...
...
world, you could operate on the packets in place in Haskell where possible and save the deserialization overhead...

Data.ByteString.* for this.

Ah, does Data.Binary fuse with ByteString.* then?

They know about each other, and Binary avoids copying if you're reading ByteStrings.

...

...
Hack those bytes! Quickly! :-)

:)

It's a shame the layout definition is so verbose. Erlang's is quite compact. I wonder if something could be done with template haskell to translate an Erlang-style data layout definition to the Data.Binary form?

Right, simple but a bit verbose. The Erlang bit syntax is a nice pattern matching/layout syntax for bit/byte data. There's a couple of ports of this to Haskell -- one using pattern guards, another using Template Haskell. Look on hackage.haskell.org for bitsyntax if you're interested.

...

(Bonus points for being able to parse ASN.1 and generate appropriate Haskell datatypes & serialization primitives automatically :-) )

I think there's at least an ASN.1 definition in the crypto library. Dominic might be able to enlighten us on that. -- Don

Philip Armstrong

5 Jul 5 Jul

7:12 a.m.

New subject: Binary serialization, was Re: Abstraction leak

On Thu, Jul 05, 2007 at 08:50:42AM +1000, Donald Bruce Stewart wrote: [useful stuff] So, in fact pretty much everything I was looking for exists, in some form or other! It's just a bit hard to find at the moment, perhaps because none of this stuff is regarded as 'core Haskell' by any of the tutorials, books etc etc. I'll have a play with some of the libraries mentioned upthread anyway. Thanks everyone. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Thomas Conway

7:55 a.m.

New subject: Binary serialization, was Re: Abstraction leak

I was explaining Haskell to a perl/python hacking friend recently and characterized things thus: Perl is a horrible language with fantastic libraries. Haskell is a fantastic language with horrible libraries. Actually, many of the libraries that exist for Haskell *are* fantastic, it's just that Haskell lacks the *coverage* that Perl or Python have. cheers, T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

dons＠cse.unsw.edu.au

8:01 a.m.

New subject: Write a library today! Was: Binary serialization, was Re: Abstraction leakAKa

drtomc:

...

I was explaining Haskell to a perl/python hacking friend recently and characterized things thus:

Perl is a horrible language with fantastic libraries. Haskell is a fantastic language with horrible libraries.

Actually, many of the libraries that exist for Haskell *are* fantastic, it's just that Haskell lacks the *coverage* that Perl or Python have.

Yes, and we know exactly what to do about this. hackage.haskell.org is growing by a few packages a week -- and anyone who binds to any C lib should just upload their stuff. So ... if you're reading this message -- please upload a library today, and more than just a few of us might have nice Haskell jobs tomorrow! -- Don P.S. Maybe we should run 'bindathons' or have CPAN-style contests to get new libraries written? If you're bored, write a binding to some C lib, that python, ruby or perl have already got a binding to. Do it!

Jonathan Cast

2:31 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Thursday 05 July 2007, Thomas Conway wrote:

...

I was explaining Haskell to a perl/python hacking friend recently and characterized things thus:

Perl is a horrible language with fantastic libraries. Haskell is a fantastic language with horrible libraries.

Actually, many of the libraries that exist for Haskell *are* fantastic, it's just that Haskell lacks the *coverage* that Perl or Python have.

Can't say I agree. I've been learning Python, and have been very un-impressed so far with its library coverage, which I would rate no better than (in terms of the POSIX bindings, worse than) Haskell. The one thing off the top of my head that Python had was Base64, but that's 20 lines of Haskell tops. Aside from that, nothing. Jonathan Cast http://sourceforge.net/projects/fid-core http://sourceforge.net/projects/fid-emacs

Paul Moore

3 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On 05/07/07, Jonathan Cast wrote:

...

Can't say I agree. I've been learning Python, and have been very un-impressed so far with its library coverage, which I would rate no better than (in terms of the POSIX bindings, worse than) Haskell.

It probably depends on your perspective. I've found lots of tasks that would be a simple library call in Python, but which require me to write the code myself in Haskell. Examples: * Send an email * Parse an ini file * Gzip compress a data stream * Calculate the MD5 checksum of a file (Of course, I may just not have found the relevant library - that says something about discoverability rather than coverage, I guess). For bindings, Python's Windows bindings (pywin32) are superb, where Haskell's are minimal and unmaintained. Of course, that won't matter to you if you use POSIX...

...

The one thing off the top of my head that Python had was Base64, but that's 20 lines of Haskell tops. Aside from that, nothing.

But that's 20 lines of code I don't want to write, and more, I don't know how to write (without looking up the definition of Base64). Having lots of these seemingly trivial helpers available "out of the box" is what library coverage means to me. (And Python does have lots of these - I don't know how Haskell fares in practice). I'm not trying to start (or fan) a flamewar, but it's interesting how different people's perspectives on libraries can be... Paul.

Bulat Ziganshin

3:38 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

Hello Paul, Thursday, July 5, 2007, 7:00:46 PM, you wrote:

...

* Gzip compress a data stream

zlib

...

* Send an email * Parse an ini file

...
The one thing off the top of my head that Python had was Base64, but that's 20

MissingH

...

* Calculate the MD5 checksum of a file

crypto -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Paul Moore

4:07 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On 05/07/07, Bulat Ziganshin wrote:

...

...
* Gzip compress a data stream zlib

...
* Send an email * Parse an ini file

...
The one thing off the top of my head that Python had was Base64, but that's MissingH

...
* Calculate the MD5 checksum of a file crypto

Thanks. The need I had for these is no longer current, but sometime I'll try an experiment and see how easy it is, on a relatively clean Windows box with just GHC installed, to grab and use these libraries. (Side note: with Python, I'm used to 3rd party modules being available as Windows installer packages - does the concept of an installable binary for something like MissingH, which I can just install and use, make sense for a compiled language like Haskell? The build, find I'm missing a C library, get it, compile it, fix bugs, try again cycle I used to hit before Python Windows installers became common isn't something I'd like to repeat...) As I mentioned in passing, it may well be that the library issue with Haskell is more of a perception (or maybe organisational) issue than a technical one... Paul.

Duncan Coutts

4:31 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On Thu, 2007-07-05 at 17:07 +0100, Paul Moore wrote:

...

On 05/07/07, Bulat Ziganshin wrote:

...
...
* Gzip compress a data stream zlib

...
* Send an email * Parse an ini file

...
The one thing off the top of my head that Python had was Base64, but that's MissingH

...
* Calculate the MD5 checksum of a file crypto

Thanks.

The need I had for these is no longer current, but sometime I'll try an experiment and see how easy it is, on a relatively clean Windows box with just GHC installed, to grab and use these libraries.

Just to warn you, a lot of haskell packages that bind to C libs are much harder to get working on Windows, the zlib package for example. This is because on all other platforms zlib comes with the system and is installed in a location where any application can link to it. On Windows there is no equivalent of /usr/lib you cannot easily install a C lib somewhere that it can be used by any .exe on the system. To make things easier yuo could avoid using -fvia-C and then at least the zlib header files would not need to be installed, but to run a program that uses the zlib package you'd still have to copy the zlib.dll into the same dir as your .exe. There is a mechanism in newer versions of Windows that allows installing .dlls systemwide where any .exe can use them, however ghc and the gcc toolchain do not support them yet. It requires embeding xml manifests into the .dll and .exe files and you have to be the admin user to install one of these systemwide .dll things. It's all a bit of a pain. Duncan

Bulat Ziganshin

4:32 p.m.

New subject: Re[4]: Binary serialization, was Re: Abstraction leak

Hello Paul, Thursday, July 5, 2007, 8:07:34 PM, you wrote:

...

note: with Python, I'm used to 3rd party modules being available as Windows installer packages - does the concept of an installable binary for something like MissingH, which I can just install and use, make sense for a compiled language like Haskell? The build, find I'm

all cabalized libraries are installed using the same commands: runghc setup.hs configure runghc setup.hs build runghc setup.hs install after this, you just "import" modules you need in your program and when you build your program, ghc automatically links in all the library functions you used

...

As I mentioned in passing, it may well be that the library issue with Haskell is more of a perception (or maybe organisational) issue than a technical one...

the problem exists and for me, it's most important Haskell insfrastructure problem. but things you are mentioned are already implemented i personally implemented several libs required for my own program; and still miss ease-to-use and maintained GUI library -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Paul Moore

4:39 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On 05/07/07, Paul Moore wrote:

...

The need I had for these is no longer current, but sometime I'll try an experiment and see how easy it is, on a relatively clean Windows box with just GHC installed, to grab and use these libraries.

Just for fun I had a go with crypto: - Found crypto 3.0.3 on hackage. - Tried to build, it depends on NewBinary - Found that on hackage, downloaded and built OK. Lots of scary warnings about happy, greencard etc, not being found during configure, but let's go on. - Installed NewBinary as I didn't know how to make crypto find it without installing it. I'm a bit nervous, as I don't know how to *un*install it after I've finished. And it installed to my C drive, where I'd really rather it went somewhere else. There is probably documentation on how to do this, but remember, all I really want to do is to write a tiny program to get the MD5 checksum of a file. Ah, well. Carry on. - crypto builds and installs OK. But where are the docs? Not installed anywhere obvious, not on hackage. Try google. - Found the crypto website, but aargh! It looks like 3.0.3 is out of date and there's a 4.x available. Never mind, that's not likely to have changed. - But no simple examples, and the haddoc docs show APIs, but not usage examples! Going for the obvious approach: import System.IO import Data.Digest.MD5 main = do h <- openBinaryFile "md5.hs" ReadMode s <- hGetContents h hClose h md5 <- hash s No surprise, this doesn't work. After all, hash wants [Octet], not String. OK, I know this is now getting beyond library availability. But it doesn't compare well to Python's "I want an md5 checksum of a file - check the docs, there's a library function built in, use it, no problems". I see you've already responded, and we're in broad agreement. So I won't labour the point. It's an infrastructure issue rather than a technical one, and it *will* improve. What will be interesting is how much the generally lousy Windows experience can be improved - as Duncan points out, installing development libraries on Windows, whether Haskell or C, is a hugely irritating pain. If no-one has found a good answer for C in all these years, it would be great if Haskell could even do slightly better :-) Regards, Paul.

Neil Mitchell

4:51 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

Hi It's not a great experience now, but hopefully things are moving in the right direction.

...

- Found crypto 3.0.3 on hackage. - Tried to build, it depends on NewBinary

Cabal-install is intended to remove this problem, so that you can say "i want crypto" and it gets everything that requires.

...

- Found that on hackage, downloaded and built OK. Lots of scary warnings about happy, greencard etc, not being found during configure, but let's go on.

I've complained about these before, although I don't think anyone considered doing anything about it.

...

- Installed NewBinary as I didn't know how to make crypto find it without installing it. I'm a bit nervous, as I don't know how to *un*install it after I've finished. And it installed to my C drive, where I'd really rather it went somewhere else. There is probably documentation on how to do this, but remember, all I really want to do is to write a tiny program to get the MD5 checksum of a file.

--prefix should put it where you want. What you really want is a Windows user interface, which is what I want too.

...

- crypto builds and installs OK. But where are the docs? Not installed anywhere obvious, not on hackage. Try google.

You can install them with runhaskell Setup haddock, but I have no idea where they end up. They will be on hackage at some point, with cross-indexing etc.

...

- But no simple examples, and the haddoc docs show APIs, but not usage examples!

Complain to the author. I always try and include a manual with at least one short example of how to use the library to do something interesting. Unfortunately all these things take time, something that not everyone has.

...

I see you've already responded, and we're in broad agreement. So I won't labour the point. It's an infrastructure issue rather than a technical one, and it *will* improve. What will be interesting is how much the generally lousy Windows experience can be improved

Part of the problem is that the number of Windows Haskell developers is low. Another part of the problem is that some people have scorn and contempt for Windows. Alas, those things are hard to change. Thanks Neil

Paul Moore

4:58 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On 05/07/07, Neil Mitchell wrote:

...

...
- But no simple examples, and the haddoc docs show APIs, but not usage examples!

Complain to the author.

Yes, that's completely unrelated to library availability issues. I got off the topic, in all my ranting. Sorry.

...

Part of the problem is that the number of Windows Haskell developers is low. Another part of the problem is that some people have scorn and contempt for Windows. Alas, those things are hard to change.

Agreed. I'm extremely grateful for all the contributions people *do* make, even where they don't fit my environment as well as I might like. Anyway, I'm running out of time for today, so I'll leave this thread now. Thanks to all for some interesting comments and pointers! Paul.

Duncan Coutts

5:08 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On Thu, 2007-07-05 at 17:51 +0100, Neil Mitchell wrote:

...

Hi

It's not a great experience now, but hopefully things are moving in the right direction.

...
- Found crypto 3.0.3 on hackage. - Tried to build, it depends on NewBinary

Cabal-install is intended to remove this problem, so that you can say "i want crypto" and it gets everything that requires.

...
- Found that on hackage, downloaded and built OK. Lots of scary warnings about happy, greencard etc, not being found during configure, but let's go on.

I've complained about these before, although I don't think anyone considered doing anything about it.

We know what needs to change, but it's not a trivial change. The problem is that currently .cabal files to not specify what build tools they need, so Cabal has to look for all the possible tools it knows about before it finds out if any of them will be needed. This is because at the moment it has to check for these tools in the configure step, but currently it only finds out if it needs the tools in the build phase. The right thing to do is to have proper dep resolution that works out what tools are needed and it should tell the developer to record this in the .cabal file in a new build tools dependency field. They do have to be recorded in the .cabal file because it's impossible for the dep resolution to discover the build tools required in the configure step without actually running some of those build tools, which obviously should not happen in the configure step. So the best we can do is tell developers when they've missed a tool. Then once they're recorded in the .cabal file it'll be easy to avoid looking for and warning about build tools that are not required. Duncan

Ian Lynagh

6:09 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On Thu, Jul 05, 2007 at 06:08:45PM +0100, Duncan Coutts wrote:

...

On Thu, 2007-07-05 at 17:51 +0100, Neil Mitchell wrote:

...
...
- Found that on hackage, downloaded and built OK. Lots of scary warnings about happy, greencard etc, not being found during configure, but let's go on.

I've complained about these before, although I don't think anyone considered doing anything about it.

We know what needs to change, but it's not a trivial change.

If anyone's interested, this is the Cabal bug for it: http://hackage.haskell.org/trac/hackage/ticket/132 Thanks Ian

Ketil Malde

6 Jul 6 Jul

6:26 a.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On Thu, 2007-07-05 at 18:08 +0100, Duncan Coutts wrote:

...

...
...
- Found that on hackage, downloaded and built OK. Lots of scary warnings about happy, greencard etc, not being found during configure, but let's go on.

...

...
I've complained about these before, although I don't think anyone considered doing anything about it.

...

We know what needs to change, but it's not a trivial change.

...

The problem is that currently .cabal files to not specify what build tools they need,

A possible stop-gap might be to only report the build tools that were actually found? The way it is now, it looks too much like an error message. -k

Duncan Coutts

4:10 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On Fri, 2007-07-06 at 08:26 +0200, Ketil Malde wrote:

...

On Thu, 2007-07-05 at 18:08 +0100, Duncan Coutts wrote:

...
...
...
- Found that on hackage, downloaded and built OK. Lots of scary warnings about happy, greencard etc, not being found during configure, but let's go on.

...
...
I've complained about these before, although I don't think anyone considered doing anything about it.

...
We know what needs to change, but it's not a trivial change.

...
The problem is that currently .cabal files to not specify what build tools they need,

A possible stop-gap might be to only report the build tools that were actually found? The way it is now, it looks too much like an error message.

Mmm, good point. Perhaps we could just change the verbosity of that message for now, so that it doesn't appear at the default verbosity level. Duncan

Duncan Coutts

5 Jul 5 Jul

4:59 p.m.

New subject: Re[2]: Binary serialization, was Re: Abstraction leak

On Thu, 2007-07-05 at 17:39 +0100, Paul Moore wrote:

...

I see you've already responded, and we're in broad agreement. So I won't labour the point. It's an infrastructure issue rather than a technical one, and it *will* improve. What will be interesting is how much the generally lousy Windows experience can be improved -

We have this slightly odd problem where half our user base use Windows (according to the GHC user survey) but almost every active developer uses Linux or OSX (or a few other BSD/Unix OSs). So we could enormously improve the Windows user experience (and a few heroes work hard on doing just that) but basically there just aren't enough developers who use windows to give it a satisfactory level of support. This is of course a slightly circular problem, since using/developing Haskell on Windows is a pain, developers avoids it and so there are not enough developers irritated by how difficult it is to motivate developers to fix it! Duncan

Dave Bayer

4:41 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Jul 5, 2007, at 8:00 AM, Paul Moore wrote:

...

It probably depends on your perspective. I've found lots of tasks that would be a simple library call in Python, but which require me to write the code myself in Haskell. Examples:

* Calculate the MD5 checksum of a file

How's this, only one line is specific to your problem:

...

import System.Process import IO

doShell :: String -> IO String doShell cmd = do (_,out,_,_) <- runInteractiveCommand cmd hGetContents out

main :: IO () main = do md5 <- doShell "md5 -q md5.hs" putStrLn md5

It's not like you'll be kicked out of the tree house for leaving the Haskell world to get things done. For example, ghostscript and pdf2ps are well-supported open source tools for converting PS to PDF, that can be called from most languages. What's the deal with everyone rewriting PDF handling in their pet language, when it's so much easier to generate Postscript? I'd call that Balkanization; if I were managing a software group, I'd never let that happen. The true problem isn't adequate libraries in each language, it's interoperability so great open-source tools can get written once and then be supported by a cast of thousands. There are people who claim with a straight face that they migrated to OS X primarily to use TextMate http://www.textmate.com which is a GUI editor getting Emacs-like buzz, making Emacs seem by comparison like your grandfather's razor. It's as much a text-based operating system as an editor, and the whole thing is glued together with hundreds of snippets of code one can hack, written in every scripting language imaginable. Polyglots feel right at home...

Paul Moore

4:52 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On 05/07/07, Dave Bayer wrote:

...

How's this, only one line is specific to your problem: [...]

...
md5 <- doShell "md5 -q md5.hs"

Doesn't work on my (Windows) PC, where I have no md5 command available. While I agree in theory with the idea of combining focused tools, it's a potential portability nightmare, particularly on Windows where decent command line tools are almost non-existent on a clean install. You're changing the problem from finding a Haskell library (which only needs to be installed on the development machine at compile time) to finding a 3rd party utility, which has to be installed at runtime on any machine using the compiled Haskell program. Not a good trade-off. And I'm not going to get into Windows/Linux arguments... :-) Paul.

Dave Bayer

7:59 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Jul 5, 2007, at 9:52 AM, Paul Moore wrote:

...

You're changing the problem from finding a Haskell library (which only needs to be installed on the development machine at compile time) to finding a 3rd party utility, which has to be installed at runtime ... Not a good trade-off.

The intersection of Linux and Mac OS X is a pretty amazing standard library, that beats any single scripting language. I'd forgotten how dismal Windows is, sorry. Still, if you stick to non-GPL'd licenses, there's no way to build single file deliverables? I'd think someone would have written "Unix as a static library", the way e.g. many languages can be embedded in apps. Then only you would have to maintain the Unix tools you want to use, and you'd be done. If no one has, someone who cares about Windows should. Unix rocks. On Jul 5, 2007, at 9:54 AM, Philip Armstrong wrote:

...

Presumably you mean http://macromates.com/ ?

Yup. Sorry.

Philip Armstrong

4:54 p.m.

New subject: Binary serialization, was Re: Abstraction leak

On Thu, Jul 05, 2007 at 09:41:23AM -0700, Dave Bayer wrote:

...

There are people who claim with a straight face that they migrated to OS X primarily to use TextMate

http://www.textmate.com

Presumably you mean http://macromates.com/ ? Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

dons＠cse.unsw.edu.au

11:41 p.m.

New subject: Binary serialization, was Re: Abstraction leak

p.f.moore:

...

On 05/07/07, Jonathan Cast wrote:

...
Can't say I agree. I've been learning Python, and have been very un-impressed so far with its library coverage, which I would rate no better than (in terms of the POSIX bindings, worse than) Haskell.

It probably depends on your perspective. I've found lots of tasks that would be a simple library call in Python, but which require me to write the code myself in Haskell. Examples:

* Send an email

Sounds like a job for MissingH?

...

* Parse an ini file

Probably have to write your own Parsec-based parser here.

...

* Gzip compress a data stream

We have a wonderful library for this! http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.3

...

* Calculate the MD5 checksum of a file

In the Crypto library, or use the openssl binding posted a couple of days ago: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/24165/focus=24170

...

(Of course, I may just not have found the relevant library - that says something about discoverability rather than coverage, I guess).

Find more libraries on hackage: http://hackage.haskell.org/packages/archive/pkg-list.html

...

For bindings, Python's Windows bindings (pywin32) are superb, where Haskell's are minimal and unmaintained. Of course, that won't matter to you if you use POSIX...

...
The one thing off the top of my head that Python had was Base64, but that's 20 lines of Haskell tops. Aside from that, nothing.

But that's 20 lines of code I don't want to write, and more, I don't know how to write (without looking up the definition of Base64). Having lots of these seemingly trivial helpers available "out of the box" is what library coverage means to me. (And Python does have lots of these - I don't know how Haskell fares in practice).

I'm not trying to start (or fan) a flamewar, but it's interesting how different people's perspectives on libraries can be...

-- Don

Dominic Steinitz

8:41 p.m.

New subject: Binary serialization, was Re: Abstraction leak

...

...
(Bonus points for being able to parse ASN.1 and generate appropriate Haskell datatypes & serialization primitives automatically :-) )

I think there's at least an ASN.1 definition in the crypto library. Dominic might be able to enlighten us on that.

No bonus points I'm afraid. There is an embryonic ASN.1 library here http://hackage.haskell.org/cgi-bin/hackage-scripts/package/ASN1-0.0.1 You can take an ASN.1 definition such as FooBar {1 2 0 0 6 1} DEFINITIONS ::= BEGIN Journey ::= SEQUENCE { origin IA5String, stop1 [0] IA5String OPTIONAL, stop2 [1] IA5String OPTIONAL, destination IA5String } Odyssey ::= SEQUENCE { start Journey, trip1 [0] Journey OPTIONAL, trip2 [1] Journey OPTIONAL, trip3 [2] Journey OPTIONAL, end Journey } END And then create abstract Haskell representations of the ASN.1 types journey = "Journey" ::= AbsSeq Universal 16 Implicit [ Regular (Just "origin" :>: (Nothing :@: absIA5String)), Optional (Just "stop1" :>: (Just 0 :@: absIA5String)), Optional (Just "stop2" :>: (Just 1 :@: absIA5String)), Regular (Just "destination" :>: (Nothing :@: absIA5String)) ] odyssey = "Odyssey" ::= AbsSeq Universal 16 Implicit [ Regular (Just "start" :>: (Nothing :@: journey)), Optional (Just "trip1" :>: (Just 0 :@: journey)), Optional (Just "trip2" :>: (Just 1 :@: journey)), Optional (Just "trip3" :>: (Just 2 :@: journey)), Regular (Just "end" :>: (Nothing :@: journey)) ] The library then allows you to decode BER representations of these types. It's good enough to decode X.509 identity and attribute certificates. There's no encoding routines currently as I didn't need them. I'll try and make some documentation more easily available if I get time at the weekend. I'm working on PER at the moment both encoding and decoding using GADTs. I will extend it at some point for BER but that won't be for some time. I thought I read that someone was working on parsing ASN.1 so I'll try and follow that up (again it will have to be at the weekend). Dominic.

Thomas Conway

4 Jul 4 Jul

11:25 a.m.

...

Anyone trying to do any of this?

I've done some work in this area. I'm particularly interested in manipulating ASN.1 in haskell. Actually, my first use of Parsec was an ASN.1 parser. I'd done one previously in Spirit (the Boost C++ rip-off of parsec), but semantic actions were horrible in the extreme. Mmmm Parsec. In the indexing system I'm currently building in Haskell for my day job, I'm serializing several data structures, and using Data.Bits and Data.ByteString heavily. I was using HaXml, but I found it was very slow. So instead, I'm using an internal (within the indexing system) representation that is more akin to WBXML: import Data.ByteString as ByteString import Data.List as List import Data.Sequence as Seq data DocTree = DocElem ByteString [(ByteString,ByteString)] [DocTree] | DocText ByteString serialize tree = ByteString.concat $ Seq.toList $ execState (serialize' tree) Seq.empty serialize' (DocText txt) = do stuff <- get put (stuff |> pack [0]) putStr txt serialize' (DocElem name attrs kids) = do stuff <- get put (stuff |> pack [1]) putStr name putNum (List.length attrs) mapM_ (putPair putStr putStr) attrs putNum (List.length kids) mapM_ serialize' kids putStr .... You get the idea. Actually, the *real* code is trickier - it grovels first to find all the element names and numbers them. Likewise with attribute names (per element). The extra grovel is well worth it - it takes a little longer to serialize, but is more compact and deserializes quicker. Also worth noting - whether you compile a dictionary of element names or not, the result is much much much more space efficient than using HaXml, since it can all be decoded out of a single ByteString containing the document tree, with no actual string copying at all. That's the kind of [de]serialization I like. :-) Mind you, I still have to use HaXml when I first read documents into the system, and a very nice job it does too. T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

dons＠cse.unsw.edu.au

11:32 a.m.

drtomc:

...

...
Anyone trying to do any of this?

I've done some work in this area. I'm particularly interested in manipulating ASN.1 in haskell. Actually, my first use of Parsec was an ASN.1 parser. I'd done one previously in Spirit (the Boost C++ rip-off of parsec), but semantic actions were horrible in the extreme. Mmmm Parsec.

In the indexing system I'm currently building in Haskell for my day job, I'm serializing several data structures, and using Data.Bits and Data.ByteString heavily.

I was using HaXml, but I found it was very slow. So instead, I'm using an internal (within the indexing system) representation that is more akin to WBXML:

import Data.ByteString as ByteString import Data.List as List import Data.Sequence as Seq

data DocTree = DocElem ByteString [(ByteString,ByteString)] [DocTree] | DocText ByteString

serialize tree = ByteString.concat $ Seq.toList $ execState (serialize' tree) Seq.empty serialize' (DocText txt) = do stuff <- get put (stuff |> pack [0]) putStr txt serialize' (DocElem name attrs kids) = do stuff <- get put (stuff |> pack [1]) putStr name putNum (List.length attrs) mapM_ (putPair putStr putStr) attrs putNum (List.length kids) mapM_ serialize' kids

putStr ....

You get the idea. Actually, the *real* code is trickier - it grovels first to find all the element names and numbers them. Likewise with attribute names (per element). The extra grovel is well worth it - it takes a little longer to serialize, but is more compact and deserializes quicker.

Also worth noting - whether you compile a dictionary of element names or not, the result is much much much more space efficient than using HaXml, since it can all be decoded out of a single ByteString containing the document tree, with no actual string copying at all. That's the kind of [de]serialization I like. :-) Mind you, I still have to use HaXml when I first read documents into the system, and a very nice job it does too.

Can we do a cheap bytestring binding to libxml, to avoid any initial String processing? -- Don

Malcolm Wallace

4:24 p.m.

Andrew Coppin wrote:

...

While we're on the subject... am I the first person to notice that Haskell doesn't appear to have much support for fiddling with streams of bits?

See "The Bits Between The Lambdas: Binary Data in a Lazy Functional Language" Malcolm Wallace and Colin Runciman, International Symposium on Memory Management, Vancouver, Canada, Oct 1998 ftp://ftp.cs.york.ac.uk/pub/malcolm/ismm98.html Regards, Malcolm

6573

Age (days ago)

6580

Last active (days ago)

List overview

Download

62 comments

21 participants

participants (21)

Andrew Coppin
apfelmus
Bulat Ziganshin
Dave Bayer
David Roundy
Dominic Steinitz
dons＠cse.unsw.edu.au
Duncan Coutts
Ian Lynagh
Jon Cast
Jon Fairbairn
Jonathan Cast
Josef Svenningsson
Ketil Malde
Malcolm Wallace
Miles Sabin
Neil Mitchell
Paul Moore
Philip Armstrong
Stefan O'Rear
Thomas Conway