[Haskell-cafe] Architecturally flawed

10 Jul 2008

      OK, so I just spent an entire day trying to write some code, and after 
hours of struggling I have something that's semi-working but very ugly 
and rather unreliable. My usual guideline for Haskell programming is "if 
it seems hard, you're doing it wrong"...

After many hours of effort, I came up with these:

  data Writer x
  instance Monad Writer
  run_writer :: Writer () -> ByteString
  write1 :: Bool -> Writer ()
  write8 :: Word8 -> Writer ()
  write16 :: Word16 -> Writer ()
  write32 :: Word32 -> Writer ()
  write64 :: Word64 -> Writer ()
  writeN :: Word8 -> Word32 -> Writer ()

  data Reader x
  instance Monad Reader
  run_reader :: Reader x -> ByteString -> x
  is_EOF :: Reader Bool
  read1 :: Reader Bool
  read8 :: Reader Word8
  read16 :: Reader Word16
  read32 :: Reader Word32
  read64 :: Reader Word64
  readN :: Word8 -> Reader Word32

Notice that, unlike the Binary package, all of these are bit-aligned 
rather than byte-aligned. This is what took so much effort. I strived to 
make each operation as efficient as possible - by performing as few 
steps as possible. It would have been *far* easier to, say, implement 
read8 by calling read1 eight times and then packing the bits back into a 
byte. But that wouldn't be very fast. So actually read8 reads a pair of 
bytes, bit-shifts them as required, and then ANDs them together. And so 
on. (Note that the Writer monad uses Data.Binary.Builder (only) from the 
Binary package. I assume this is more efficient...)

I even sat down and wrote a battery of QuickCheck properties because - 
as you can imagine - there's a hell of a lot to go wrong here! After 
many hours, I managed to get it to pass all tests...

Next I decided to write an LZW compressor. Here's what I came up with:

  data EncodeLZW symbol
  eLZW_start :: (Ord symbol) => EncodeLZW symbol
  eLZW_encode :: (Ord symbol) => EncodeLZW symbol -> symbol -> Writer 
(EncodeLZW symbol)
  eLZW_end :: (Ord symbol) => EncodeLZW symbol -> Writer ()

  data DecodeLZW symbol
  dLZW_start :: DecodeLZW symbol
  dLZW_decode :: DecodeLZW symbol -> Reader (DecodeLZW symbol, symbol)

So each step of the encoder (possibly) writes some bits, and also 
returns a new encoder state. The decoder is similar. (At least one 
program bug was caused by the wrong version of the state being used 
somewhere!)

Then I tried to run the code, and oh look, there's a bug somewhere.

Emphasis "somewhere". At this point I have several KB of code and no 
idea how in hell to test it. What I *want* to do is single-step through 
the encoder algorithm, watch it write bits out and update state, and 
then single-step through the decoder watching it read bits in and update 
its state, etc. But I can't. All the encoder does is return a giant 
Writer object which you can't look inside, only run. And likewise for 
the decoder.

I could try GHC's new debugger. But my experiences with it so far have 
shown that for all but the most trivial programs possible, it becomes 
intractably difficult to figure out what the debugger is actually 
showing you. I actually tried to debug a "normal" LZW implementation 
once - one that didn't involve two highly convoluted custom monads and a 
stateful execution model with manually threaded state. This is *not* my 
idea of a fun time...

In a "normal" programming language, at this point you would sprinkle a 
few print statements around so you can see the intermediate steps the 
program is taking. But I can't. I'm in the wrong monad. Curses!

In the end, the only solution I could come up with was to design a 
second, "hacked" version of the two monads. Instead of importing one 
module, you import another that provides the same interface. However, 
now Reader and Writer are aliases to IO. All write requests cause 
pretty-printed binary to be sent to stdout, and all read requests pop up 
a prompt for input from stdin. It worked reasonably well in that I could 
now add the vitally necessary print statements, and see intermediate 
values and so forth... It wasn't very pretty though.

At this point I decided that lugging 5 individual functions around was 
silly, and I should write a class:

  class Encoder e where
    encoder_start :: e x
    encode :: e x -> x -> Writer (e x)
    encoder_end :: e x -> Writer ()

  class Decoder d where
    decoder_start :: d x
    decode :: d x -> Reader (d x, x)

10 points to the first person to spot the fatal flaw in this plan...

Yep, that's right. The type system won't accept this. The reason? x 
needs an Ord context.

If you turn on multi-parameter type classes *and* flexible instances, 
apparently this works:

  class Encoder e x where...

  instance (Ord x) => Encoder EncodeLZW x where...

(I haven't tried but I *presume* it means what I think it does...)

Alternatively, you can avoid using dangerous type system extensions and 
just write

  class Encoder e where
    encode :: (Symbol x) => e x -> x -> Writer (e x)
    ...

and make Symbol encompass everything any possible encoder could want. 
(In reality, x is almost always Word8 or something. But I dislike loss 
of generality just for the hell of it...)

So at least I got that part fixed. Heh. But now I find myself worrying 
about yet *another* problem: is Writer lazy enough?

I mean, the idea is that you can write a program that lazily reads from 
a file, pushes it through a Writer, and lazily writes the result back to 
another file. The thing should chug along reasonably fast and use a 
constant amount of RAM. But all this is only true of Writer is actually 
lazy enough. I have a horrible feeling that all the complicated Origami 
inside makes it too strict. And I have no way to check!

Actually, thinking about it, is Reader lazy enough? You call run_reader 
and it hands over your data, but if that data is, say, a list, will 
run_reader build the entire damn list before it'll hand it over? Or will 
the monadic code by called as-needed to generate the list elements? 
Obviously I desire the latter, but I'm really not sure what the actual 
behaviour is...

In summary, I've spent several days coding, and I still don't have a 
single algorithm working. I've spent all my time wrestling with the 
mundane details of infrastructure rather than the interesting algorithms 
I actually set out to implement. This makes me very sad. :-(

If anybody can think of a better set of abstractions or another way to 
do debugging, I'd be interested to hear about it.

(This would all be so trivial in an OO language. Just make an Encoder 
object that updates its own state internally and talks to a Source 
object and a Destination object for its data...)

[Haskell-cafe] Architecturally flawed

Andrew Coppin