RE: getting a Binary module into the standard libs

This doesn't seem like an awful lot of work, and if it would help getting Binary into the hier libs, I'd be more than happy to do it. Given Eray's comments, it might turn out to be faster.
Great!
Simon: I have a version of the GHC Binary module, but I know that I've mucked around with it quite a bit. I've also got the one off of cvs in ghc/compiler/utils/Binary.hs, which would probably be a better (read: safer) starting place. Basically we want to add two functions:
putBit :: BinHandle -> Bool -> IO ()
getBit :: BinHandle -> IO Bool
I'd do it this way:
putBits :: BinHandle -> Int{-size-} -> Int{-value-} -> IO ()
and similarly for getBits. It will be easiest if the size is not allowed to go over 8, because beyond that you have to deal with endianness, and in any case we already have put for Int16, Int32 etc. written in terms of putWord8.
Currently the binary format is endian-independent for the basic integral types. If you use Int, then a binary file written on one machine is still only usable on a machine of the same word size, but if you want a truly mobile binary file you can restrict yourself to the explicitly sized integral types. I think this is a nice property to keep.
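To make the endian-independence point concrete, here is a minimal sketch of writing a 16-bit value as two bytes in a fixed order. The names word16ToBytes and bytesToWord16 are hypothetical and not taken from the GHC module, and most-significant-byte-first is an assumption chosen for the illustration.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word16, Word8)

-- Split a Word16 into two bytes, most significant byte first.
-- The result does not depend on the byte order of the host machine.
word16ToBytes :: Word16 -> [Word8]
word16ToBytes w = [fromIntegral (w `shiftR` 8), fromIntegral w]

-- Rebuild the Word16 from its high and low bytes.
bytesToWord16 :: Word8 -> Word8 -> Word16
bytesToWord16 hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo
```

Because every wider type is decomposed into single bytes this way, only a byte-level write is needed underneath, which is why limiting putBits to 8 bits sidesteps the endianness question.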
It seems that in order to accomplish this, the BinMem constructor needs to be augmented with two fields, one Word8 which contains bits which have been "put" but haven't yet been written to the array and another Word8 which stores the current bit position we are at in this Word8. Then, the work comes down mostly to bit-twiddling in the putWord8 and putBit functions (putBit being the simpler of the two). It seems the BinIO constructor would require basically the identical thing, which means perhaps this stuff should be added to the BinHandleState variable.
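The two extra fields described above (a pending Word8 plus a bit position) can be sketched as follows. This is an illustration only: BitWriter, newBitWriter and writtenBytes are hypothetical names standing in for the real BinHandle machinery, the byte store is a plain list rather than an array, and filling each byte from the most significant bit down is an assumption.

```haskell
import Data.Bits (setBit)
import Data.IORef
import Data.Word (Word8)

-- Hypothetical stand-in for a BinHandle: completed bytes (newest first),
-- the partially filled byte, and how many bits of it are already used.
data BitWriter = BitWriter
  { bwBytes   :: IORef [Word8]
  , bwPending :: IORef Word8
  , bwCount   :: IORef Int
  }

newBitWriter :: IO BitWriter
newBitWriter = BitWriter <$> newIORef [] <*> newIORef 0 <*> newIORef 0

-- Append one bit; when the pending byte is full, move it to the store.
putBit :: BitWriter -> Bool -> IO ()
putBit bw b = do
  pending <- readIORef (bwPending bw)
  n <- readIORef (bwCount bw)
  let pending' = if b then setBit pending (7 - n) else pending
  if n == 7
    then do
      modifyIORef' (bwBytes bw) (pending' :)
      writeIORef (bwPending bw) 0
      writeIORef (bwCount bw) 0
    else do
      writeIORef (bwPending bw) pending'
      writeIORef (bwCount bw) (n + 1)

-- Bytes completed so far, in the order they were written.
writtenBytes :: BitWriter -> IO [Word8]
writtenBytes bw = reverse <$> readIORef (bwBytes bw)
```

Writing the bits 1,0,1,0,0,0,0,1 through putBit yields the single byte 0xA1. A bit-aware putWord8 would have to split its argument across the pending byte and the next one whenever the bit count is non-zero.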
BinMem and BinIO differ quite a bit here: for BinMem you can write straight into the array, whereas for BinIO we need a cache - a single byte at the least, but ideally more. BinMem is the most important case to optimise (for us in GHC anyhow), since BinIO is already significantly slower due to the overhead of the Handle interface.
There should really be a closeBin function too; it's quite simple to add.

Cheers,
Simon

I'd do it this way:
putBits :: BinHandle -> Int{-size-} -> Int{-value-} -> IO ()
and similarly for getBits. It will be easiest if the size is not allowed to go over 8, because beyond that you have to deal with endianness, and in any case we already have put for Int16, Int32 etc. written in terms of putWord8.
Any reason not to make those Word8s instead of Ints? That way we don't have to throw errors when things are too big; additionally, Word implies more of a bit-wise thing to me than Int, where I always worry about whether shifts are going to mess with my sign bit, etc. I imagine 'putBits bh 3 word' means to write the 3 least significant bits of word? That is, 'putBits bh 3 (32 + 7)' is the same as 'putBits bh 3 7', right?
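If the intended meaning is indeed "write the n least significant bits", the masking step could look like the sketch below. The helper lowBits is a hypothetical name, and clamping n to the range 0..8 instead of raising an error is an assumption.

```haskell
import Data.Bits (shiftL, (.&.))
import Data.Word (Word8)

-- Keep only the n least significant bits of a byte.
-- Sizes outside 0..8 are clamped rather than treated as errors.
lowBits :: Int -> Word8 -> Word8
lowBits n w
  | n <= 0    = 0
  | n >= 8    = w
  | otherwise = w .&. ((1 `shiftL` n) - 1)
```

Under this reading, lowBits 3 (32 + 7) and lowBits 3 7 both give 7, so the two putBits calls in the question would write the same three bits.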
BinMem and BinIO differ quite a bit here: for BinMem you can write straight into the array, whereas for BinIO we need a cache - a single byte at the least, but ideally more. BinMem is the most important case to optimise (for us in GHC anyhow), since BinIO is already significantly slower due to the overhead of the Handle interface.
Currently the BinIO implementation doesn't do any caching, though, right? I actually rarely use the BinMem implementation, so I would be more interested in improving the speed of BinIO. Presumably, however, the Handle actually does buffering, which should take care of a large part of this, right?
There should really be a closeBin function too; it's quite simple to add.
On files this would flush the current byte then close the file handle, right? What would it do on BinMems (other than flush the byte)?
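As a sketch of what "flush the current byte" could mean, the pure function below packs a bit stream into bytes and zero-pads the final partial byte. The name packBits, the most-significant-bit-first order and the zero padding are all assumptions for illustration, not behaviour confirmed anywhere in this thread.

```haskell
import Data.Bits (setBit)
import Data.List (foldl')
import Data.Word (Word8)

-- Pack bits into bytes, most significant bit first. A final chunk with
-- fewer than 8 bits leaves its unused low bits as zero, which models
-- flushing a partially filled byte on close.
packBits :: [Bool] -> [Word8]
packBits [] = []
packBits bs = toByte chunk : packBits rest
  where
    (chunk, rest) = splitAt 8 bs

    toByte :: [Bool] -> Word8
    toByte c = foldl' step 0 (zip [7, 6 .. 0] c)

    step :: Word8 -> (Int, Bool) -> Word8
    step acc (i, True)  = setBit acc i
    step acc (_, False) = acc
```

With this padding rule, three set bits flush as the byte 0xE0. A reader would need to know the bit count from elsewhere, since the padding is indistinguishable from real zero bits.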
Participants (2): Hal Daume III, Simon Marlow