RE: getting a Binary module into the standard libs

- How important is this?
The motivation for my Binary library was to save space, whereas Simon's motivation was to be fast (at any rate to be faster than parsing text). Thus, a bitstream is potentially far more compact than a bytestream, depending of course on the natural size of the objects to be serialised. But the tradeoff is that a bytestream is far quicker to build/read, because there is no tricky logic required to ensure that bits are shifted to the right place etc.
Different applications will require different characteristics. There is no one-size-fits-all.
I should add that I never actually benchmarked the speed difference between using bits and bytes, and I don't believe it's clearly a win to use bytes. There's the time to get the data off the disk, caching effects, etc. to take into account, all of which tend to be more important the more recent your machine is. I'd be delighted if someone would add support for bits to GHC's Binary library (to get a fair comparison) and did some measurements. It may be that there's no need to add an abstraction layer to provide support for both bits and bytes. Cheers, Simon

On Monday 11 November 2002 11:54, Simon Marlow wrote:
I should add that I never actually benchmarked the speed difference between using bits and bytes, and I don't believe it's clearly a win to use bytes. There's the time to get the data off the disk, caching effects, etc. to take into account, all of which tend to be more important the more recent your machine is.
Nowadays, the bandwidth inside the ALUs (most chips are superscalar) beat the
memory bandwidth by far so you have time to interleave any memory transfer
code with a good number of ops.
I remember that we could do that on the MC68000 8 years ago and a good FORTRAN
compiler should be knowing about this fact of modern architectures for more
than a decade.
Actually if you can achieve a significant saving in space, the bitstream
approach would be faster!
Regards,
--
Eray Ozkural (exa)

Simon,
I'd be delighted if someone would add support for bits to GHC's Binary library (to get a fair comparison) and did some measurements. It may be that there's no need to add an abstraction layer to provide support for both bits and bytes.
This doesn't seem like an awful lot of work, and if it would help getting Binary into the hier libs, I'd be more than happy to do it. Given Eray's comments, it might turn out to be faster. Simon: I have a version of the GHC Binary module, but I know that I've mucked around with it quite a bit. I've also got the one off of cvs in ghc/compiler/utils/Binary.hs, which would probably be a better (read: safer) starting place. Basically we want to add two functions: putBit :: BinHandle -> Bool -> IO () getBit :: BinHandle -> Bool -> IO () It seems that in order to accomplish this, the BinMem constructor needs to be augmented with two fields, one Word8 which contains bits which have been "put" but haven't yet been written to the array and another Word8 which stores the current bit position we are at in this Word8. Then, the work comes down mostly to bit-twiddling in the putWord8 and putBit functions (putBit being the simpler of the two). It seems the BinIO constructor would require basically the identical thing, which means perhaps this stuff should be added to the BinHandleState variable. It might be wise also to add a function like: flushByte :: BinHandle -> IO () which essentially moves you forward to the next byte position. Of course both reading and writing would have to call this at the same type. There's a concern that if you only write, say, 2 of the last 8 bits, this won't actually get written unless you call flushByte. I don't know how this was handled in the NHC version; perhaps just require the user to call flushByte at the end if they're worried about such behavior? - Hal

Hal Daume III
It seems that in order to accomplish this, the BinMem constructor needs to be augmented with two fields, one Word8 which contains bits which have been "put" but haven't yet been written to the array and another Word8 which stores the current bit position we are at in this Word8. Then, the work comes down mostly to bit-twiddling in the putWord8 and putBit functions (putBit being the simpler of the two). It seems the BinIO constructor would require basically the identical thing, which means perhaps this stuff should be added to the BinHandleState variable.
That's basically what we did with the nhc98 implementation, yes, except that the BinHandle was originally implemented in C rather than Haskell.
It might be wise also to add a function like: flushByte :: BinHandle -> IO ()
Yep.
There's a concern that if you only write, say, 2 of the last 8 bits, this won't actually get written unless you call flushByte. I don't know how this was handled in the NHC version; perhaps just require the user to call flushByte at the end if they're worried about such behavior?
The final-byte flush should be automatic when the BinHandle is closed. One very tricky issue is what to do with a bitstream that has been written in a prior program run, with the final byte incomplete, and is then re-opened in AppendMode. We used to have a little trick that added an extra three-bit value to the end of the final byte of the actual file. This represented the position of the last "real" bit in the stream as an offset backwards from the final bit in the file (allowing for the three-bit tag itself of course). Later, we realised that this solution was kind of nasty because it adds meta-data to the binary stream, which is OK if the only code that ever processes the bits is in Haskell, but if it is any kind of external format, e.g. JPEG, then it really screws you up. I think the real solution is to forbid AppendMode. Regards, Malcolm

Malcolm,
There's a concern that if you only write, say, 2 of the last 8 bits, this won't actually get written unless you call flushByte. I don't know how this was handled in the NHC version; perhaps just require the user to call flushByte at the end if they're worried about such behavior?
The final-byte flush should be automatic when the BinHandle is closed.
The GHC version lacks a close function. Presumably this is the 'closeBin' function which is commented out in the export list. All this would need to do is call flushByte, assuming we don't allow AppendMode? Though if we force them to call flushByte at the end, instead of some otherly-named function, presumably they could use AppendMode as long as they realize that they need to flushBytes at the end of any write. I don't particularly care...I don't find appending binary files to be very useful.

It might be wise also to add a function like:
flushByte :: BinHandle -> IO ()
Interfaces to other things with alignment constraints (e.g., memory allocators) often have one of two generalizations: 1) (Most likely to be useful): --| flushBytes h n aligns the strean to the next 2^n byte (bit?) boundary flushBytes :: BinHandle -> Int -> IO () 2) (Less likely to be useful in this context) --| flushBytes h m n aligns the stream such that the position p satisfies: -- p `mod` 2^m == n flushBytes :: BinHandle -> Int -> IO () I'd guess that the former could be useful and that the latter is excessive for this case. -- Alastair

Alastair Reid wrote:
Interfaces to other things with alignment constraints (e.g., memory allocators) often have one of two generalizations:
1) (Most likely to be useful):
--| flushBytes h n aligns the strean to the next 2^n byte (bit?) boundary flushBytes :: BinHandle -> Int -> IO ()
2) (Less likely to be useful in this context)
--| flushBytes h m n aligns the stream such that the position p satisfies: -- p `mod` 2^m == n flushBytes :: BinHandle -> Int -> IO ()
Years (decades?) ago it occurred to me that the above two options for expressing alignment could neatly (at least in terms of the interface) be coalesced, as follows. An alignment constraint for allocation in a memory area is expressed as an unsigned integer of the same size as addresses for that memory area. An alignment constraint of zero is not sensible. (It could be used to indicate a default value or could evoke an error, depending on the application.) The alignment constraint encodes the values `m` and `n` from the second alternative above as `2^m + n`. Operationally, the combined alignment constraint is split at the uppermost "1" bit, with that bit's position indicating the power of two and the lower bits indicating the desired remainder upon division by that power of two. This scheme compactly encodes all meaningful pairs of values of `m` and `n` and has only a single inherently meaningless combined value--the aforementioned value zero. I've never had the opportunity to use or implement such an interface, but I have occasionally missed the generality it provides when I had to allocate extra space in order to place a descriptor immediately before a well-aligned bunch of bits. Has anyone seen such an interface? -- Dean
participants (6)
-
Alastair Reid
-
Dean Herington
-
Eray Ozkural
-
Hal Daume III
-
Malcolm Wallace
-
Simon Marlow