
So lately I've been working on a little program to generate trippy graphics. (Indeed, some of you may remember it from a few years back...) Anyway, getting to the point: I just restructured my program. As part of the restructuring, I got rid of all the jiggery-pokery with Data.Array.Storable and so forth and decided to use Data.Binary instead.

My first problem was quite simple. I want to write a program that reads an item from the input file, processes it, writes it to the output file, and repeats until the input file is empty. Data.Binary doesn't appear to provide any obvious way to do this. You can't be in the Get and Put monads simultaneously, and I can't figure out how to interleave them. In the end, I ended up with something like

  xs <- decodeFile f1
  encodeFile f2 (map f xs)

Unfortunately, it looks like this doesn't do what I want. Serialised lists have their length prepended to them, which means that encodeFile has to generate the *entire* list in memory so it can count how big it is, before it can output a single byte to disk. (!) This is particularly annoying since the output list will be exactly the same size as the input one, and the very first thing decodeFile does is read in that size figure.

Maybe there's a way around this glitch, and I just haven't thought of it yet. But you'd think that wanting to lazily process data from an input file and write it into an output file would be an extremely common use-case. Given that, I'm not seeing an obvious way to handle it.
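Just to make it concrete, this is roughly the sort of thing I'd like to be able to write. It's a completely untested sketch, and writeMany / readMany are names I've made up rather than anything the library provides: each element is encoded on its own (so no length prefix is ever written), and elements are decoded one at a time until the input runs out.

  import qualified Data.ByteString.Lazy as BL
  import Data.Binary (Binary(..), encode)
  import Data.Binary.Get (runGetState)

  -- Encode each element separately and concatenate lazily; no length
  -- prefix is written, so output can start before the list is finished.
  writeMany :: Binary a => FilePath -> [a] -> IO ()
  writeMany f = BL.writeFile f . BL.concat . map encode

  -- Decode elements one at a time until the (lazily read) input is used up.
  -- runGetState is the older Data.Binary.Get interface; newer versions of
  -- binary deprecate it in favour of runGetIncremental.
  readMany :: Binary a => FilePath -> IO [a]
  readMany f = do
      bs <- BL.readFile f
      return (go bs)
    where
      go bs
        | BL.null bs = []
        | otherwise  = let (x, rest, _) = runGetState get bs 0
                       in x : go rest

With that, the main loop would just be readMany f1 >>= writeMany f2 . map f. Whether the laziness actually works out the way I hope is exactly the kind of thing I'm not sure about.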
Anyway, that's one small problem, and I can live with it. There's another, much bigger problem though, and that's really what I wanted to talk about...

As we all know, Data.Binary is *built* for speed. So imagine my shock when my newly adjusted program, now based around this library, turned out to be massively slower, and the files it produced were drastically larger. It didn't take me that long to dig up the cause.

The problem seems to boil down to this: the Binary instance for Double (and Float, by the way) is... well, I guess you could argue it's very portable, but efficient it isn't.

As we all know, an IEEE-754 double-precision floating-point number occupies 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits (53 counting the implicit leading bit). I had assumed that the Binary instance for Double would simply write these bits to disk, requiring approximately zero computational power and exactly 64 bits of disk space. I was wrong.

What it *actually* does is convert the 64-bit Double into a 32-bit exponent and an arbitrary-precision integer mantissa. (!) And it appears to do this with floating-point arithmetic (presumably decodeFloat and encodeFloat) rather than just low-level bit shuffling.

Looking up how arbitrary-precision Integers are serialised, I see that there's an 8-bit "flag" indicating whether the integer fits into 32 bits, and if it doesn't, there's an 8-bit sign value, followed by a serialised list of bytes. In other words, a 32-bit length value followed by the bytes themselves.

To summarise, when I ask for a 64-bit Double to be serialised, I get:

  32-bit exponent
  8-bit flag (probably always "1")
  8-bit sign (either "0" or "1")
  32-bit size (probably always "8")
  64-bit mantissa

That's 144 bits in total, i.e., 2.25 times the size of the original Double. (Let's not even go into how much computing power it takes to construct all this data before it's written to disk...)

So that's why my files ballooned in size, and why my program slowed to a crawl. But how do I fix it? Well, my solution was simple, brutal, and probably *highly* non-portable.

It begins with Unsafe.Coerce. (That's never a good thing!) Put simply, if you take your 64-bit Double, unsafeCoerce it into a Word64, and then ask Data.Binary to serialise that, it writes it straight to disk without screwing around with it first. (A rough sketch of what I did is at the bottom of this message.) The result ought to be portable to any architecture that uses IEEE-754 arithmetic natively. (Anybody know of an arch this *doesn't* apply to?) But sure, if you wanted to do this on some other architecture with a different native float format, you'd have to write some tricky code to handle it. (And then add conditional compilation for it.)

Is there any danger that there might be some kind of improvement to the Double instance in the next version of Data.Binary?
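For reference, here's roughly what the trick looks like as a newtype wrapper. Just a sketch, and RawDouble is a name I've made up; the unsafeCoerce does exactly what it sounds like and simply reinterprets the Double's in-memory representation, which is the whole point (and the whole danger):

  import Data.Word (Word64)
  import Data.Binary (Binary(..))
  import Data.Binary.Get (Get)
  import Unsafe.Coerce (unsafeCoerce)

  -- Wrap a Double so we can give it a Binary instance that just dumps
  -- the raw IEEE-754 bit pattern as a Word64: 8 bytes, no conversion.
  newtype RawDouble = RawDouble Double

  instance Binary RawDouble where
    put (RawDouble d) = put (unsafeCoerce d :: Word64)
    get = do
      w <- get :: Get Word64
      return (RawDouble (unsafeCoerce w))

Then instead of serialising a [Double], I map RawDouble over the list before encoding and unwrap it again after decoding.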
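And if you want to see the size difference for yourself, here's a small standalone snippet (again just a sketch, not part of my program):

  import qualified Data.ByteString.Lazy as BL
  import Data.Binary (encode)
  import Data.Word (Word64)
  import Unsafe.Coerce (unsafeCoerce)

  main :: IO ()
  main = do
      let d = 1.0 :: Double
      -- Stock Binary instance: goes via the exponent/Integer representation,
      -- so this comes out well over 8 bytes.
      print (BL.length (encode d))
      -- Raw bit pattern: exactly 8 bytes.
      print (BL.length (encode (unsafeCoerce d :: Word64)))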