question about Data.Binary and Double instance

Hi all, I'm wondering what exactly inspired the decode/encodeFloat implementation for Data.Binary? It seems to me like it'd be much better to use a standard format like IEEE, which would also be much more efficient, since as far as I know, on every implementation a Double and a CDouble are identical. Are there any suggestions how I could use Data.Binary to actually read a binary file full of Doubles? Should I just use the Array interface, and forget laziness and hopes of handling different-endian machines? Or is there some way to reasonably do this using Data.Binary? -- David Roundy Department of Physics Oregon State University

On Tue, Apr 17, 2007 at 10:32:02AM -0700, David Roundy wrote:
I'm wondering what exactly inspired the decode/encodeFloat implementation
I kind of wondered the same thing when I first saw it. Looks like it was just the quickest way to get it going.
Are there any suggestions how I could use Data.Binary to actually read a binary file full of Doubles? Should I just use the Array interface, and forget laziness and hopes of handling different-endian machines? Or is there some way to reasonably do this using Data.Binary?
I threw together a somewhat portable "longBitsToDouble" function a while ago for another project. http://darcs.brianweb.net/hsutils/src/Brianweb/Data/Float.lhs It doesn't depend on any unsafe operations or external FFI functions, but it will only work on IEEE 754 machines (that includes every machine GHC runs on). It might not be fast enough for you, though, as it still goes via Integer in the conversion. -Brian
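For readers who cannot reach the linked file, here is a minimal sketch of what a pure bits-to-Double conversion via Integer can look like. It is not the code from Float.lhs; the name bitsToDouble and every detail below are assumptions made for illustration, and it presumes Double is an IEEE 754 binary64 with floatRadix 2.

```haskell
import Data.Bits (shiftR, testBit, (.&.))
import Data.Word (Word64)

-- Hypothetical helper (not from Float.lhs): interpret a 64-bit pattern as an
-- IEEE 754 binary64 value using only encodeFloat, i.e. going via Integer.
bitsToDouble :: Word64 -> Double
bitsToDouble w
  | e == 0x7ff && m /= 0 = 0 / 0                      -- NaN (payload is lost)
  | e == 0x7ff           = sign (1 / 0)               -- +/- infinity
  | e == 0               = sign (encodeFloat m (-1074)) -- zero and subnormals
  | otherwise            = sign (encodeFloat (m + 2 ^ (52 :: Int)) (e - 1075))
  where
    e = fromIntegral ((w `shiftR` 52) .&. 0x7ff) :: Int     -- biased exponent
    m = fromIntegral (w .&. 0xfffffffffffff) :: Integer     -- 52-bit fraction
    sign x = if testBit w 63 then negate x else x           -- bit 63 is the sign
```

The Integer arithmetic is what makes this slow compared with a raw memory copy, which is the trade-off mentioned above.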

On Tue, Apr 17, 2007 at 02:50:14PM -0400, Brian Alliet wrote:
I threw together a somewhat portable "longBitsToDouble" function a while ago for another project.
http://darcs.brianweb.net/hsutils/src/Brianweb/Data/Float.lhs
It doesn't depend on any unsafe operations or external FFI functions, but it will only work on IEEE 754 machines (that includes every machine GHC runs on). It might not be fast enough for you, though, as it still goes via Integer in the conversion.
It seems like this conversion shouldn't take any time at all, and we ought to be able to just copy the memory right over, or just do an unsafeCoerce# (which is admittedly unsafe, but in practice between a Word64 and a Double should be fine)... -- David Roundy Department of Physics Oregon State University

On Tue, Apr 17, 2007 at 12:18:29PM -0700, David Roundy wrote:
machine GHC runs on). It might not be fast enough for you, though, as it still goes via Integer in the conversion.
It seems like this conversion shouldn't take any time at all, and we ought to be able to just copy the memory right over, or just do an unsafeCoerce# (which is admittedly unsafe, but in practice between a Word64 and a Double should be fine)...
True. I only wrote it that way so I wouldn't have to muck with low level details between haskell implementations. The right thing to do for Data.Binary is probably to just peek the Double off the ForeignPtr in the ByteString, no sense going through Word64 at all. -Brian
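A rough sketch of the "just copy the memory" idea, using only Storable from base: write the value through a pointer of one type and read it back through a pointer of the other. The function names are invented for this example, and the code assumes Word64 and Double have the same size and that Double is IEEE 754 binary64. Newer versions of base also export castWord64ToDouble and castDoubleToWord64 from GHC.Float, which did not exist when this thread was written.

```haskell
import Data.Word (Word64)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (castPtr)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical helper: reinterpret the 8 bytes of a Word64 as a Double.
-- unsafePerformIO is acceptable here only because the temporary buffer is
-- private to the call and the result depends on the argument alone.
word64ToDouble :: Word64 -> Double
word64ToDouble w = unsafePerformIO $ alloca $ \p -> do
  poke p w            -- store the raw bits
  peek (castPtr p)    -- read the same bytes back as a Double

-- The inverse direction, by the same trick.
doubleToWord64 :: Double -> Word64
doubleToWord64 d = unsafePerformIO $ alloca $ \p -> do
  poke p d
  peek (castPtr p)
```

Because both directions use host byte order, the result does not depend on endianness as long as the Word64 itself was assembled with the right byte order.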

On Tue, 2007-04-17 at 10:32 -0700, David Roundy wrote:
Hi all,
I'm wondering what exactly inspired the decode/encodeFloat implementation for Data.Binary? It seems to me like it'd be much better to use a standard format like IEEE, which would also be much more efficient, since as far as I know, on every implementation a Double and a CDouble are identical.
Are there any suggestions how I could use Data.Binary to actually read a binary file full of Doubles? Should I just use the Array interface, and forget laziness and hopes of handling different-endian machines? Or is there some way to reasonably do this using Data.Binary?
Hi David,
We'd like to use IEEE format as the default Data.Binary serialisation format for Haskell's Float and Double types; the only thing that makes this tricky is doing it portably and efficiently.
We can't actually guarantee that we have any IEEE format types available. isIEEE will tell you whether a particular type is indeed IEEE, but what do we do if isIEEE CDouble = False? Perhaps we just don't care about ARM or other arches where GHC runs that do not use IEEE formats, I don't know. If that were the case we'd say something like:
    instance Binary Double where
      put d = assert (isIEEE (undefined :: Double)) $ do
                write (poke d)
If we do care about ARM and the like then we need some way to translate from the native Double encoding to an IEEE double external format. I don't know how to do that. I also worry we'll end up with lots of #ifdefs.
The other problem with doing this efficiently is that we have to worry about alignment for that poke d operation. If we don't know the alignment we have to poke into an aligned side buffer and copy over. Similar issues apply to reading.
I'm currently exploring more design ideas for Data.Binary including how to deal with alignment. Eliminating unnecessary bounds checks and using aligned memory operations also significantly improves performance. I can get up to ~750Mb/s serialisation out of a peak memory bandwidth of ~1750Mb/s, though a Haskell word-writing loop can only get ~850Mb/s.
Duncan
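The "poke into an aligned side buffer and copy over" step could be sketched roughly as follows with base's Foreign modules. All function names here are invented for illustration and are not part of Data.Binary; the sketch writes host byte order and does nothing about endianness.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (alloca, allocaBytes)
import Foreign.Marshal.Utils (copyBytes)
import Foreign.Ptr (Ptr, castPtr, plusPtr)
import Foreign.Storable (peek, poke, sizeOf)

-- Hypothetical helper: write a Double to a destination of unknown alignment
-- by poking into an aligned temporary and copying the bytes across.
pokeDoubleUnaligned :: Ptr Word8 -> Double -> IO ()
pokeDoubleUnaligned dst d = alloca $ \tmp -> do
  poke tmp d                               -- tmp is suitably aligned for Double
  copyBytes dst (castPtr tmp) (sizeOf d)   -- byte-wise copy, no alignment needed

-- The reading direction: copy into an aligned temporary, then peek.
peekDoubleUnaligned :: Ptr Word8 -> IO Double
peekDoubleUnaligned src = alloca $ \tmp -> do
  copyBytes (castPtr tmp) src (sizeOf (undefined :: Double))
  peek tmp

-- Small driver: store and reload a value at a given byte offset in a buffer.
roundTripAtOffset :: Int -> Double -> IO Double
roundTripAtOffset off d = allocaBytes (off + sizeOf d) $ \buf -> do
  let p = buf `plusPtr` off :: Ptr Word8
  pokeDoubleUnaligned p d
  peekDoubleUnaligned p
```

The extra copy is the cost being discussed: when the destination is known to be aligned, the poke could go straight into the output buffer.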

On Wed, Apr 18, 2007 at 12:34:58PM +1000, Duncan Coutts wrote:
We'd like to use IEEE format as the default Data.Binary serialisation format for Haskell's Float and Double type, the only thing that makes this tricky is doing it portably and efficiently.
You should note that your current method of serializing Doubles (encodeFloat/decodeFloat) isn't portable either, as the results of these functions depend on floatRadix. So using some method that depends on the IEEE representation isn't much worse (it might actually be better, as you'd get an error at runtime rather than writing data that could potentially be read back as garbage). I think the only way to do this 100% portably is to encode it as a Rational before serializing. Also, even if someone were to bother to write the code to convert from an arbitrary floating point rep to IEEE for serialization, you'd run the risk of losing information if the host's floating point rep was more accurate than IEEE FP. It seems like it might come down to making a choice between 100% portable but incompatible with the default serialization mechanisms in other languages, or non-portable (but OK for just about every popular arch used today) and compatible with other languages.
Perhaps we just don't care about ARM or other arches where GHC runs that
Are there really any architectures supported by GHC that don't use IEEE floating point? If so GHC.Float is wrong as isIEEE is always true. -Brian
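To make the floatRadix point from this message concrete, here is a small sketch of the two encodings being compared. The wrapper names are invented for this example. The decodeFloat pair is only meaningful to a reader with the same radix, whereas a Rational denotes the exact value independently of the representation (but cannot carry NaN or the infinities).

```haskell
-- (mantissa, exponent) with x == mantissa * floatRadix x ^^ exponent.
-- The pair means nothing without knowing the writer's floatRadix.
decomposed :: Double -> (Integer, Int)
decomposed = decodeFloat

-- Rebuilding is only faithful when the reader's radix matches the writer's.
recomposed :: (Integer, Int) -> Double
recomposed = uncurry encodeFloat

-- Representation-independent: an exact fraction. Only use this for finite
-- values; toRational gives meaningless results for NaN and infinities.
viaRational :: Double -> Rational
viaRational = toRational

backFromRational :: Rational -> Double
backFromRational = fromRational
```

On GHC, where Double has radix 2 and a 53-bit significand, decomposed 1.0 is (4503599627370496, -52), that is 2^52 * 2^(-52).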

On Tue, Apr 17, 2007 at 11:42:40PM -0400, Brian Alliet wrote:
Perhaps we just don't care about ARM or other arches where GHC runs that
Are there really any architectures supported by GHC that don't use IEEE floating point? If so GHC.Float is wrong as isIEEE is always true.
The one most likely to be non-IEEE is ARM, which has a middle-endian representation; to make it explicit, it's the middle case here (FLOAT_WORDS_BIGENDIAN but not WORDS_BIGENDIAN):
    #if WORDS_BIGENDIAN
        unsigned int negative:1;
        unsigned int exponent:11;
        unsigned int mantissa0:20;
        unsigned int mantissa1:32;
    #else
    #if FLOAT_WORDS_BIGENDIAN
        unsigned int mantissa0:20;
        unsigned int exponent:11;
        unsigned int negative:1;
        unsigned int mantissa1:32;
    #else
        unsigned int mantissa1:32;
        unsigned int mantissa0:20;
        unsigned int exponent:11;
        unsigned int negative:1;
    #endif
    #endif
Does anyone know if that makes it non-IEEE?
Thanks
Ian

On Sun, Apr 22, 2007 at 10:43:23PM +0100, Ian Lynagh wrote:
On Tue, Apr 17, 2007 at 11:42:40PM -0400, Brian Alliet wrote:
Perhaps we just don't care about ARM or other arches where GHC runs that
Are there really any architectures supported by GHC that don't use IEEE floating point? If so GHC.Float is wrong as isIEEE is always true.
The one most likely to be non-IEEE is ARM, which has a middle-endian representation; to make it explicit, it's the middle case here (FLOAT_WORDS_BIGENDIAN but not WORDS_BIGENDIAN):
    #if WORDS_BIGENDIAN
        unsigned int negative:1;
        unsigned int exponent:11;
        unsigned int mantissa0:20;
        unsigned int mantissa1:32;
    #else
    #if FLOAT_WORDS_BIGENDIAN
        unsigned int mantissa0:20;
        unsigned int exponent:11;
        unsigned int negative:1;
        unsigned int mantissa1:32;
    #else
        unsigned int mantissa1:32;
        unsigned int mantissa0:20;
        unsigned int exponent:11;
        unsigned int negative:1;
    #endif
    #endif
Does anyone know if that makes it non-IEEE?
AIUI, IEEE 754 talks about high bits and low bits, not first or last bytes, which means that the format itself is endianness-independent. It also means that the byte layout of an IEEE 754 value in memory is endian-dependent: we'll have to swap values into network byte order before saving if we're on a little-endian host. Stefan
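One way to sidestep the host byte order entirely is to extract the bytes by shifting, most significant first, which yields network order on any host. The sketch below assumes a base version that exports castDoubleToWord64 and castWord64ToDouble from GHC.Float (newer than this thread); doubleToBE and doubleFromBE are names made up for the example.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word64, Word8)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Hypothetical helper: the 8 bytes of a Double in network (big-endian) order.
-- Shifts operate on the numeric value, so the host's endianness is irrelevant.
doubleToBE :: Double -> [Word8]
doubleToBE d = [fromIntegral (w `shiftR` s) | s <- [56, 48 .. 0]]
  where
    w = castDoubleToWord64 d

-- Inverse: fold the first 8 bytes, most significant first, back into a Word64.
doubleFromBE :: [Word8] -> Double
doubleFromBE = castWord64ToDouble . foldl step 0 . take 8
  where
    step :: Word64 -> Word8 -> Word64
    step acc b = (acc `shiftL` 8) .|. fromIntegral b
```

For example, 1.0 has the bit pattern 0x3ff0000000000000, so its first two bytes in network order are 0x3f and 0xf0 and the remaining six are zero.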

On Wed, Apr 18, 2007 at 12:34:58PM +1000, Duncan Coutts wrote:
We can't actually guarantee that we have any IEEE format types available. The isIEEE will tell you if a particular type is indeed IEEE but what do we do if isIEEE CDouble = False ?
All the computer architectures I've ever used had IEEE format types. Perhaps we could add to the standard libraries an IEEEDouble type and conversions between it and ordinary types. This would put the ugly ARM hackery where it belongs, I suppose.
Perhaps we just don't care about ARM or other arches where GHC runs that do not use IEEE formats, I don't know. If that were the case we'd say something like:
I don't.
    instance Binary Double where
      put d = assert (isIEEE (undefined :: Double)) $ do
                write (poke d)
I'd rather have this or nothing. It may be that there are people out there who want to serialize and read Doubles to and from Haskell, but I imagine most people want to read or write formats that can interoperate with other languages (which is the only reason I'm looking into Binary now).
It's rather inconvenient (and took me quite some time to track down) having such a non-standard serialization for Double.
If there were no Binary instance for Double, I could write this myself, but alas, once an instance is declared, there's no way to undeclare it, and the workarounds aren't pretty. I suppose I can
    newtype DDouble = D Double
    unD (D d) = d
    instance Binary DDouble where
      put (D d) = assert (isIEEE (undefined :: Double)) $ write (poke d)
    putDouble = put . D
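For comparison, a newtype workaround of this kind can be written quite compactly against later releases of the binary package, assuming a version that provides putDoublele and getDoublele in Data.Binary.Put and Data.Binary.Get (these did not exist at the time of this thread). The type name IEEEDoubleLE and the helper encodedBytes are invented for the sketch, and little-endian is an arbitrary choice.

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.Binary.Get (getDoublele)
import Data.Binary.Put (putDoublele)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Hypothetical wrapper: a Double whose Binary instance is the raw
-- 8-byte IEEE 754 encoding in little-endian byte order.
newtype IEEEDoubleLE = IEEEDoubleLE { unIEEE :: Double }
  deriving (Eq, Show)

instance Binary IEEEDoubleLE where
  put (IEEEDoubleLE d) = putDoublele d
  get = IEEEDoubleLE <$> getDoublele

-- Convenience for inspecting the wire format of a single value.
encodedBytes :: Double -> [Word8]
encodedBytes = BL.unpack . encode . IEEEDoubleLE

-- Round trip through the instance.
roundTrip :: Double -> Double
roundTrip = unIEEE . decode . encode . IEEEDoubleLE
```

This still has the drawback raised in the thread: every caller has to wrap and unwrap, and the byte order is baked into the type.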
If we do care about ARM and the like then we need some way to translate from the native Double encoding to an IEEE double external format. I don't know how to do that. I also worry we'll end up with lots of #ifdefs.
I'd say lots of #ifdefs are okay. This is a low-level library dealing with low-level architecture differences.
The other problem with doing this efficiently is that we have to worry about alignment for that poke d operation. If we don't know the alignment we have to poke into an aligned side buffer and copy over. Similar issues apply to reading.
Right now, efficiency is less of a concern to me than ease. I imagine the efficiency can be fixed up later? I'd think you could statically check the alignment with a bit of type hackery (and note that I said I thought *you* could, not *I* could). Something like creating two monad types, an aligned one and an arbitrary one, and at run-time select which monad to use, so the check could occur just once. -- David Roundy http://www.darcs.net

On Wed, 2007-04-18 at 08:30 -0700, David Roundy wrote:
On Wed, Apr 18, 2007 at 12:34:58PM +1000, Duncan Coutts wrote:
We can't actually guarantee that we have any IEEE format types available. The isIEEE will tell you if a particular type is indeed IEEE but what do we do if isIEEE CDouble = False ?
All the computer architectures I've ever used had IEEE format types. Perhaps we could add to the standard libraries an IEEEDouble type and conversions between it and ordinary types. This would put the ugly ARM hackery where it belongs, I suppose.
From the point of view of this library that would be ideal yes. I'm not sure the Haskell implementation maintainers would see it the same way.
Perhaps we just don't care about ARM or other arches where GHC runs that do not use IEEE formats, I don't know. If that were the case we'd say something like:
I don't.
:-)
    instance Binary Double where
      put d = assert (isIEEE (undefined :: Double)) $ do
                write (poke d)
I'd rather have this or nothing. It may be that there are people out there who want to serialize and read Doubles to and from Haskell, but I imagine most people want to read or write formats that can interoperate with other languages (which is the only reason I'm looking into Binary now).
It's rather inconvenient (and took me quite some time to track down) having such a non-standard serialization for Double.
If there were no Binary instance for Double, I could write this myself, but alas, once an instance is declared, there's no way to undeclare it, and the workarounds aren't pretty. I suppose I can
By the way, perhaps it is not obvious yet, but the library is supposed to be split in two halves, serving different audiences and purposes. One is to interoperate with existing externally defined binary data formats. It sounds like your application falls into that category. The other is to serialise Haskell data structures. For the latter case we use the Binary class. You should not care what format you get from using this class, only that it has some useful properties like round-tripping (on the same machine and across architectures and Haskell implementations). If you do care what the format is, you should not be using the Binary class. You should instead be using the other side of the library. Now at the moment the other side is under-developed, it only provides a few primitives. But you can see why using a Binary class is not going to work for these cases where people care about the format, the instance for any particular type is not going to be right:
newtype DDouble = D Double unD (D d) = d
instance Binary DDouble where
and people will for ever be defining newtype wrappers or complaining that the whole library isn't parametrised by the endianness or whatever. For existing formats you need much more flexibility and control. The Binary class is to make it really convenient to serialise Haskell types, and it's built on top of the layer that gives you full control. We intend to work more on this other side of the library in the coming couple of months. If you could tell us a bit more about your use case, that'd be great.
If we do care about ARM and the like then we need some way to translate from the native Double encoding to an IEEE double external format. I don't know how to do that. I also worry we'll end up with lots of #ifdefs.
I'd say lots of #ifdefs are okay. This is a low-level library dealing with low-level architecture differences.
Yeah, maybe, but it makes me grumble. :-)
The other problem with doing this efficiently is that we have to worry about alignment for that poke d operation. If we don't know the alignment we have to poke into an aligned side buffer and copy over. Similar issues apply to reading.
Right now, efficiency is less of a concern to me than ease. I imagine the efficiency can be fixed up later?
Possibly. We would like to get as much efficiency for free, without having to clutter the library API with stuff. Alignment is one of the harder cases since it looks like we don't have quite enough information. It's not clear yet, we're still pondering it.
I'd think you could statically check the alignment with a bit of type hackery
Aye, we've thought a little about that, though not quite enough to have any concrete ideas to try yet.
(and note that I said I thought *you* could, not *I* could).
Yes, I note the difference :-).
Something like creating two monad types, an aligned one and an arbitrary one, and at run-time select which monad to use, so the check could occur just once.
That requires that the format is alignment preserving, which is information that is hard to recover given an api that has primitives for packing various sized objects in a sequence. Duncan

On Thu, Apr 19, 2007 at 10:20:21AM +1000, Duncan Coutts wrote:
and people will for ever be defining newtype wrappers or complaining that the whole library isn't parametrised by the endianness or whatever. For existing formats you need much more flexibility and control. The Binary class is to make it really convenient to serialise Haskell types, and it's built on top of the layer that gives you full control.
We intend to work more on this other side of the library in the coming couple of months. If you could tell us a bit more about your use case, that'd be great.
I just want to read in a file full of Doubles (written in binary format from C++) and print out text (into a pipe or file to be read by gnuplot). It's not a high-performance use (the file is only a megabyte or so), but it's something that *ought* to be easy, and so far as I can tell, it requires tricky hackery. I suppose I was just disappointed, because I'd figured that the Binary library was there to do what I wanted. :( It was something I could have done in five minutes (counting tuning the gnuplot file) in perl, and it's embarrassing (which makes it frustrating) to fail in Haskell to complete it in... I couldn't say how long, an hour or so? I know I could have used an Array, or used a Ptr and Storable, but this was supposed to be an easy safe scripting problem, and in my opinion neither of those qualify. -- David Roundy http://www.darcs.net

On Wed, 2007-04-18 at 21:12 -0700, David Roundy wrote:
On Thu, Apr 19, 2007 at 10:20:21AM +1000, Duncan Coutts wrote:
and people will for ever be defining newtype wrappers or complaining that the whole library isn't parametrised by the endianness or whatever. For existing formats you need much more flexibility and control. The Binary class is to make it really convenient to serialise Haskell types, and it's built on top of the layer that gives you full control.
We intend to work more on this other side of the library in the coming couple of months. If you could tell us a bit more about your use case, that'd be great.
I just want to read in a file full of Doubles (written in binary format from C++) and print out text (into a pipe or file to be read by gnuplot). It's not a high-performance use (the file is only a megabyte or so), but it's something that *ought* to be easy, and so far as I can tell, it requires tricky hackery.
Right, it should be simple. With the api I'm hacking on at the moment you just have primitives like (importing Data.Binary.Get as Get):
    Get.word :: Get Word
    Get.word8 :: Get Word8
and you'd need:
    Get.double :: Get Double --IEEE double format
As you noticed, that's not a primitive which we have yet. With that it should be really easy to get a sequence of them:
    Get.run (mapM (const Get.double) [0..n]) :: Lazy.ByteString -> [Double]
or whatever.
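The Get.double primitive wished for above is hypothetical in this thread. Later releases of the binary package do ship comparable primitives, so a sketch of the "file full of Doubles" case might look like the following, assuming a binary version with getDoublele and putDoublele and assuming the file was written little-endian. The names getDoubles, decodeDoubles and encodeDoubles are invented here.

```haskell
import Data.Binary.Get (Get, getDoublele, isEmpty, runGet)
import Data.Binary.Put (putDoublele, runPut)
import qualified Data.ByteString.Lazy as BL

-- Read little-endian IEEE doubles until the input is exhausted.
-- A trailing partial value (fewer than 8 bytes) makes runGet fail.
getDoubles :: Get [Double]
getDoubles = do
  done <- isEmpty
  if done
    then return []
    else (:) <$> getDoublele <*> getDoubles

-- Decode a whole lazy ByteString, e.g. the result of BL.readFile.
decodeDoubles :: BL.ByteString -> [Double]
decodeDoubles = runGet getDoubles

-- The matching writer, handy for producing test input.
encodeDoubles :: [Double] -> BL.ByteString
encodeDoubles = runPut . mapM_ putDoublele
```

A script along the lines described in the thread would then be roughly `BL.readFile path >>= mapM_ print . decodeDoubles`, with path standing in for the real file name.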
I suppose I was just disappointed, because I'd figured that the Binary library was there to do what I wanted. :( It was something I could have done in five minutes (counting tuning the gnuplot file) in perl, and it's embarrassing (which makes it frustrating) to fail in Haskell to complete it in... I couldn't say how long, an hour or so?
Yeah, we've concentrated so far on the serialisation of Haskell values, not reading/writing externally defined binary formats. I don't think we've been especially clear on that. But we do intend to tackle both. Duncan

Duncan Coutts wrote:
Yeah, we've concentrated so far on the serialisation of Haskell values, not reading/writing externally defined binary formats. I don't think we've been especially clear on that. But we do intend to tackle both.
Speaking for myself, I certainly didn't realise you were intending to solve these two different problems. Serialisation and binary data access are clearly quite different issues (though it makes sense for the one to layer on the other). Perhaps you should (a) be clearer in your propaganda and (b) give the two parts different names? Jules

On Thu, 2007-04-19 at 12:23 +0100, Jules Bean wrote:
Duncan Coutts wrote:
Yeah, we've concentrated so far on the serialisation of Haskell values, not reading/writing externally defined binary formats. I don't think we've been especially clear on that. But we do intend to tackle both.
Speaking for myself, I certainly didn't realise you were intending to solve these two different problems. Serialisation and binary data access are clearly quite different issues (though it makes sense for the one to layer on the other). Perhaps you should (a) be clearer in your propaganda and (b) give the two parts different names?
Aye, it was possibly a mistake to call it Data.Binary since people interpret that to mean whichever of those two problems that person needs to solve :-). We should rename the Haskell value serialisation part to Data.Binary.Serialise or something. Then you'll have to decide at the point you write your imports which kind of problem you're dealing with, by importing either Data.Binary.Serialise or Data.Binary.Get and .Put. That should help make it clearer to people. Duncan

On Wed, Apr 18, 2007 at 09:12:30PM -0700, David Roundy wrote:
I just want to read in a file full of Doubles (written in binary format from C++)
Note that if you write doubles from C++ then you need to read CDoubles in Haskell and then realToFrac them (which will presumably be optimised out in practice). Or alternatively you can work with HsDoubles in C++. Thanks Ian
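As a small illustration of the CDouble route, the sketch below reads C doubles through a pointer and converts them with realToFrac. The names readCDoubles and demo are made up, and the buffer here is filled from Haskell rather than by C++ code.

```haskell
import Foreign.C.Types (CDouble)
import Foreign.Marshal.Array (peekArray, withArray)
import Foreign.Ptr (Ptr)

-- Hypothetical helper: read n C doubles from memory and convert each one
-- to a Haskell Double.
readCDoubles :: Int -> Ptr CDouble -> IO [Double]
readCDoubles n p = map realToFrac <$> peekArray n p

-- Stand-in for a buffer produced by C++: marshal a few values into
-- temporary memory and read them back through readCDoubles.
demo :: IO [Double]
demo = withArray [1.5, -2.25, 0 :: CDouble] (readCDoubles 3)
```

The conversion goes through the type classes rather than the bit pattern, so it stays correct even on a platform where the two types differ.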

On Sun, Apr 22, 2007 at 10:36:17PM +0100, Ian Lynagh wrote:
On Wed, Apr 18, 2007 at 09:12:30PM -0700, David Roundy wrote:
I just want to read in a file full of Doubles (written in binary format from C++)
Note that if you write doubles from C++ then you need to read CDoubles in Haskell and then realToFrac them (which will presumably be optimised out in practice).
This is a one-off script, which doesn't need to be portable. Or, it would have been, if I hadn't written it in perl and octave. -- David Roundy http://www.darcs.net
participants (6)
- Brian Alliet
- David Roundy
- Duncan Coutts
- Ian Lynagh
- Jules Bean
- Stefan O'Rear