Binary and Serialize classes

It's been remarked to me that relying on the Binary and Serialize classes is dangerous, because there's no guarantee they will maintain a consistent format. So if my app uses the Serialize instances that come with cereal, it could suddenly fail to read old saves after an upgrade to cereal. However, neither binary nor cereal expose the underlying serialization algorithms for various types except through the instances, so I would have to copy and paste the code over if I want control over it. If I don't trust 'instance Serialize String' to not change behind my back, maybe I could at least trust 'Data.Serialize.stringToUtf8' to not change, since if it did the name would now be wrong.

Are these fears justified? I imagine if the Int instance for Serialize changed there would be an uproar and it would probably have to be changed back. I sent a bug to the maintainers of data-binary a long time ago about the Double instance not serializing -0, and they ignored it, probably because it would be bad to change the instance. So can I use the included instances without fear of them changing between releases? Of course I still run the risk of an instance from some other package changing, but I'm less likely to be using those.

Speaking of that Double instance... both data-binary and cereal use decodeFloat and encodeFloat, which means they suffer from the same problems as realToFrac, namely that -0 becomes 0 and NaN becomes -Infinity (I'm not sure why the latter happens, since the decoded values differ... must be a problem with encodeFloat). It's tempting to just get the ieee754 bitmap out and serialize that. I know I've seen this question around before, but is it possible to somehow cast a D# directly to bytes? I know I can write a C function and FFI that in, but it would be tidier to do it all in Haskell. I guess I can probably use castPtr and memCpy, but I don't see the required addressOf. I.e. how would I write 'memcpy(buf, &d, sizeof(double));'?

Speaking of that Double instance... both data-binary and cereal use decodeFloat and encodeFloat which mean they suffer from the same problems as realToFrac, namely that -0 becomes 0 and NaN becomes -Infinity (I'm not sure why the latter happens since the decoded values differ... must be a problem with encodeFloat). It's tempting to just get the ieee754 bitmap out and serialize that. I know I've seen this question around before, but is it possible to somehow cast a D# directly to bytes? I know I can write a C function and FFI that in, but it would be tidier to do it all in haskell. I guess I can probably use castPtr and memCpy, but I don't see the required addressOf. I.e. how would I write 'memcpy(buf, &d, sizeof(double));'?
Also, serialized data takes roughly three times more space than the IEEE 754 representation, which is a concern when a lot of data is serialized. Serialization in IEEE 754 format is already implemented for binary; see the data-binary-ieee754 package [1].

[1] http://hackage.haskell.org/package/data-binary-ieee754
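For example, a round-trip through that package might look like this (a minimal sketch, assuming the putFloat64be/getFloat64be names it exports):

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet)
import Data.Binary.IEEE754 (getFloat64be, putFloat64be)
import Data.Binary.Put (runPut)

-- Encode a Double as its raw big-endian IEEE 754 bit pattern (8 bytes).
encodeDouble :: Double -> BL.ByteString
encodeDouble = runPut . putFloat64be

-- Decode it back; -0 and NaN survive, unlike with decodeFloat/encodeFloat.
decodeDouble :: BL.ByteString -> Double
decodeDouble = runGet getFloat64be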

On Thu, Apr 28, 2011 at 10:00 AM, Evan Laforge wrote:
It's been remarked to me that relying on the Binary and Serialize classes is dangerous, because there's no guarantee they will maintain a consistent format. So if my app uses the Serialize instances that come with cereal, it could suddenly fail to read old saves after an upgrade to cereal.
However, neither binary nor cereal expose the underlying serialization algorithms for various types except through the instances, so I would have to copy and paste the code over if I want control over it. If I don't trust 'instance Serialize String' to not change behind my back, maybe I could at least trust 'Data.Serialize.stringToUtf8' to not change since if it did the name would now be wrong.
Are these fears justified? I imagine if the Int instance for Serialize changed there would be an uproar and it would probably have to be changed back. I sent a bug to the maintainers of data-binary a long time ago about the Double instance not serializing -0, and they ignored it, probably because it would be bad to change the instance. So can I use the included instances without fear of them changing between releases? Of course I still run the risk of an instance from some other package changing, but I'm less likely to be using those.
When I need to comply with a specific binary format, I never rely on Binary/Serialize class instances - I always fall back on the primitive operations on Words of varying sizes (perhaps defining my own type classes for convenience). The 'Builder' type makes this pretty easy. If I were writing binary data to disk, in my mind that would fall under "complying with a specific binary format".

I do, however, rely on the SafeCopy class (or the equivalent Happstack.Data.Serialize class) to be able to read its own data back from persistent storage - it is a specific design goal of the library, and the library has version support built in. If the authors of the library come up with a better way to store maps, I would expect them to bump the version tag for the stored data and provide automatic migration for old data.

Antoine
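For instance, a minimal sketch of that approach, using the Word primitives and 'Builder' from Data.Binary.Builder (the Record type here is made up for illustration):

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Builder (Builder, fromByteString, putWord32be, singleton, toLazyByteString)
import Data.Word (Word8)

-- A made-up record with a fixed on-disk layout we fully control:
-- one tag byte, a big-endian Word32 length, then the raw payload bytes.
data Record = Record { recTag :: Word8, recPayload :: B.ByteString }

putRecord :: Record -> Builder
putRecord (Record tag payload) = mconcat
  [ singleton tag
  , putWord32be (fromIntegral (B.length payload))
  , fromByteString payload
  ]

encodeRecord :: Record -> BL.ByteString
encodeRecord = toLazyByteString . putRecord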

When I need to comply with a specific binary format, I never rely on Binary/Serialize class instances - I always fall back on the primitive operations on Words of varying sizes (perhaps defining my own type classes for convenience). The 'Builder' type makes this pretty easy.
If I were writing binary data to disk, in my mind that would fall under "complying with a specific binary format".
Indeed, and I was starting to do that... well, I would make my own project specific Serialize class, since the type dispatch is useful. But copy pasting a UTF8 encoder, or the variable length Integer encoder, or whatever seemed kinda unpleasant. Surely we could expose that stuff in a library, whose explicit goal was that they *would* remain compatible ways to serialize various basic types, and then just reuse those functions? E.g. that is already done for words with the putWordN{be,le} functions, and is available separately for UTF8.
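A sketch of what that reuse could look like today, assuming the separately packaged UTF-8 encoder from utf8-string:

import qualified Codec.Binary.UTF8.String as UTF8  -- utf8-string package
import Data.Binary.Put (Put, putWord32be, putWord8)

-- A String encoder built only from pieces that are specified
-- independently of any Serialize instance: length-prefixed UTF-8 bytes.
putString :: String -> Put
putString s = do
  let bytes = UTF8.encode s   -- String -> [Word8]
  putWord32be (fromIntegral (length bytes))
  mapM_ putWord8 bytes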
I do, however, rely on the SafeCopy class (or the equivalent Happstack.Data.Serialize class) to be able to read its own data back from persistent storage - it is a specific design goal of the library
Indeed, I also wound up inventing my own versioning format, which looks basically the same as safecopy, except much simpler: I just put a version byte and then case on it for deserialization. However, I only put the version on things I have reason to change, and that doesn't include built-in data types like Integer or String. So it's all my own data types, whose instance declarations I control anyway, so I'm not really worried about those.
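A minimal sketch of that scheme, with a made-up project type (the names are hypothetical):

import Control.Applicative ((<$>), (<*>))
import Data.Serialize (Serialize (..), getWord8, putWord8)

-- Hypothetical project type: the version byte written by 'put' is the
-- only thing 'get' cases on, so old saves keep deserializing.
data Score = ScoreV0 String | ScoreV1 String Int

instance Serialize Score where
  put (ScoreV0 name)       = putWord8 0 >> put name
  put (ScoreV1 name tempo) = putWord8 1 >> put name >> put tempo
  get = do
    version <- getWord8
    case version of
      0 -> ScoreV0 <$> get
      1 -> ScoreV1 <$> get <*> get
      _ -> fail ("unknown Score version: " ++ show version)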

On Fri, Apr 29, 2011 at 10:25 AM, Evan Laforge wrote:
Indeed, and I was starting to do that... well, I would make my own project specific Serialize class, since the type dispatch is useful. But copy pasting a UTF8 encoder, or the variable length Integer encoder, or whatever seemed kinda unpleasant. Surely we could expose that stuff in a library, whose explicit goal was that they *would* remain compatible ways to serialize various basic types, and then just reuse those functions? E.g. that is already done for words with the putWordN{be,le} functions, and is available separately for UTF8.
I intend to add support for different UTF encodings to Data.Binary.Builder for this very reason. I also intend to add two functions to Data.Binary.Builder.Internal that let you implement variable length encoding as efficiently as possible.

I'm a bit skeptical of adding builders for different variable length encodings to the library, simply because there are so many possibilities. I think creating a binary-vle (for variable length encoding) package would be worthwhile. I have an implementation of the VLE used in protocol buffers.

Johan
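For reference, a rough sketch of that base-128 varint (just the general idea, not Johan's implementation):

import Data.Bits ((.&.), (.|.), shiftL, shiftR, testBit)
import Data.Binary.Get (Get, getWord8)
import Data.Binary.Put (Put, putWord8)
import Data.Word (Word64)

-- Each output byte carries 7 bits of the number, least significant
-- group first; the high bit of a byte means "more bytes follow".
putVarWord64 :: Word64 -> Put
putVarWord64 n
  | n < 0x80  = putWord8 (fromIntegral n)
  | otherwise = do
      putWord8 (fromIntegral (n .&. 0x7f) .|. 0x80)
      putVarWord64 (n `shiftR` 7)

getVarWord64 :: Get Word64
getVarWord64 = go 0 0
  where
    go shift acc = do
      b <- getWord8
      let acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
      if testBit b 7 then go (shift + 7) acc' else return acc'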

I'm a bit skeptical of adding builders for different variable length encodings to the library, simply because there are so many possibilities. I think creating a binary-vle (for variable length encoding) package would be worthwhile. I have an implementation of the VLE used in protocol buffers.
I didn't necessarily mean the general notion of variable length encodings, I just meant the encodings for the built-in types that are a little more complex than most. For example, export a putInteger that's documented to not change its encoding. Same for putString, and even trivial ones like putPair, putEtcEtc. Then you'd have to declare a bunch of boilerplate like 'instance MySerialize Integer where put = putInteger'. It would be annoying, but less annoying than copying the contents of putInteger everywhere, and you'd be guaranteed to be explicitly depending on all your implementations, so you could either use ones explicitly documented to be consistent, or write your own.
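A sketch of what that boilerplate might look like (the *Stable encoders below are placeholders; the point is just that each instance names the encoder it depends on):

{-# LANGUAGE FlexibleInstances, TypeSynonymInstances #-}
import Data.Char (ord)
import Data.Serialize.Put (Putter, putWord64be, putWord8)

-- Placeholder "stable" primitives; a real library would export these
-- with a documented guarantee that the encodings never change.
putIntegerStable :: Putter Integer
putIntegerStable = putWord64be . fromIntegral            -- placeholder encoding

putStringStable :: Putter String
putStringStable = mapM_ (putWord8 . fromIntegral . ord)  -- placeholder encoding

-- The project-local class just dispatches to explicitly named encoders.
class MySerialize a where
  put :: Putter a

instance MySerialize Integer where put = putIntegerStable
instance MySerialize String  where put = putStringStable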

On Thursday 28 April 2011 17:00:49, Evan Laforge wrote:
Speaking of that Double instance... both data-binary and cereal use decodeFloat and encodeFloat which mean they suffer from the same problems as realToFrac, namely that -0 becomes 0 and NaN becomes -Infinity (I'm not sure why the latter happens since the decoded values differ... must be a problem with encodeFloat).
Yes, encodeFloat - at least in ghc - first converts the mantissa (while there are words left, multiply the accumulator with 2^32 or 2^64, add next word to it), then adjusts the exponent via ldexp. That process can't produce NaNs. It would be possible to check for special patterns, but it's impossible to know whether that pattern arose from a decodeFloat and hence should be converted to NaN or from a calculation that simply overflowed and should be converted to ±Infinity.
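A tiny demonstration of the lossy round-trip under discussion (the expected results are the ones reported above):

-- Re-encode a Double through decodeFloat/encodeFloat, as the current
-- Binary/Serialize instances effectively do.
roundTrip :: Double -> Double
roundTrip d = uncurry encodeFloat (decodeFloat d)

main :: IO ()
main = do
  print (isNegativeZero (roundTrip (-0.0)))  -- False: the sign of -0 is lost
  print (roundTrip (0 / 0))                  -- NaN reportedly comes back as -Infinity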
It's tempting to just get the ieee754 bitmap out and serialize that.
data-binary-ieee754 does that, iirc.
I know I've seen this question around before, but is it possible to somehow cast a D# directly to bytes?
Not safely. unsafeCoerce :: Double -> Word64 will usually work, but if you want something reliable, write your Double to an STUArray and read a Word(64) from the result of castSTUArray. That's slower than unsafeCoerce, but supposed to be reliable (unless I misremember).
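A sketch of that STUArray trick (castSTUArray is in Data.Array.Unsafe in newer versions of the array package, Data.Array.ST in older ones):

{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (MArray, STUArray, newArray, readArray)
import Data.Array.Unsafe (castSTUArray)
import Data.Word (Word64)

-- Write the value into a one-element unboxed array, cast the array,
-- and read the same bytes back at the other type.
cast :: (MArray (STUArray s) a (ST s), MArray (STUArray s) b (ST s))
     => a -> ST s b
cast x = newArray (0 :: Int, 0) x >>= castSTUArray >>= flip readArray 0

doubleToWord64 :: Double -> Word64
doubleToWord64 d = runST (cast d)

word64ToDouble :: Word64 -> Double
word64ToDouble w = runST (cast w)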
I know I can write a C function and FFI that in, but it would be tidier to do it all in haskell. I guess I can probably use castPtr and memCpy, but I don't see the required addressOf. I.e. how would I write 'memcpy(buf, &d, sizeof(double));'?
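One way to spell that memcpy in pure Haskell is to go through a temporary buffer with Foreign.Marshal (a sketch; it trades the explicit memcpy for an unsafePerformIO):

import Data.Word (Word64)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (castPtr)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafePerformIO)

-- Poke the Double into a temporary buffer, then peek the same bytes
-- back as a Word64: morally memcpy(&w, &d, sizeof(double)).
doubleBits :: Double -> Word64
doubleBits d = unsafePerformIO $ alloca $ \p -> do
  poke p d
  peek (castPtr p)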

Hello,

You might consider using safecopy, which explicitly supports the case where the serialization format or the data structure itself changes and the data needs to be migrated to the new format?

http://acid-state.seize.it/safecopy
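For example, the migration story there looks roughly like this (hypothetical types, sketched from the safecopy documentation):

{-# LANGUAGE TemplateHaskell, TypeFamilies #-}
import Data.SafeCopy

-- A hypothetical document type: the old version and its replacement.
data Note_v0 = Note_v0 String
$(deriveSafeCopy 0 'base ''Note_v0)

data Note = Note { noteTitle :: String, noteBody :: String }

-- When safeGet finds data written at version 0, it runs this migration
-- automatically instead of failing to parse.
instance Migrate Note where
  type MigrateFrom Note = Note_v0
  migrate (Note_v0 body) = Note "untitled" body

$(deriveSafeCopy 1 'extension ''Note)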
- jeremy
participants (6)
- Alexey Khudyakov
- Antoine Latter
- Daniel Fischer
- Evan Laforge
- Jeremy Shaw
- Johan Tibell