Design suggestion for Data.Binary.Defer

Hi, I'm in the process of updating the Deferred Binary library, http://www-users.cs.york.ac.uk/~ndm/binarydefer/. The idea is that its does serialisation, but certain elements can be marked as deferred - instead of being written in the current file stream, they are merely pointed at and if needed, that pointer will be followed. Example:("hello","world"), where the first field is marked as deferred would write out: [6]"world""hello" i.e. [skip 6 characters if you want hello], the "world" data, the "hello" data we previously promised to put here. When reading, the "hello" would only be seeked to and read if necessary. So, its like binary, but some fields are lazy. The question is how to indicate which fields are lazy. There are three schemes I can think of: == Simple Instances == put (a,b) = putDefer a >> put b get = do a <- getDefer; b <- get; return (a,b) If the put/get and putDefer/getDefer items do not line up perfectly it will go very wrong at runtime - probably resulting in random values being created. You also can't derive the instances automatically, with something like Derive or DrIFT. == Complex Instances == This is the current scheme, based on lazy pattern matching and exceptions - very confusing, probably low performance. deferBoth = [\~(a,b) -> unit (,) <<~ a << b] Where <<~ a means write out the a field lazily, and << b means write out the b field strictly. The advantage over the simple instances is that a field being deferred is declared once. == Lazy Data Type == Instead of customizing the instance, you can write a data Defer a = Defer a type, and then instead of the original tuple write: (Defer "hello","world") But now the code must unwrap the Defer before accessing "hello", but the instance becomes much simpler, and can be derived. == The Question == Is there a simple way of tagging fields in a constructor as deferred, just once for reading and writing, and ideally outside the instance definition and not requiring additional code to unwrap? I can't think of any, but there may be something I have missed. Thanks Neil

On Mon, 2008-06-16 at 17:43 +0100, Neil Mitchell wrote:
Hi,
I'm in the process of updating the Deferred Binary library, http://www-users.cs.york.ac.uk/~ndm/binarydefer/. The idea is that its does serialisation, but certain elements can be marked as deferred - instead of being written in the current file stream, they are merely pointed at and if needed, that pointer will be followed.
Example:("hello","world"), where the first field is marked as deferred would write out:
[6]"world""hello"
i.e. [skip 6 characters if you want hello], the "world" data, the "hello" data we previously promised to put here. When reading, the "hello" would only be seeked to and read if necessary.
So, its like binary, but some fields are lazy. The question is how to indicate which fields are lazy. There are three schemes I can think of:
== Simple Instances ==
put (a,b) = putDefer a >> put b get = do a <- getDefer; b <- get; return (a,b)
If the put/get and putDefer/getDefer items do not line up perfectly it will go very wrong at runtime - probably resulting in random values being created. You also can't derive the instances automatically, with something like Derive or DrIFT.
== Complex Instances ==
This is the current scheme, based on lazy pattern matching and exceptions - very confusing, probably low performance.
deferBoth = [\~(a,b) -> unit (,) <<~ a << b]
Where <<~ a means write out the a field lazily, and << b means write out the b field strictly. The advantage over the simple instances is that a field being deferred is declared once.
== Lazy Data Type ==
Instead of customizing the instance, you can write a data Defer a = Defer a type, and then instead of the original tuple write:
(Defer "hello","world")
But now the code must unwrap the Defer before accessing "hello", but the instance becomes much simpler, and can be derived.
== The Question ==
Is there a simple way of tagging fields in a constructor as deferred, just once for reading and writing, and ideally outside the instance definition and not requiring additional code to unwrap? I can't think of any, but there may be something I have missed.
This isn't an answer to your question. This is just my opinion based on my values. I'd just immediately do option three. It seems the simplest, most composable and most flexible, and I like having things reflected in my types. It seems easy enough to code up the first option given the last. I don't really understand what you are getting at with the second option, but I suspect it too is easy to do on top of the third option.

On Mon, Jun 16, 2008 at 9:43 AM, Neil Mitchell
== The Question ==
Is there a simple way of tagging fields in a constructor as deferred, just once for reading and writing, and ideally outside the instance definition and not requiring additional code to unwrap? I can't think of any, but there may be something I have missed.
Is there any reason not to just write all lazy fields of variable size in a deferred manner? That would seem like the best option to me, but maybe that's just because I don't see any downside besides the requirement that one use some space to write the size. But that seems like it would almost always be a trivial issue, as any small data element will have a fixed size (so that the deferred/non-deferred distinction doesn't exist). David

Hi, Derek: Yes, I was slowly coming round to the idea of the third option. It does mean that if you want to change the deferredness of a field, you have to change the type, which is a substantial alteration. David:
Is there any reason not to just write all lazy fields of variable size in a deferred manner?
I completely hadn't though of this! You will loose a bit of time, for example reading fields which were previously not deferred will require a file seek, but the effort/thought/efficiency trade-off is good. I think I will go with this. Thanks Neil

On Tue, Jun 17, 2008 at 07:55:51AM +0100, Neil Mitchell wrote:
Hi,
Hello!
David:
Is there any reason not to just write all lazy fields of variable size in a deferred manner?
I completely hadn't though of this! You will loose a bit of time, for example reading fields which were previously not deferred will require a file seek, but the effort/thought/efficiency trade-off is good. I think I will go with this.
Actually, you ought to be able to pretty easily remove this tradeoff by introducing a strict read function as a new method in your class. So anyone who wants to strictly read lazy data can use that function instead of the lazy one. The writing of data is unaffected... or rather, it *is* affected, but only in the sense that writing deferred data cannot be as lazy as the writing of undeferred data. e.g. when writing a deferred list, you can't output the first word until you've already read through the entire list in order to find out how long it is. That could actually be quite a problem for code that relies on laziness to handle massive amounts of data. :( Lazy reading seems to require strict writing, while lazy writing requires strict reading! So perhaps you want a pair of read functions, one strict, one lazy, and a pair of write functions. That doesn't provide any sort of fine-grained control, but at least allows streaming in either direction. You could use the same data format for either strictly or lazily pickled data as long as you have a reserved value for the length-to-be-skipped, so the lazy reader could tell if it reaches data that whose length it doesn't know. David

Excerpts from David Roundy's message of Tue Jun 17 20:27:01 +0200 2008:
On Tue, Jun 17, 2008 at 07:55:51AM +0100, Neil Mitchell wrote:
Hi,
Hello!
Hello,
:( Lazy reading seems to require strict writing, while lazy writing requires strict reading!
I'm wondering if it would be an option to read the file backward. If so length annotations could be put at the end. Reading backward seems to only fail with streamed data, which won't support the "seek"ing required by a lazy reading anyway. Regards, -- Nicolas Pouillard aka Ertai

Hi
Actually, you ought to be able to pretty easily remove this tradeoff by introducing a strict read function as a new method in your class. So anyone who wants to strictly read lazy data can use that function instead of the lazy one.
Not quite, the library is written so that strict fields are at the current file pointer while deferred fields require a position jump. For example, two strict strings would be: "hello""world" But two lazy strings would be: [ptr 5][ptr 10]"hello""world"
:( Lazy reading seems to require strict writing, while lazy writing requires strict reading!
The library is all designed around making reading maximally efficient, and minimizing file seeks, but writing can be totally inefficient. As you mention, the lazy/strict IO trade off is very interesting, its just fortunate that for my application (Hoogle) the files are written once and read many many times.
I'm wondering if it would be an option to read the file backward. If so length annotations could be put at the end.
Reading backward seems to only fail with streamed data, which won't support the "seek"ing required by a lazy reading anyway.
You still want to minimize seeking, even if it is possible. In my case having length information at the front helps reading, and I don't care about writing, so its an easy win. I think I've got my design nailed down, and have most of it implemented and used in Hoogle. These discussions and points were very helpful! Thanks Neil
participants (5)
-
David Roundy
-
David Roundy
-
Derek Elkins
-
Neil Mitchell
-
Nicolas Pouillard