
I want to sequence data structures in an efficient manner, to store them to files and to send them over the network. Simply deriving Show and Read is not very good, because it's space inefficient and read cannot give any output until the whole data structure is parsed. So I thought I should store them in some space efficient format. First problem: how to make them derivable (so that I don't have to write boilerplate class instances for all my data structures). I read the "derivable type classes" paper, but it's not implemented in ghc (only Unit and :+: and :*: are, which is not enough). So how to go about it? Using DrIFT? Template Haskell? Second problem: how should such a format look like? Ideally, I want to be able to write an infinite data structure (at least one containing loops). If that is not possible I want to be able to read as lazily as possible, this means traversing the data structure in breadth first order, so that a cons form can be reached quickly. Does anyone have any thoughts/pointers on this subject? It would surprise me if nobody has had this problem before. Regards, Martin -- Martin Norbäck d95mback@dtek.chalmers.se Kapplandsgatan 40 +46 (0)708 26 33 60 S-414 78 GÖTEBORG http://www.dtek.chalmers.se/~d95mback/ SWEDEN OpenPGP ID: 3FA8580B

hello, there were a lot of discussions on the library mailing list about deriving binary, which is related to what you are asking. i don't think that dealt with cyclic datatypes though. i don't think you can do much about that from within haskell, unless you somehow encoded the sharing explicitly in the datastructure (by adding node identifiers or something). bye iavor Martin Norbäck wrote:
I want to sequence data structures in an efficient manner, to store them to files and to send them over the network.
Simply deriving Show and Read is not very good, because it's space inefficient and read cannot give any output until the whole data structure is parsed.
So I thought I should store them in some space efficient format.
First problem: how to make them derivable (so that I don't have to write boilerplate class instances for all my data structures).
I read the "derivable type classes" paper, but it's not implemented in ghc (only Unit and :+: and :*: are, which is not enough).
So how to go about it? Using DrIFT? Template Haskell?
Second problem: how should such a format look like?
Ideally, I want to be able to write an infinite data structure (at least one containing loops). If that is not possible I want to be able to read as lazily as possible, this means traversing the data structure in breadth first order, so that a cons form can be reached quickly.
Does anyone have any thoughts/pointers on this subject? It would surprise me if nobody has had this problem before.
Regards,
Martin
-- ================================================== | Iavor S. Diatchki, Ph.D. student | | Department of Computer Science and Engineering | | School of OGI at OHSU | | http://www.cse.ogi.edu/~diatchki | ==================================================

I want to sequence data structures in an efficient manner, to store them to files and to send them over the network.
Ideally, I want to be able to write an infinite data structure (at least one containing loops). If that is not possible I want to be able to read as lazily as possible, this means traversing the data structure in breadth first order, so that a cons form can be reached quickly.
Does anyone have any thoughts/pointers on this subject? It would surprise me if nobody has had this problem before.
Hi Martin, If you can live with supporting only one compiler then you can get a lot of this done via the FFI. In buddha, my debugger for Haskell, I have to print pretty versions of arbitrary data structures, which is in many ways similar to what you want to do. I recently extracted the relevent code into a library for reifying: http://www.cs.mu.oz.au/~bjpop/code.html It gives: data Graph = ... reify :: a -> IO Graph prettyGraph :: Graph -> String Its not what you want but it might give you some ideas. GHC has quite a nice API for its runtime environment, and it is straightforward to write some C code that crawls over the heap. If you are careful not to cause any garbage collection then you can find cycles without much problem. Some issues are what to do about thunks? You could force them but they might diverge, so it would have to be incremental. Also your data structure might have functions inside it, which are going to be hard no matter what you do. Cheers, Bernie.

On Wednesday 20 August 2003 9:16 am, Bernard James POPE wrote:
If you can live with supporting only one compiler then you can get a lot of this done via the FFI.
In buddha, my debugger for Haskell, I have to print pretty versions of arbitrary data structures, which is in many ways similar to what you want to do.
[details omitted]
This interface looks pretty similar to the interfacein Hugs The module is hugs98/libraries/Hugs/Internals.hs: http://cvs.haskell.org/cgi-bin/cvsweb.cgi/hugs98/libraries/Hugs/Internals.hs?rev=1.2&content-type=text/x-cvsweb-markup&sortby=log It seems to me that, with a little tweaking of the Buddha and Hugs.Internals interfaces, we could provide the same interface on both Hugs and GHC. (It's also pretty simple to implement so it could potentially be added to NHC98 as well.) The major difference between the interfaces is that Buddha's thunk representation is a recursive datatype so the reify operation relies on lazyness to handle cycles and large substructures. Instead, Hugs.Internals thunk representation is a non-recursive datatype and 'classifyCell' (the equivalent of Buddha's reify) is an operation to classify a single thunk as an Int, Float, application, etc. and the program must explicitly traverse function arguments to examine them. This is a pretty small difference - the Buddha-style of interface is easily implemented in terms of the Hugs.Internals interface. Another potential difference is in the semantics. The semantics of the Hugs.Internals operations is difficult because they allow you to examine unevaluated thunks in the heap. For example, you might look at a thunk for: 1+2 and see any of the following: addInt (fromInteger <dictionary> 1) (fromInteger <dictionary> 2) addInt 1 (fromInteger <dictionary> 2) addInt 1 2 3 in fact, a legal (but bizarre) Haskell compiler could also return 4-1 1 + (1+1) id 3 or anything else whose value is 3. To reflect this in the semantics, we use the same trick that we use in the semantics of exception handling and use non-determinism. That is, a thunk is viewed as a set of all possible representations of the same _value_ and classifyCell merely selects one of those representations (which is why classifyCell is in the IO monad). Sadly, the resulting semantics isn't compositional since the semantics of an application e1 e2 contains both 'good' meanings based on the semantics of e1 and e2 [ x y | x <- [[e1]], y <- [[e2]] ] but also many expressions that have nothing to do with the semantics of e1 or of e2 but which just happen to evaluate to the same value (like the legal but bizarre reifications of 1+2 given above). Not having a compositional semantics is frequently regarded as a Bad Thing. -- Alastair Reid ps Hugs.GenericPrint in the same directory contains a generic printing function based on Hugs.Internal that produces output identical to the old Hugs printing mechanism. There's also code (Hugs.CVHAssert) to write assert functions where the assertions are about the size of the arguments (based on an idea by Cordelia Hall) and there's even some code to disassemble bytecode functions. pps I doubt that these modules have been tested in several years - minor bitrot may well have set in.

Hi, Alistair writes:
This interface looks pretty similar to the interfacein Hugs The module is hugs98/libraries/Hugs/Internals.hs:
Yes. You may recall that I had something even closer to the Hugs interface previously which I called GhcInternals. I modelled that on what Hugs provides. I even think that you suggested that we unify their designs. I agree, but I don't think I replied, sorry about that. I think I got tangled up in other problems and never got back to it. Noone else said anything about that version, so I assumed that there was very little interest in it - which is probably true.
It seems to me that, with a little tweaking of the Buddha and Hugs.Internals interfaces, we could provide the same interface on both Hugs and GHC. (It's also pretty simple to implement so it could potentially be added to NHC98 as well.)
I'm all for making a consistent interface. But I wonder what kind of interface people would like and what they would want to do with it, besides writing universal printers (which are very useful nonetheless). Things that seem important (to me) are: - Do you want to observe cycles and or sharing? Of course this is very hard to do because of garbage collection. In buddha I observe cycles but not other forms of sharing. To do this I have to be very careful not to cause garbage collection when I'm traversing the heap. I don't remember now whether you can observe cycles using the Hugs internals interface. I _think_ that it was possible but I don't remember. - Do you want the ability to force thunks or just stop when you see them? The incremental approach of the Hugs interface is nice, but I'm worried that it will make the detection of cycles harder - that's the reason I moved to reify in buddha away from Hugs style. (In buddha I just want to print things up to the extent that they are evaluated when reify is called, I stop when I see a thunk, cycle, nullary constructor). - There needs to be some support from the compilers/interpreters. Hugs already has this. Ghc has some of it, but I abuse the profiling system in order to get the names of constructors to be present on the heap. I'm not happy with this. Under normal compilation GHC doesn't keep constructor names around, as far as I know. However, the new debugger in GHC must do something to get names of things, so perhaps the -prof hack is no longer needed? GHC's heap representation is already quite intricate, and I wonder if the implementors/maintainers would not want to introduce more complexity for the sake of a library that few people might use? I don't know what the case is with NHC, though I seem to recall Malcolm or Olaf saying that such a thing would be possible, but I might be making that up. Cheers, Bernie.

Things that seem important (to me) are:
- Do you want to observe cycles and or sharing?
Yes.
Of course this is very hard to do because of garbage collection.
Only in the sense that holding onto an object pointer to detect a cycle might cause you to hold onto the object too - causing a space leak. Weak pointers probably play a useful role in solving this but I didn't explore this.
In buddha I observe cycles but not other forms of sharing. To do this I have to be very careful not to cause garbage collection when I'm traversing the heap. I don't remember now whether you can observe cycles using the Hugs internals interface. I _think_ that it was possible but I don't remember.
Yes, Hugs lets you see sharing.. The CVHAssert module makes use of cycle and sharing detection in calculating the heap space consumed by an object. The Hugs interface has: - raw objects of type 'alpha' (i.e., the Haskell type of the object) - object pointers of type 'Cell' (would be better called 'Thunk'?) which provide a layer of indirection to the object. This indirection lets you prod a thunk pointer ni various ways without evaluating the object it points to. These pointers support pointer equality tests with short-circuiting of indirection nodes. - Cell classifications - whether it is an Int, Float, application, thunk, etc.
- Do you want the ability to force thunks or just stop when you see them? The incremental approach of the Hugs interface is nice, but I'm worried that it will make the detection of cycles harder - that's the reason I moved to reify in buddha away from Hugs style. (In buddha I just want to print things up to the extent that they are evaluated when reify is called, I stop when I see a thunk, cycle, nullary constructor).
Detecting cycles is easy in the Hugs interface. You can continue observing below an unevaluated thunk if you wish - just print the name of the supercombinator and keep going (or disassemble the supercombinator if you wish). Or you can force an unevaluated thunk if that takes your fancy. -- Alastair

I want to sequence data structures in an efficient manner, to store them to files and to send them over the network.
Simply deriving Show and Read is not very good, because it's space inefficient and read cannot give any output until the whole data structure is parsed.
So I thought I should store them in some space efficient format.
First problem: how to make them derivable (so that I don't have to write boilerplate class instances for all my data structures).
I read the "derivable type classes" paper, but it's not implemented in ghc (only Unit and :+: and :*: are, which is not enough).
So how to go about it? Using DrIFT? Template Haskell?
[Shameless plug]: This is typically the kind of thing Generic Haskell was designed for: http://www.generic-haskell.org/ For more information you can also consult your local Generic Programming expert Patrik Jansson. -- Johan Jeuring
participants (5)
-
Alastair Reid
-
Bernard James POPE
-
Iavor Diatchki
-
Johan Jeuring
-
Martin Norbäck