storing highly shared data structures

Dear Haskell Experts, for storing highly shared data structures we use so called Annotated Terms (shortly ATerms, details below). http://www.cwi.nl/htbin/sen1/twiki/bin/view/SEN1/ATerm In contrast to the Binary (or GhcBinary) class we compute the sharing, which saves a lot of space for data types that keep redundant information. With this we can store some of our data structures (of course only non-cyclic and finite ones) in a few KBs that need MBs if stored without sharing (as when using the Binary or the Show/Read classes). So far so good. The problem remaining is that an object is _traversed_ as if being unshared and thus the _time_ for the ATermTable construction becomes too long for us. GHC's internal data structures (on the heap) are in many cases shared, by pointer references. I.e. if I add a (single) symbol table to every symbol that I use, then the symbol table will not be copied, but only a reference added to my symbol. How can I detect this sharing in order to avoid traversing the very same symbol table for every symbol? I've tried to use a "Map (Ptr ()) ShATerm". So before traversing an object I look up its address and check if is was traversed before (if not I traverse it and store it in my map for future lookups). 1.) I'm not sure if it is safe to use "Ptr ()" (meanwhile garbage collection, heap compaction and what not could happen). 2.) A single "Map (Ptr ()) ShATerm" does not work! It seems that sometimes objects are shared that have different types (as tests revealed). This seems obvious for newtypes but also happens (I think) without newtype declarations. However, shared ATerms are always different for different types, because the corresponding data constructors are different. So, I'm now thinking if I should try using the type of my data as key as well, i.e. using "Map (TypeRep, Ptr ()) ShATerm" to avoid duplicate traversals. Would this be worth the effort? What other possibilities could I try? An obvious solution is to avoid the redundancy in the first place (i.e. simply don't store the symbol table with every symbol), but that would require a major change of our sources, and would the code become clearer? (Many functions would need more arguments.) Finally, _reading in_ shared ATerms is fast, since ghc seems to exploit the sharing given by the injective ATermTable. Thanks for any help Christian More details: Brand, M.G.J. van den, H.A. de Jong, P. Klint, and P.A. Olivier (2000). "Efficient Annotated Terms." Software -- Practice & Experience 30:259--291. (The link "ATerm Library" http://www.cwi.nl/projects/MetaEnv/haterm/ under http://www.haskell.org/libraries/#compilation looks outdated to me. We have extra source code that - at the moment - cannot be distributed separately) Via an extension to DrIFT, http://repetae.net/john/computer/haskell/DrIFT, we generate class instances (similar to the Binary class) for our types that we want to store. Our ATerms are (basically) constructor terms of the form: data ShATerm = ShAAppl String [Int] deriving (Eq, Ord) (The annotation part of ATerms is omitted here since we don't use it) The Int-list contains the arguments as indices into an ATermTable being an injective "IntMap ShATerm" with update and lookup functions: addATerm :: ShATerm -> ATermTable -> (ATermTable, Int) getATerm :: Int -> ATermTable -> ShATerm The class (that DriFT derives instances for) looks as follows: class ShATermConvertible t where toShATerm :: ATermTable -> t -> (ATermTable, Int) fromShATerm :: (ATermTable, Int) -> t The automatically derived instance of i.e. a user-defined List data type "data List a = Nil | Cons a (List a)" is: instance ShATermConvertible a => ShATermConvertible (List a) where toShATerm att0 Nil = addATerm (ShAAppl "Nil" []) att0 toShATerm att0 (Cons a b) = case toShATerm att0 a of { (att1, a') -> case toShATerm att1 b of { (att2, b') -> addATerm (ShAAppl "Cons" [a', b']) att2 }} fromShATerm (att, i) = case getATerm i att of ShAAppl "Nil" [] -> Nil ShAAppl "Cons" [a, b] -> case fromShATerm (att, a) of { a' -> case fromShATerm (att, b) of { b' -> Cons a' b' }} _ -> error "ShATermConvertible List" (Internally the ATermTable also keeps an inverse "Map ShATerm Int" to support the efficient implementation of addATerm.) We can use the "baffle" tool to convert our files from shared (TAF) to unshared ATerm format and vice versa.

On 22.12 14:43, Christian Maeder wrote:
How can I detect this sharing in order to avoid traversing the very same symbol table for every symbol?
By using System.Mem.StableName SerTH (http://cs.helsinki.fi/u/ekarttun/SerTH/) implements this, so you can look at the source for pointers.
I've tried to use a "Map (Ptr ()) ShATerm". So before traversing an object I look up its address and check if is was traversed before (if not I traverse it and store it in my map for future lookups).
GHC can move the objects in memory.
2.) A single "Map (Ptr ()) ShATerm" does not work! It seems that sometimes objects are shared that have different types (as tests revealed). This seems obvious for newtypes but also happens (I think) without newtype declarations.
Yes, this is quite evil. For example the empty list can be shared across type boundaries. Also phantom types can share the same address. In addition you have to worry about libraries using unsafeCoerce#
So, I'm now thinking if I should try using the type of my data as key as well, i.e. using "Map (TypeRep, Ptr ()) ShATerm" to avoid duplicate traversals.
TypeRep is not Ord iirc, which makes this harder. Of course you can show it and then use that. - Einar

Einar Karttunen wrote:
On 22.12 14:43, Christian Maeder wrote:
How can I detect this sharing in order to avoid traversing the very same symbol table for every symbol?
By using System.Mem.StableName SerTH (http://cs.helsinki.fi/u/ekarttun/SerTH/) implements this, so you can look at the source for pointers.
Thanks for your hints. What a pity that "makeStableName" goes to IO (why would this break referential transparency?) and that "StableName a" and "TypeRep" have no Ord instances. Facing RealWorld (and HashTable), Christian

On Thursday 29 December 2005 18:22, Christian Maeder wrote:
Einar Karttunen wrote:
On 22.12 14:43, Christian Maeder wrote:
How can I detect this sharing in order to avoid traversing the very same symbol table for every symbol?
By using System.Mem.StableName SerTH (http://cs.helsinki.fi/u/ekarttun/SerTH/) implements this, so you can look at the source for pointers.
Thanks for your hints. What a pity that "makeStableName" goes to IO (why would this break referential transparency?) and that "StableName a" and "TypeRep" have no Ord instances.
This has been asked before (by me too). The reason is that (due to efficient implementation issues) the order relation between TypeReps is not guaranteed to be the same for every program run. I guess the situation is similar with makeStableName, which is in the IO monad because the result may be different between program runs even given the same value as input. Ben

I wrote:
However, shared ATerms are always different for different types, because the corresponding data constructors are different.
This isn't quite true. The shared ATerm for the empty list is the same for all instances.
Finally, _reading in_ shared ATerms is fast, since ghc seems to exploit the sharing given by the injective ATermTable.
This was also wrong. The ATermTable as "IntMap ShATerm" was not enough, because all objects were constructed anew from the same or different ShATerms. I had to setup an additional "IntMap Dynamic" for creating shared data structures. It was no fun to make everything "Typeable" (and having no Ord instance for TypeRep). Simon suggested to use StableName or StablePtr. I did not use StablePtr, because it lacks a hashing function! I wasn't happy with StableName either, because 1) it required IO 2) System.Mem.StableName is marked "non-portable" (without a reason in the haddock documentation) 3) There was no coercion function "StableName a -> StableName b" I coerced all my objects with makeStableName and GHC.Prim.unsafeCoerce# to "StableName ()" (I became non-portable, anywy.) Now I could use "Data.HashTable (StableName ()) Value" since I was in the IO monad anyway. (The Value is basically an Int, but here I want to avoid the potential confusion with the key.) For the sake of interest I also tried "IntMap [(StableName (), Value)]" using hashStableName as IntMap key. (An IntMap is also used in http://cs.helsinki.fi/u/ekarttun/SerTH that Einar Karttunen pointed to) It turned out that the IntMap was not slower than the HashTable (What is HashTable good for, then? Why is it so slow?) Since I used hashStableName to get a key for my IntMap, I see no reason why StableName should not have an Ord instance, in order to avoid the whole hashing with lists (although a "Data.Map (StableName ()) Value" may not be faster). Summerizing, the user-friendliness (at least) of System.Mem.StableName should be improved (also with respect to the "+RTS -A10m" business for performing makeStableName). Cheers Christian

Hello Christian, Wednesday, January 11, 2006, 5:13:25 PM, you wrote: CM> It turned out that the IntMap was not slower than the HashTable (What is CM> HashTable good for, then? Why is it so slow?) see the http://cvs.haskell.org/trac/ghc/ticket/650 and the attached letter. "-A10m" should help in this case -- Best regards, Bulat mailto:bulatz@HotPOP.com

Hi Bulat,
The difference between IntMap and HashTable is not large despite -A10m
(without this option HashTable is unusable).
HashTable:
<
Hello Christian,
Wednesday, January 11, 2006, 5:13:25 PM, you wrote:
CM> It turned out that the IntMap was not slower than the HashTable (What is CM> HashTable good for, then? Why is it so slow?)
see the http://cvs.haskell.org/trac/ghc/ticket/650
and the attached letter. "-A10m" should help in this case
participants (4)
-
Benjamin Franksen
-
Bulat Ziganshin
-
Christian Maeder
-
Einar Karttunen