
Dear Haskell Experts, for storing highly shared data structures we use so called Annotated Terms (shortly ATerms, details below). http://www.cwi.nl/htbin/sen1/twiki/bin/view/SEN1/ATerm In contrast to the Binary (or GhcBinary) class we compute the sharing, which saves a lot of space for data types that keep redundant information. With this we can store some of our data structures (of course only non-cyclic and finite ones) in a few KBs that need MBs if stored without sharing (as when using the Binary or the Show/Read classes). So far so good. The problem remaining is that an object is _traversed_ as if being unshared and thus the _time_ for the ATermTable construction becomes too long for us. GHC's internal data structures (on the heap) are in many cases shared, by pointer references. I.e. if I add a (single) symbol table to every symbol that I use, then the symbol table will not be copied, but only a reference added to my symbol. How can I detect this sharing in order to avoid traversing the very same symbol table for every symbol? I've tried to use a "Map (Ptr ()) ShATerm". So before traversing an object I look up its address and check if is was traversed before (if not I traverse it and store it in my map for future lookups). 1.) I'm not sure if it is safe to use "Ptr ()" (meanwhile garbage collection, heap compaction and what not could happen). 2.) A single "Map (Ptr ()) ShATerm" does not work! It seems that sometimes objects are shared that have different types (as tests revealed). This seems obvious for newtypes but also happens (I think) without newtype declarations. However, shared ATerms are always different for different types, because the corresponding data constructors are different. So, I'm now thinking if I should try using the type of my data as key as well, i.e. using "Map (TypeRep, Ptr ()) ShATerm" to avoid duplicate traversals. Would this be worth the effort? What other possibilities could I try? An obvious solution is to avoid the redundancy in the first place (i.e. simply don't store the symbol table with every symbol), but that would require a major change of our sources, and would the code become clearer? (Many functions would need more arguments.) Finally, _reading in_ shared ATerms is fast, since ghc seems to exploit the sharing given by the injective ATermTable. Thanks for any help Christian More details: Brand, M.G.J. van den, H.A. de Jong, P. Klint, and P.A. Olivier (2000). "Efficient Annotated Terms." Software -- Practice & Experience 30:259--291. (The link "ATerm Library" http://www.cwi.nl/projects/MetaEnv/haterm/ under http://www.haskell.org/libraries/#compilation looks outdated to me. We have extra source code that - at the moment - cannot be distributed separately) Via an extension to DrIFT, http://repetae.net/john/computer/haskell/DrIFT, we generate class instances (similar to the Binary class) for our types that we want to store. Our ATerms are (basically) constructor terms of the form: data ShATerm = ShAAppl String [Int] deriving (Eq, Ord) (The annotation part of ATerms is omitted here since we don't use it) The Int-list contains the arguments as indices into an ATermTable being an injective "IntMap ShATerm" with update and lookup functions: addATerm :: ShATerm -> ATermTable -> (ATermTable, Int) getATerm :: Int -> ATermTable -> ShATerm The class (that DriFT derives instances for) looks as follows: class ShATermConvertible t where toShATerm :: ATermTable -> t -> (ATermTable, Int) fromShATerm :: (ATermTable, Int) -> t The automatically derived instance of i.e. a user-defined List data type "data List a = Nil | Cons a (List a)" is: instance ShATermConvertible a => ShATermConvertible (List a) where toShATerm att0 Nil = addATerm (ShAAppl "Nil" []) att0 toShATerm att0 (Cons a b) = case toShATerm att0 a of { (att1, a') -> case toShATerm att1 b of { (att2, b') -> addATerm (ShAAppl "Cons" [a', b']) att2 }} fromShATerm (att, i) = case getATerm i att of ShAAppl "Nil" [] -> Nil ShAAppl "Cons" [a, b] -> case fromShATerm (att, a) of { a' -> case fromShATerm (att, b) of { b' -> Cons a' b' }} _ -> error "ShATermConvertible List" (Internally the ATermTable also keeps an inverse "Map ShATerm Int" to support the efficient implementation of addATerm.) We can use the "baffle" tool to convert our files from shared (TAF) to unshared ATerm format and vice versa.