New subject: summary of storing highly shared data structures

22 Dec 2005

      Dear Haskell Experts,

for storing highly shared data structures we use so called Annotated
Terms (shortly ATerms, details below).

   http://www.cwi.nl/htbin/sen1/twiki/bin/view/SEN1/ATerm

In contrast to the Binary (or GhcBinary) class we compute the sharing,
which saves a lot of space for data types that keep redundant
information.

With this we can store some of our data structures (of course only
non-cyclic and finite ones) in a few KBs that need MBs if stored
without sharing (as when using the Binary or the Show/Read
classes).

So far so good. The problem remaining is that an object is _traversed_
as if being unshared and thus the _time_ for the ATermTable construction
becomes too long for us.

GHC's internal data structures (on the heap) are in many cases shared,
by pointer references. I.e. if I add a (single) symbol table to every 
symbol that I use, then the symbol table will not be copied, but only a
reference added to my symbol.

How can I detect this sharing in order to avoid traversing the very
same symbol table for every symbol?

I've tried to use a "Map (Ptr ()) ShATerm". So before traversing an
object I look up its address and check if is was traversed before (if
not I traverse it and store it in my map for future lookups).

1.) I'm not sure if it is safe to use "Ptr ()" (meanwhile garbage
collection, heap compaction and what not could happen).

2.) A single "Map (Ptr ()) ShATerm" does not work! It seems that
sometimes objects are shared that have different types (as tests
revealed). This seems obvious for newtypes but also happens (I think)
without newtype declarations.

However, shared ATerms are always different for different types,
because the corresponding data constructors are different.

So, I'm now thinking if I should try using the type of my data as key
as well, i.e. using "Map (TypeRep, Ptr ()) ShATerm" to avoid duplicate
traversals.

Would this be worth the effort? What other possibilities could I try?

An obvious solution is to avoid the redundancy in the first place
(i.e. simply don't store the symbol table with every symbol), but that
would require a major change of our sources, and would the code become
clearer? (Many functions would need more arguments.)

Finally, _reading in_ shared ATerms is fast, since ghc seems to exploit 
the sharing given by the injective ATermTable.

Thanks for any help
Christian

More details:

   Brand, M.G.J. van den, H.A. de Jong, P. Klint, and P.A. Olivier (2000).
   "Efficient Annotated Terms." Software -- Practice & Experience 
30:259--291.

   (The link "ATerm Library" http://www.cwi.nl/projects/MetaEnv/haterm/
   under http://www.haskell.org/libraries/#compilation looks outdated
   to me. We have extra source code that - at the moment - cannot be
   distributed separately)

Via an extension to DrIFT,
http://repetae.net/john/computer/haskell/DrIFT, we generate class
instances (similar to the Binary class) for our types that we want to
store.

Our ATerms are (basically) constructor terms of the form:

data ShATerm = ShAAppl String [Int] deriving (Eq, Ord)

(The annotation part of ATerms is omitted here since we don't use it)

The Int-list contains the arguments as indices into an ATermTable
being an injective "IntMap ShATerm" with update and lookup
functions:

addATerm :: ShATerm -> ATermTable -> (ATermTable, Int)
getATerm :: Int -> ATermTable -> ShATerm

The class (that DriFT derives instances for) looks as follows:

class ShATermConvertible t where
     toShATerm       :: ATermTable -> t -> (ATermTable, Int)
     fromShATerm     :: (ATermTable, Int) -> t

The automatically derived instance of i.e. a user-defined List data
type "data List a = Nil | Cons a (List a)" is:

instance ShATermConvertible a => ShATermConvertible (List a) where
     toShATerm att0 Nil =
         addATerm (ShAAppl "Nil" []) att0
     toShATerm att0 (Cons a b) =
         case toShATerm att0 a of { (att1, a') ->
         case toShATerm att1 b of { (att2, b') ->
         addATerm (ShAAppl "Cons" [a', b']) att2 }}
     fromShATerm (att, i) =
         case getATerm i att of
             ShAAppl "Nil" [] -> Nil
             ShAAppl "Cons" [a, b] ->
                     case fromShATerm (att, a) of { a' ->
                     case fromShATerm (att, b) of { b' ->
                     Cons a' b' }}
             _ -> error "ShATermConvertible List"

(Internally the ATermTable also keeps an inverse "Map ShATerm Int" to
support the efficient implementation of addATerm.)

We can use the "baffle" tool to convert our files from shared
(TAF) to unshared ATerm format and vice versa.

storing highly shared data structures

Christian Maeder

Einar Karttunen

Christian Maeder

Benjamin Franksen

Christian Maeder

Bulat Ziganshin

Christian Maeder

tags

participants (4)