[Haskell-cafe] Designing DSL with explicit sharing [was: I love purity, but it's killing me]

13 Feb 2008

      Tom Hawkins  wrote:
] My DSLs invariably define a datatype to capture expressions; something
] like this:
]
] data Expression
]   = Add Expression Expression
]   | Sub Expression Expression
]   | Variable String
]   | Constant Int
]   deriving Eq
] The problem comes when I want to generate efficient code from an
] Expression (ie. to C or some other target language).  The method I use
] invovles converting the tree of subexpressions into an acyclic graphic
] to eliminate common subexpressions.  The nodes are then topologically
] ordered and assigned an instruction, or statement for each node.  For
] example:
]
] let a = Add (Constant 10) (Variable "i1")
]     b = Sub (Variable "i2") (Constant 2)
]     c = Add a b
] The process of converting an expression tree to a graph uses either Eq
] or Ord (either derived or a custom instance) to search and build a set
] of unique nodes to be ordered for execution.  In this case "a", then
] "b", then "c".  The problem is expressions often have shared,
] equivalent subnodes, which dramatically grows the size of the tree.
] For example:
]
] let d = Add c c
]     e = Add d d    -- "e" now as 16 leaf nodes.
]
] As these trees grow in size, the equality comparison in graph
] construction quickly becomes the bottleneck for DSL compilation.

We show the design of the same sort of DSL that explicitly maintains
sharing information and where node comparisons are quick, because we
are comparing hashes rather than trees themselves. Our approach is
assuredly safe, pure, and Haskell98 (save for disabling of the
monomorphism restriction, which is done solely to avoid writing
signatures). No GHC-specific behavior is relied upon.
...
{-# OPTIONS_GHC -fno-monomorphism-restriction #-}
module DSL where
import Data.IntMap as IM
import Control.Monad.State
The approach is based on the final tagless representation. Here is our
DSL:
...
class Exp repr where
   constant :: Int -> repr Int
   variable :: String -> repr Int
   add :: repr Int -> repr Int -> repr Int
   sub :: repr Int -> repr Int -> repr Int
Tom Hawkins' test expressions now look as follows
...
a = add (constant 10) (variable "i1")
b = sub (variable "i2") (constant 2)
c = add a b
d = add c c
e = add d d
which is the same as before modulo the case of the identifiers:
everything is in lower case.

We can show the expressions as before: showing is one way
of evaluating things
...
newtype S t = S{unS :: String}
instance Exp S where
    constant x = S $ show x
    variable x = S x
    add e1 e2 = S( unS e1 ++ " + " ++ unS e2)
    sub e1 e2 = S( unS e1 ++ " - " ++ unS e2)
test_showe = unS e
*DSL> test_showe
"10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2"

We can write an evaluator for the expressions
...
type REnv = [(String,Int)]
newtype R t = R{unR :: REnv -> t} -- A reader Monad, actually
instance Exp R where
    constant x = R $ const x
    variable x = R ( \env -> maybe (error $ "no var: " ++ x) id $ 
           Prelude.lookup x env)
    add e1 e2 = R(\env -> unR e1 env + unR e2 env)
    sub e1 e2 = R(\env -> unR e1 env - unR e2 env)
...
test_vale = unR e [("i1",5),("i2",10)] -- 92
We stress: we are using exactly the same expression e as before. 
We are only evaluating it differently. The gist of the final tagless
approach is to write an expression once and evaluate it many times.

Now, we chose a different representation: to make sharing explicit
We chose not to rely on GHC; we don't care if in (add c c), the two
c are shared or copied. It is not observable in pure Haskell,
and we don't care. We build our acyclic graph nevertheless.
...
type ExpHash = Int
We stress: ACode is NOT a recursive data structure, so the comparison
of ACode values takes constant time!
...
data ACode = AConst ExpHash |
         AVar   ExpHash |
         AAdd   ExpHash |
         ASub   ExpHash
       deriving (Eq,Show)
...
data ExpMaps = ExpMaps{ hashcnt :: ExpHash, -- to generate new Hash
      	  ctmap :: IntMap Int,
      	  vrmap :: IntMap String,
      	  admap :: IntMap (ACode,ACode),
      	  sumap :: IntMap (ACode,ACode)}
       deriving Show
exmap0 = ExpMaps 0 IM.empty IM.empty IM.empty IM.empty
newtype A t = A{unA :: State ExpMaps ACode}
Granted, the following could be done far more efficiently: we need
bimaps.
...
loookupv :: Eq v => v -> IntMap v -> Maybe Int
loookupv v = IM.foldWithKey 
       (\k e z -> maybe (if e == v then Just k else Nothing) (const z) z) 
       Nothing

...
record con prj upd x = do
  s <- get
  maybe (do let s' = upd (s{hashcnt = succ (hashcnt s)})
                 (IM.insert (hashcnt s) x (prj s))
        put s'
        return (con $ hashcnt s))
        (return . con) $ loookupv x (prj s)
...
instance Exp A where
    constant x = A(record AConst ctmap (\s e -> s{ctmap = e}) x)
    variable x = A(record AVar   vrmap (\s e -> s{vrmap = e}) x)
    add e1 e2  = A(do
           h1 <- unA e1
           h2 <- unA e2
           record AAdd admap (\s e -> s{admap = e}) (h1,h2))
    sub e1 e2  = A(do
           h1 <- unA e1
           h2 <- unA e2
           record ASub sumap (\s e -> s{sumap = e}) (h1,h2))
Again, we are using the very same expression e we wrote at the very
beginning:
...
test_sme = runState (unA e) exmap0
*DSL> test_sme
(AAdd 8,
  ExpMaps {hashcnt = 9, 
    ctmap = fromList [(0,10),(4,2)], 
    vrmap = fromList [(1,"i1"),(3,"i2")], 
    admap = fromList [(2,(AConst 0,AVar 1)),(6,(AAdd 2,ASub 5)),
                      (7,(AAdd 6,AAdd 6)),(8,(AAdd 7,AAdd 7))], 
    sumap = fromList [(5,(AVar 3,AConst 4))]})

We retain all the information about expression 'e'. In addition, all
sharing is fully explicit. As we can see, the evaluation process finds
common subexpressions automatically.

[Haskell-cafe] Designing DSL with explicit sharing [was: I love purity, but it's killing me]

oleg＠okmij.org