
On Wed, 2009-10-07 at 11:32 -0400, Joe Fredette wrote:
> So, if I understand this:
>
> Parsec is a DSL, I'm going to venture it's a "Deep embedding" -- I don't
> understand the internals, but if I were to build something like Parsec,
> I would probably build up a "Parser" datastructure and then apply
> optimizations to it, then "run" it with another function.
>
> Am I on the right track here?
Parsec, like most other parser combinator libraries, is a shallowly embedded DSL. The "Parser a" type is a Haskell function that does the parsing, i.e. a function of type String -> Maybe (String, a). (Obviously, the real Parsec library allows more than strings and has better error reporting than this type, but this is the basic idea.) You can't analyse it any further---you can't transform it into another grammar to optimise it, or print it out---because the information about what things it accepts has been locked up inside a non-analysable Haskell function. The only thing you can do with it is feed it input and see what happens.

A deep embedding of a parsing DSL (really a context-free grammar DSL) would look something like the following. I think I saw something like this in the Agda2 code somewhere, but I stumbled across it when I was trying to work out what "free" applicative functors were.

First we define what a production with a semantic action is, parameterised by the type of non-terminals in our grammar and the result type:
data Production nt a
  = Stop a
  | Terminal Char (Production nt a)
  | forall b. NonTerminal (nt b) (Production nt (b -> a))
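(To actually compile these snippets you'll need the ExistentialQuantification and GADTs extensions, plus RankNTypes for the Grammar type further down, and Control.Applicative and Data.Char imported for the later definitions.)

Just to make the shape of a production concrete, here is a little example, not essential to the construction: a production that matches an opening parenthesis, then whatever some nonterminal x matches, then a closing parenthesis, and returns x's value unchanged:

-- For illustration: a production for "( <x> )" that returns x's result.
parenthesised :: nt b -> Production nt b
parenthesised x = Terminal '(' (NonTerminal x (Terminal ')' (Stop id)))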
You can think of a production as a list of either terminals or non-terminals, terminated by the "value" of that production. The non-regular nested type argument in NonTerminal means that the final value can depend on the values that will be returned when parsing the strings that match other non-terminals. Productions are functors:
instance Functor (Production nt) where
  fmap f (Stop a)           = Stop (f a)
  fmap f (Terminal c p)     = Terminal c (fmap f p)
  fmap f (NonTerminal nt p) = NonTerminal nt (fmap (fmap f) p)
They are also applicative functors:
instance Applicative (Production nt) where
  pure                       = Stop
  (Stop f)           <*> a   = fmap f a
  (Terminal c t)     <*> a   = Terminal c (t <*> a)
  (NonTerminal nt t) <*> a   = NonTerminal nt (fmap flip t <*> a)
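Just to check what this does: sequencing two single-character productions with <*> concatenates them and pairs up their results:

pairAB :: Production nt (Char, Char)
pairAB = (,) <$> Terminal 'a' (Stop 'a') <*> Terminal 'b' (Stop 'b')

-- pairAB works out to: Terminal 'a' (Terminal 'b' (Stop ('a','b')))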
A rule in one of our grammars is just a list of alternative productions:
newtype Rule nt a = Rule [Production nt a]
Since lists are (applicative) functors and (applicative) functors compose, Rule nt is also a Functor and an Applicative functor:
instance Functor (Rule nt) where
  fmap f (Rule l) = Rule (fmap (fmap f) l)
instance Applicative (Rule nt) where
  pure x                  = Rule $ pure (pure x)
  (Rule lf) <*> (Rule la) = Rule $ (<*>) <$> lf <*> la
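To see what the composed instances give us: sequencing two rules multiplies out their alternatives, so pairing a two-alternative rule with another two-alternative rule gives a rule with four productions:

ab, cd :: Rule nt Char
ab = Rule [Terminal 'a' (Stop 'a'), Terminal 'b' (Stop 'b')]
cd = Rule [Terminal 'c' (Stop 'c'), Terminal 'd' (Stop 'd')]

pairs :: Rule nt (Char, Char)
pairs = (,) <$> ab <*> cd   -- four productions: "ac", "ad", "bc", "bd"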
It is also an instance of Alternative, because we composed with lists:
instance Alternative (Rule nt) where
  empty                   = Rule []
  (Rule r1) <|> (Rule r2) = Rule $ r1 <|> r2
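And <|> simply appends the lists of alternatives, so, reusing ab and cd from above:

abcd :: Rule nt Char
abcd = ab <|> cd   -- one rule with all four single-character alternatives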
A grammar is a map from nonterminals to rules, which are lists of alternative productions, which may themselves refer back to nonterminals in the grammar:
type Grammar nt = forall a. nt a -> Rule nt a
Given a value of type "Grammar nt", and a starting nonterminal of type "nt a" for some "a", one can easily write a function that translates it into a Parsec parser to do actual parsing, or implement a different parsing strategy using memoisation or something similar. The translation to a traditional parser combinator library is actually an (indexed) homomorphism of applicative functors plus the extra operations, which is pretty cool. If you also know some extra facts about the "nt" type (e.g. that it is finite), then it should be possible to implement a CYK or Earley parser using this, or to print out the grammar (for documentation purposes, or for telling another node in a distributed network what things you accept, for instance).

Note that these grammars are strictly less powerful than the ones that can be expressed using Parsec, because each rule only offers a fixed range of alternatives, rather than allowing previously parsed input to determine what the parser will accept in the future. This is the fundamental reason for using the applicative functor interface rather than the monad interface here.

I'll give an example grammar for parsing expressions modelled by the following data type:
data Expr = ENum Int | ESum Expr Expr | EProduct Expr Expr deriving Show
To define a grammar in this formalism, one first has to define the set of nonterminals that one wants to use:
data NT a where
  Value   :: NT Expr
  Product :: NT Expr
  Sum     :: NT Expr
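As an aside, the finiteness of the nonterminal type mentioned earlier can be witnessed by enumerating its members under an existential wrapper; a grammar printer or a table-driven parser would start from something like this:

-- For illustration: a value-level enumeration of the nonterminals.
data SomeNT nt = forall a. SomeNT (nt a)

allNTs :: [SomeNT NT]
allNTs = [SomeNT Value, SomeNT Product, SomeNT Sum]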
Now, a grammar is simply a function from members of this type to rules (lists of alternative productions). We use the applicative/alternative functor interface to build up the productions. Conor's SHE would make this look a lot nicer, using idiom brackets.
myGrm :: Grammar NT
myGrm Value   = ENum <$> posInt
            <|> id <$ char '(' <*> nt Sum <* char ')'
myGrm Product = EProduct <$> nt Value <* char '*' <*> nt Product
            <|> id <$> nt Value
myGrm Sum     = ESum <$> nt Product <* char '+' <*> nt Sum
            <|> id <$> nt Product
This needs a couple of simple functions to make things look nice:
char :: Char -> Rule nt ()
char c = Rule [Terminal c $ Stop ()]

nt :: nt a -> Rule nt a
nt nonterminal = Rule [NonTerminal nonterminal $ Stop id]
And a general definition for parsing single digits and positive integers. This works for any set of non-terminals, so it is a reusable component that can be used in any grammar:
choice :: Alternative f => [f a] -> f a
choice = foldl (<|>) empty

digit :: Rule nt Int
digit = choice [ x <$ char (intToDigit x) | x <- [0..9] ]

posInt :: Rule nt Int
posInt = fix 1 . reverse <$> some digit
  where fix n []     = 0
        fix n (d:ds) = d*n + fix (n*10) ds
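As a quick sanity check on the "fix" helper: (fix 1 . reverse) [1,2,3] = 3*1 + 2*10 + 1*100 = 123, as intended.

Finally, here is a rough sketch of the sort of interpretation described above: instead of translating to Parsec, it reads a Grammar directly using the naive list-of-successes parser type String -> [(a, String)], simply trying each alternative production of a rule in turn (so it only makes sense for rules with finitely many productions). The names parseProduction, parseNT and runGrammar are just made up for illustration:

parseProduction :: Grammar nt -> Production nt a -> String -> [(a, String)]
parseProduction _ (Stop a)          s      = [(a, s)]
parseProduction g (Terminal c p)    (x:xs) | c == x = parseProduction g p xs
parseProduction _ (Terminal _ _)    _      = []
parseProduction g (NonTerminal n p) s      =
    [ (f b, s'') | (b, s')  <- parseNT g n s
                 , (f, s'') <- parseProduction g p s' ]

parseNT :: Grammar nt -> nt a -> String -> [(a, String)]
parseNT g n s = case g n of
    Rule ps -> concat [ parseProduction g p s | p <- ps ]

-- Keep only the parses that consume all of the input.
runGrammar :: Grammar nt -> nt a -> String -> [a]
runGrammar g start s = [ a | (a, "") <- parseNT g start s ]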
Bob

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.