Parser inversions [Was: CC-delcont-0.1...]

2 Aug 2007

      Dan Doel wrote about `inverting' a parser -- first, a pure parser
consuming a string and later a parser written in a monadic style and
consuming a monadic list:

    data MList' m a = MNil | MCons a (MList m a)
    type MList m a = m (MList' m a)

The second attempt proved fully successful:
...
So, success! Tokens are printed out as soon as the lexer is able to recognize
them, properly interleaved with other IO side effects, and resuming from an
intermediate parse does not cause duplication of output.
Indeed, one may say `inverting' a parser is differentiating it. That
is, we wish parse incrementally, to observe semantic actions as they
are being performed. If the parser is a pure function then the only
possible observation is that of its result given a specific input. To
differentiate such a parser we feed it progressively longer inputs and
observe differences in the output, the stream of parsed tokens. The
first attempt followed this `classical calculus' recipe and wasn't
fully successful, even though the parser used in that attempt was a
good, _continuous_ parser. The parser for conventional arithmetic
expressions such as "1+2" can generate parsed token 1 after it saw the
first two characters, no matter what follows further in the input
stream. Giving longer input to the parser will _progressively_ cause
longer output.

Alas, there are _discontinuous_ parsers, such XML DOM parsers. 
Let's consider the input  
	<root><a>text</a><b/></root>

Until a pure parser sees </root> (that is, the entire input document) it
produces nothing at all (except errors that the XML document is
ill-formed). Indeed, ill-formed documents have no Infoset (no
`meaning') and hence DOM parsers must not return any parsed data for
incomplete documents. As in calculus, differentiating such
discontinuous functions is not helpful.

Now, if the parser is monadic then effects and their sequence become
observable too. We can truly observe the sequence of executed semantic
actions. These actions are executed incrementally, as the parser
recognized a part of input to which a semantic action
applies. Delimited continuations, as a `scriptable debugger', let us
observe the execution of the parser step by step.

One way to differentiate is to embed `breakpoints' in the input
stream, in the get-char-like routine the parser uses to obtain next
piece of input. Such routines are typically under user control, who
can embed breakpoints, aka shift. The MList data structure is a clever
way to `hide' get-char routines out of sight.

Another way to observe the sequence of actions is to embed breakpoints
into semantic actions themselves. In many parsers, the grammar with
semantic actions is provided by users and thus again under their
control. This approach can `invert' a DOM XML parser into a streaming,
pull-like SAX XML parser, provided the DOM parser exports suitable
callbacks:

	http://okmij.org/ftp/Scheme/xml.html#SSAX-pull
...
So, that wasn't really that hard to hack up. However, I should mention
that it wasn't trivial, either. When converting list functions to
MList functions, you have to be very careful not to perform side
effects twice.
Indeed: the major complexity of LogicT was making sure effects are
not done twice.
...
... many recursive data structures can be expressed as the fixed-point
of shape functors. The kicker is, you can get the monadic version for
free.
Indeed, open recursion at the term level lets us transparently
interpose, for example, memoization. The benefits of coding recursive
data types with explicit open recursion have been described in

	Two-Level Types and Parameterized Modules
	Time Sheard and Emir Pasalic
	JFP v14 (5):547-587, Sept. 2004
	http://web.cecs.pdx.edu/~sheard/papers/JfpPearl.ps
In particular, please see the Section 8, History, near the end of the
paper.

Thank you indeed for packaging CC monad and LogicT!

oleg＠pobox.com

tags

participants (1)