Parser inversions [Was: CC-delcont-0.1...]

Dan Doel wrote about `inverting' a parser -- first, a pure parser consuming a string and later a parser written in a monadic style and consuming a monadic list: data MList' m a = MNil | MCons a (MList m a) type MList m a = m (MList' m a) The second attempt proved fully successful:
So, success! Tokens are printed out as soon as the lexer is able to recognize them, properly interleaved with other IO side effects, and resuming from an intermediate parse does not cause duplication of output.
Indeed, one may say `inverting' a parser is differentiating it. That is, we wish parse incrementally, to observe semantic actions as they are being performed. If the parser is a pure function then the only possible observation is that of its result given a specific input. To differentiate such a parser we feed it progressively longer inputs and observe differences in the output, the stream of parsed tokens. The first attempt followed this `classical calculus' recipe and wasn't fully successful, even though the parser used in that attempt was a good, _continuous_ parser. The parser for conventional arithmetic expressions such as "1+2" can generate parsed token 1 after it saw the first two characters, no matter what follows further in the input stream. Giving longer input to the parser will _progressively_ cause longer output. Alas, there are _discontinuous_ parsers, such XML DOM parsers. Let's consider the input <root><a>text</a><b/></root> Until a pure parser sees </root> (that is, the entire input document) it produces nothing at all (except errors that the XML document is ill-formed). Indeed, ill-formed documents have no Infoset (no `meaning') and hence DOM parsers must not return any parsed data for incomplete documents. As in calculus, differentiating such discontinuous functions is not helpful. Now, if the parser is monadic then effects and their sequence become observable too. We can truly observe the sequence of executed semantic actions. These actions are executed incrementally, as the parser recognized a part of input to which a semantic action applies. Delimited continuations, as a `scriptable debugger', let us observe the execution of the parser step by step. One way to differentiate is to embed `breakpoints' in the input stream, in the get-char-like routine the parser uses to obtain next piece of input. Such routines are typically under user control, who can embed breakpoints, aka shift. The MList data structure is a clever way to `hide' get-char routines out of sight. Another way to observe the sequence of actions is to embed breakpoints into semantic actions themselves. In many parsers, the grammar with semantic actions is provided by users and thus again under their control. This approach can `invert' a DOM XML parser into a streaming, pull-like SAX XML parser, provided the DOM parser exports suitable callbacks: http://okmij.org/ftp/Scheme/xml.html#SSAX-pull
So, that wasn't really that hard to hack up. However, I should mention that it wasn't trivial, either. When converting list functions to MList functions, you have to be very careful not to perform side effects twice.
Indeed: the major complexity of LogicT was making sure effects are not done twice.
... many recursive data structures can be expressed as the fixed-point of shape functors. The kicker is, you can get the monadic version for free.
Indeed, open recursion at the term level lets us transparently interpose, for example, memoization. The benefits of coding recursive data types with explicit open recursion have been described in Two-Level Types and Parameterized Modules Time Sheard and Emir Pasalic JFP v14 (5):547-587, Sept. 2004 http://web.cecs.pdx.edu/~sheard/papers/JfpPearl.ps In particular, please see the Section 8, History, near the end of the paper. Thank you indeed for packaging CC monad and LogicT!
participants (1)
-
oleg@pobox.com