
I'm wondering if there are any libraries out there for creating parsers that lazily build up their result. I know I could thread the remaining input through a parser by hand, but it seems like someone should have already done it. I'd like to be able to turn a stream of XML into a lazy tree of tags (probably Maybe tags, or Either errors tags), but I don't think HaXml and the like do that sort of thing. Branodn Moore

Brandon Michael Moore wrote:
I'm wondering if there are any libraries out there for creating parsers that lazily build up their result. I know I could thread the remaining input through a parser by hand, but it seems like someone should have already done it.
This turns out to be rather difficult to do in the general case (but see below -- XML is a special case). If you have type Parser sym result = [sym] -> Maybe (result, [sym]) a Parser can't decide whether to return 'Just (result,rest)' or 'Nothing' until it has successfully parsed the complete result. So pattern matching on the parser's return value will force the entire production. Variations on the theme -- Either instead of Maybe, list-of-successes, continuation-passing combinators, etc -- all face a similar problem. However, if your top-level grammar is of the form: things :: empty | thing things {- == thing* -} then instead of: case runParser (pMany pThing) input of Just (result,[]) -> ... you can use something like unfoldr (runParser pThing) input to build the result list incrementally. This will be less eager; instead of parsing and returning an entire list of Things, it parses one Thing at a time. Another thing to watch out for is heap drag. The list-of-successes approach tends to retain the entire input, just in case the parser needs to backtrack. Parsec [1] and UU_Parsing [?] solve this by severely restricting the amount of required lookahead.
I'd like to be able to turn a stream of XML into a lazy tree of tags (probably Maybe tags, or Either errors tags), but I don't think HaXml and the like do that sort of thing.
That's exactly how HXML [2] works. The parser returns a lazy
list of tokens (analogous to SAX events), which are folded up
into a tree by a separate function. In addition it uses a CPS
parser library so (as with Parsec), there is minimal heap drag.
[1] Parsec:

Brandon Michael Moore wondered: | I'm wondering if there are any libraries out there | for creating parsers that lazily build up their | result. Joe English answered: | type Parser sym result = [sym] -> Maybe (result, [sym]) However, a parser for a particular type T might actually just return T directly, without any Maybe in the way, if T has a failure constructor built-in. Something like: data T = Plus T T | ... | Fail Error Now, a parser for T can already start building its result without having to wait for the whole thing to succeed. This gets a bit ugly, especially when dealing with many types that are being parsed, or with the extra failure constructor always being part of the datatype. One way of adding such a constructor to any type would be: data Result a = Val a | forall b c . Apply2 (b -> c -> Result a) (Result b) (Result c) | Fail Error This gives you a way of creating an 'a' not only lazily, but one can decide oneself what to do with failure and when to do it. Example: When parsing the following faulty expression "(1+2)+blurp" And expecting an Int as a result, the parser would build the following expression: Apply2 (\x y -> Val (x+y)) (Apply2 (\x y -> Val (x+y)) (Val 1) (Val 2)) (Fail "unrecognized `blurp'") None of this is type checked and surely contains bugs, but I hope you get the idea. The hope is that a value of type `Result a' gives you more information and better space behavior than just a value of type `Maybe a'. /Koen. -- Koen Claessen http://www.cs.chalmers.se/~koen Chalmers University, Gothenburg, Sweden.
participants (3)
-
Brandon Michael Moore
-
Joe English
-
Koen Claessen