
Henning Thielemann wrote:
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: [...]
Note that lazy parsing is inherently difficult and most often more or less impossible. The fundamental problem is that one cannot really return a result before deciding whether the given input is syntactically correct or not. In other words, take the malformed HTML <html> <head></head> <body> <h1>Hello Lazy Parser </body> </html> What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>. Furthermore, the fact that data appears in sequence disturbs on-demand processing very much. For instance, take the the following XML-Tree <Node> <Node> <Leaf>1</Leaf> <Leaf>2</Leaf> </Node> <Leaf>3</Leaf> </Node> modeled after the definition data Tree a = Leaf a | Node (Tree a) (Tree a) Assume that processing only needs the left branch of the tree. In this case, we do not need to parse (Leaf 3). But assume that we only need the right branch. Much to our horror, we have to completely parse the left branch (Node (Leaf 1) (Leaf 2)) as well, simply because it appears before the right branch! I think that at least the syntax-correctness problem can be solved by a two separate parse passes: first check that the input is syntactically correct, then lazily build the values. However, monadic parser combinators are unsuitable for that because they allow the syntax correctness to depend on the built values. You need f.i. applicative parser combinators if you want to derive both passes from one parser description. Regards, apfelmus