[Haskell-cafe] Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

12 May 2007

      Henning Thielemann wrote:
...
I want to parse and process HTML lazily. I use HXT because the HTML parser
is very liberal. However it uses Parsec and is thus strict. HaXML has a
so called lazy parser, but it is not what I consider lazy:
[...]
Note that lazy parsing is inherently difficult and most often more or
less impossible. The fundamental problem is that one cannot really
return a result before deciding whether the given input is syntactically
correct or not. In other words, take the malformed HTML

   <html>
   <head></head>
     <body>
       <h1>Hello Lazy Parser
     </body>
   </html>

What results should a lazy parser return before emitting ⊥? At the time
you read the <html>-tag, you cannot know whether a syntax error far down
in the file makes it invalid. Thus, you may not return the top-most
<html>-tag until you see the closing </html>.

Furthermore, the fact that data appears in sequence disturbs on-demand
processing very much. For instance, take the the following XML-Tree

   <Node>
     <Node>
       <Leaf>1</Leaf>
       <Leaf>2</Leaf>
     </Node>
     <Leaf>3</Leaf>
   </Node>

modeled after the definition

   data Tree a = Leaf a | Node (Tree a) (Tree a)

Assume that processing only needs the left branch of the tree. In this
case, we do not need to parse (Leaf 3). But assume that we only need the
right branch. Much to our horror, we have to completely parse the left
branch (Node (Leaf 1) (Leaf 2)) as well, simply because it appears
before the right branch!

I think that at least the syntax-correctness problem can be solved by a
two separate parse passes: first check that the input is syntactically
correct, then lazily build the values. However, monadic parser
combinators are unsuitable for that because they allow the syntax
correctness to depend on the built values. You need f.i. applicative
parser combinators if you want to derive both passes from one parser
description.

Regards,
apfelmus