Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann

11 May 2007 11 May '07

12:24 p.m.

I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected " runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed If it would have return type (String, a) it could return both a partial value of type 'a' and the error message as String. It would be even better if it has some handling for incorrect input texts, and returns ([String], a), where [String] is the type of a list of warnings and error messages and 'a' is the type of a total value of parse output. Is there some parser of this type? Unfortunately http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools does not compare the laziness of the mentioned parsers.

Show replies by date

Neil Mitchell

11 May 11 May

12:32 p.m.

Hi Depending on exactly what you want, TagSoup may be of interest to you. It is lazy, but it doesn't return a tree. It is very tollerant of errors, and will simply never "fail to parse" something. http://www-users.cs.york.ac.uk/~ndm/tagsoup/ Thanks Neil On 5/11/07, Henning Thielemann wrote:

...

I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "
If it would be lazy, it would return some HTML code before the error. HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely.

*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed

If it would have return type (String, a) it could return both a partial value of type 'a' and the error message as String. It would be even better if it has some handling for incorrect input texts, and returns ([String], a), where [String] is the type of a list of warnings and error messages and 'a' is the type of a total value of parse output.

Is there some parser of this type? Unfortunately http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools does not compare the laziness of the mentioned parsers. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Henning Thielemann

12:54 p.m.

On Fri, 11 May 2007, Neil Mitchell wrote:

...

Depending on exactly what you want, TagSoup may be of interest to you. It is lazy, but it doesn't return a tree. It is very tollerant of errors, and will simply never "fail to parse" something.

http://www-users.cs.york.ac.uk/~ndm/tagsoup/

That's an interesting option. It could be used as a lexer for a full-blown HTML parser. Sometimes I need the tree structure. But why does this simple piece of code needs -fglasgow-exts? Thanks for the package and the hint! Henning

Neil Mitchell

12:58 p.m.

...

That's an interesting option. It could be used as a lexer for a full-blown HTML parser. Sometimes I need the tree structure. But why does this simple piece of code needs -fglasgow-exts?

It doesn't. The released version 0.1 doesn't require extensions, and the next 0.2 won't either. In the meantime I accepted a patch from a user that added a new feature and required the flag. I'm going to rework it shortly, make a few tweaks, and remove that flag. Thanks Neil

Jules Bean

12:52 p.m.

New subject: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann wrote:

...

I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "
If it would be lazy, it would return some HTML code before the error.

Are you sure that it is the parser, that is not lazy, and it isn't that the pretty printer is overly strict? From the evidence above the parser could be returning some results before the error, and the pretty printer strictly slurping it all up to the error and then dying. Jules

Henning Thielemann

1:06 p.m.

On Fri, 11 May 2007, Jules Bean wrote:

...

Henning Thielemann wrote:

...
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "
If it would be lazy, it would return some HTML code before the error.

Are you sure that it is the parser, that is not lazy, and it isn't that the pretty printer is overly strict?

From the evidence above the parser could be returning some results before the error, and the pretty printer strictly slurping it all up to the error and then dying.

I know, but the type of the Polyparse parser prohibits lazy parsing. Unfortunately there is no Show instance for HaXML trees, so one cannot easily see whether laziness gets lost in the parser or in the pretty printer.

Malcolm Wallace

2:57 p.m.

Henning Thielemann wrote:

...

HaXml has a so called lazy parser, but it is not what I consider lazy:

Lazy parsing is rather subtle, and it is easy to write a too-strict parser when one intended to be more lazy. Equally, it can be easy to imagine that the parser is too strict, when in fact it is the usage context that is wrong. You have indeed found some bugs in HaXml's lazy HTML parser, but you have also partly misunderstood what lazy parsing means.

...

Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined

The problem here is not that the parser is too strict, but that the pretty-printer is. The pretty-printer is demanding an undefined portion of the value before it produces any output.

...

If it would be lazy, it would return some HTML code before the error.

It can do, but only if you consume the part without errors first! For instance, it would be safe to extract just the <head> tag, because that is complete: import Text.XML.HaXml import Text.XML.HaXml.Posn import Text.XML.HaXml.Html.ParseLazy as Lazy import Text.XML.HaXml.Pretty as PP import Text.PrettyPrint.HughesPJ main = putStrLn $ render $ fsep PP.content $ -- the following line extracts just the first child tag of <html> (\(Document _ _ e _)-> (position 0 children) (CElem e nopos)) $ Lazy.htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" Unfortunately, this program currently does throw an "undefined" exception, even though it should not. The lazy HTML parser contains a couple of tricky corners that _probably_ stop it from being lazy. (1) The element parser does not immediately return an element after seeing its start tag, because it also has to return a stack of improperly terminated elements inside this one (so they can be repaired). (2) After parsing, we simplify the tree structure, which of course traverses it, and may again force too much evaluation. In any case, I will need to investigate further, and hopefully soon push a patch to fix the problem.

...

HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely.

Not true. PolyLazy.runParser has the signature runParser :: Parser t a -> [t] -> (a, [t]) that is, it returns the partially parsed value (which may contain bottoms), and the remaining unparsed token-stream. (Examining either of these return values may cause sufficient evaluation to be forced to lead to a runtime exception.) There is no 'Either' type at the user level. (Although an Either is used internally, see below, it does not do what you think).

...

*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed

This output is exactly correct. You asked for the first four characters provided that they were alphabetic, but in fact only the first three were alphabetic. Hence, 'satisfy' failed and threw an exception. If you ask for only the first three characters, then the parse succeeds:

...

fst $ runParser (exactly 3 (satisfy Char.isAlpha)) ("abc104"++undefined) "abc"

The purpose of the internal Either type that you mentioned, is to permit backtracking within the parse, not to force complete evaluation. Thus, you can equally ask for the first four characters provided they are alphanumeric, where alphanumeric is decided by a combination of alternate parsers:

...

fst $ runParser (exactly 4 (satisfy Char.isAlpha `onFail` satisfy Char.isDigit)) ("abc104"++undefined) "abc1"

This example illustrates that a parse failure is still recoverable when parsing lazily (but only by another parser, not once the failure has escaped the parsing world). Regards, Malcolm

Henning Thielemann

14 May 14 May

1:36 p.m.

On Fri, 11 May 2007, Malcolm Wallace wrote:

...

...
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed

This output is exactly correct. You asked for the first four characters provided that they were alphabetic, but in fact only the first three were alphabetic. Hence, 'satisfy' failed and threw an exception. If you ask for only the first three characters, then the parse succeeds:

The problem is obviously that a later wrong character can make the whole parse fail. Thus successful generated data is not returned until the whole input is parsed and checked. How can I suppress checking the whole input? How can I tell the parser that everything it parsed so far will not be invalidated by further input? How can I rewrite the above example that it returns ("abc*** Exception: Parse.satisfy: failed ? I wondered whether 'commit' helps, but it didn't. (I thought it would convert a global 'fail' to a local 'error'.) *Text.ParserCombinators.PolyLazy> runParser (exactly 4 (commit (satisfy Char.isAlpha))) ("abc104"++undefined) *** Exception: Parse.satisfy: failed

Malcolm Wallace

2:18 p.m.

Henning Thielemann wrote:

...

...
...
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed

How can I rewrite the above example that it returns ("abc*** Exception: Parse.satisfy: failed

The problem in your example is that the 'exactly' combinator forces parsing of 'n' items before returning any of them. If you roll your own, then you can return partial results: > let one = return (:) `apply` satisfy (Char.isAlpha) in runParser (one `apply` (one `apply` (one `apply` (one `apply` return [])))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed Equivalently: > let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f in runParser (one (one (one (one (return []))))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed Perhaps I should just rewrite the 'exactly' combinator to have the behaviour you desire? Its current definition is: exactly 0 p = return [] exactly n p = do x <- p xs <- exactly (n-1) p return (x:xs) and a lazier definition would be: exactly 0 p = return [] exactly n p = return (:) `apply` p `apply` exactly (n-1) p

...

How can I tell the parser that everything it parsed so far will not be invalidated by further input?

Essentially, you need to return a constructor as soon as you know that the initial portion of parsed data is correct. Often the only sensible way to do that is to use the 'apply' combinator (as shown in the examples above), returning a constructor _function_ which is lazily applied to the remainder of the parsing task.

...

I wondered whether 'commit' helps, but it didn't. (I thought it would convert a global 'fail' to a local 'error'.)

The 'commit' combinator is intended for early abortion of a parse attempt that it is known can no longer succeed. That's the opposite of what you want. By contrast, the 'apply' combinator causes a parse attempt to succeed early, even though it may turn out to fail later. Regards, Malcolm

Henning Thielemann

2:29 p.m.

On Mon, 14 May 2007, Malcolm Wallace wrote:

...

Perhaps I should just rewrite the 'exactly' combinator to have the behaviour you desire? Its current definition is:

exactly 0 p = return [] exactly n p = do x <- p xs <- exactly (n-1) p return (x:xs)

Is there a difference between 'exactly' and 'replicateM' ?

Malcolm Wallace

3:36 p.m.

Henning Thielemann wrote:

...

...
exactly 0 p = return [] exactly n p = do x <- p xs <- exactly (n-1) p return (x:xs)

Is there a difference between 'exactly' and 'replicateM' ?

With this definition, clearly not. But when rewritten to use lazy application, there is certainly a pragmatic difference in where the bottoms (if any) are located in the result. Regards, Malcolm

Henning Thielemann

2:41 p.m.

On Mon, 14 May 2007, Malcolm Wallace wrote:

...

Essentially, you need to return a constructor as soon as you know that the initial portion of parsed data is correct. Often the only sensible way to do that is to use the 'apply' combinator (as shown in the examples above), returning a constructor _function_ which is lazily applied to the remainder of the parsing task.

Great, 'apply' is the solution! I admit that I couldn't derive its power from its documentation which simply states "Apply a parsed function to a parsed value." :-)

Henning Thielemann

21 May 21 May

1:09 p.m.

On Mon, 14 May 2007, Malcolm Wallace wrote:

...

Henning Thielemann wrote:

...
...
...
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed

How can I rewrite the above example that it returns ("abc*** Exception: Parse.satisfy: failed

The problem in your example is that the 'exactly' combinator forces parsing of 'n' items before returning any of them. If you roll your own, then you can return partial results:

> let one = return (:) `apply` satisfy (Char.isAlpha) in runParser (one `apply` (one `apply` (one `apply` (one `apply` return [])))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed

Equivalently:

> let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f in runParser (one (one (one (one (return []))))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed

I wonder whether 'apply' merges two separate ideas: Applying a generated function to some parser generated value and forcing some parser to always succeed. From the documentation of 'apply' I assumed that 'apply f x' fails if 'f' or 'x' fails. In contrast to that it seems to succeed if only 'f' succeeds. Wouldn't it be better to have an explicit 'force' which declares a parser to never fail - and to return 'undefined' if this assumption is wrong. I have seen this 'force' in the MIDI loader of Haskore: http://darcs.haskell.org/haskore/src/Haskore/General/Parser.hs It would hold: apply f x == do g <- f; fmap g (force x)

apfelmus

12 May 12 May

10:49 a.m.

Henning Thielemann wrote:

...

I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: [...]

Note that lazy parsing is inherently difficult and most often more or less impossible. The fundamental problem is that one cannot really return a result before deciding whether the given input is syntactically correct or not. In other words, take the malformed HTML <html> <head></head> <body> <h1>Hello Lazy Parser </body> </html> What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>. Furthermore, the fact that data appears in sequence disturbs on-demand processing very much. For instance, take the the following XML-Tree <Node> <Node> <Leaf>1</Leaf> <Leaf>2</Leaf> </Node> <Leaf>3</Leaf> </Node> modeled after the definition data Tree a = Leaf a | Node (Tree a) (Tree a) Assume that processing only needs the left branch of the tree. In this case, we do not need to parse (Leaf 3). But assume that we only need the right branch. Much to our horror, we have to completely parse the left branch (Node (Leaf 1) (Leaf 2)) as well, simply because it appears before the right branch! I think that at least the syntax-correctness problem can be solved by a two separate parse passes: first check that the input is syntactically correct, then lazily build the values. However, monadic parser combinators are unsuitable for that because they allow the syntax correctness to depend on the built values. You need f.i. applicative parser combinators if you want to derive both passes from one parser description. Regards, apfelmus

Novák Zoltán

1:10 p.m.

Hi,

...

What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>.

But to return the top most <html> you don't have to parse the data until the </html> tag, it is really enough to see it, of course you need to read the whole file for that, but the parsing can be lazy. Zoli

neez＠freemail.hu

2:07 p.m.

Hi,

...

Hi,

...
What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>.

But to return the top most <html> you don't have to parse the data until the </html> tag, it is really enough to see it, of course you need to read the whole file for that, but the parsing can be lazy.

I've just found somebody wrote an article on a similar idea. http://citeseer.ist.psu.edu/199634.html Real lazy evaluation fans should code html in a Breadth-First way, with forward pointers. :) Zoli

6624

Age (days ago)

6634

Last active (days ago)

List overview

Download

15 comments

7 participants

participants (7)

apfelmus
Henning Thielemann
Jules Bean
Malcolm Wallace
neez＠freemail.hu
Neil Mitchell
Novák Zoltán