Lazy HTML parsing with HXT, HaXML/polyparse, what else?

I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "" but found & at file text at line 1 col 26 If it would be lazy, it would return some HTML code before the error. HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely. *Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed If it would have return type (String, a) it could return both a partial value of type 'a' and the error message as String. It would be even better if it has some handling for incorrect input texts, and returns ([String], a), where [String] is the type of a list of warnings and error messages and 'a' is the type of a total value of parse output. Is there some parser of this type? Unfortunately http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools does not compare the laziness of the mentioned parsers.

Hi
Depending on exactly what you want, TagSoup may be of interest to you.
It is lazy, but it doesn't return a tree. It is very tollerant of
errors, and will simply never "fail to parse" something.
http://www-users.cs.york.ac.uk/~ndm/tagsoup/
Thanks
Neil
On 5/11/07, Henning Thielemann
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy:
*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "" but found & at file text at line 1 col 26
If it would be lazy, it would return some HTML code before the error. HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely.
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed
If it would have return type (String, a) it could return both a partial value of type 'a' and the error message as String. It would be even better if it has some handling for incorrect input texts, and returns ([String], a), where [String] is the type of a list of warnings and error messages and 'a' is the type of a total value of parse output.
Is there some parser of this type? Unfortunately http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools does not compare the laziness of the mentioned parsers. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Fri, 11 May 2007, Neil Mitchell wrote:
Depending on exactly what you want, TagSoup may be of interest to you. It is lazy, but it doesn't return a tree. It is very tollerant of errors, and will simply never "fail to parse" something.
That's an interesting option. It could be used as a lexer for a full-blown HTML parser. Sometimes I need the tree structure. But why does this simple piece of code needs -fglasgow-exts? Thanks for the package and the hint! Henning

Hi
That's an interesting option. It could be used as a lexer for a full-blown HTML parser. Sometimes I need the tree structure. But why does this simple piece of code needs -fglasgow-exts?
It doesn't. The released version 0.1 doesn't require extensions, and the next 0.2 won't either. In the meantime I accepted a patch from a user that added a new feature and required the flag. I'm going to rework it shortly, make a few tweaks, and remove that flag. Thanks Neil

Henning Thielemann wrote:
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy:
*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "" but found & at file text at line 1 col 26
If it would be lazy, it would return some HTML code before the error.
Are you sure that it is the parser, that is not lazy, and it isn't that the pretty printer is overly strict? From the evidence above the parser could be returning some results before the error, and the pretty printer strictly slurping it all up to the error and then dying. Jules

On Fri, 11 May 2007, Jules Bean wrote:
Henning Thielemann wrote:
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy:
*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>" *** Exception: Expected "" but found & at file text at line 1 col 26
If it would be lazy, it would return some HTML code before the error.
Are you sure that it is the parser, that is not lazy, and it isn't that the pretty printer is overly strict?
From the evidence above the parser could be returning some results before the error, and the pretty printer strictly slurping it all up to the error and then dying.
I know, but the type of the Polyparse parser prohibits lazy parsing. Unfortunately there is no Show instance for HaXML trees, so one cannot easily see whether laziness gets lost in the parser or in the pretty printer.

Henning Thielemann
HaXml has a so called lazy parser, but it is not what I consider lazy:
Lazy parsing is rather subtle, and it is easy to write a too-strict parser when one intended to be more lazy. Equally, it can be easy to imagine that the parser is too strict, when in fact it is the usage context that is wrong. You have indeed found some bugs in HaXml's lazy HTML parser, but you have also partly misunderstood what lazy parsing means.
Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" *** Exception: Prelude.undefined
The problem here is not that the parser is too strict, but that the pretty-printer is. The pretty-printer is demanding an undefined portion of the value before it produces any output.
If it would be lazy, it would return some HTML code before the error.
It can do, but only if you consume the part without errors first! For instance, it would be safe to extract just the <head> tag, because that is complete: import Text.XML.HaXml import Text.XML.HaXml.Posn import Text.XML.HaXml.Html.ParseLazy as Lazy import Text.XML.HaXml.Pretty as PP import Text.PrettyPrint.HughesPJ main = putStrLn $ render $ fsep PP.content $ -- the following line extracts just the first child tag of <html> (\(Document _ _ e _)-> (position 0 children) (CElem e nopos)) $ Lazy.htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>" Unfortunately, this program currently does throw an "undefined" exception, even though it should not. The lazy HTML parser contains a couple of tricky corners that _probably_ stop it from being lazy. (1) The element parser does not immediately return an element after seeing its start tag, because it also has to return a stack of improperly terminated elements inside this one (so they can be repaired). (2) After parsing, we simplify the tree structure, which of course traverses it, and may again force too much evaluation. In any case, I will need to investigate further, and hopefully soon push a patch to fix the problem.
HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely.
Not true. PolyLazy.runParser has the signature runParser :: Parser t a -> [t] -> (a, [t]) that is, it returns the partially parsed value (which may contain bottoms), and the remaining unparsed token-stream. (Examining either of these return values may cause sufficient evaluation to be forced to lead to a runtime exception.) There is no 'Either' type at the user level. (Although an Either is used internally, see below, it does not do what you think).
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed
This output is exactly correct. You asked for the first four characters provided that they were alphabetic, but in fact only the first three were alphabetic. Hence, 'satisfy' failed and threw an exception. If you ask for only the first three characters, then the parse succeeds:
fst $ runParser (exactly 3 (satisfy Char.isAlpha)) ("abc104"++undefined) "abc"
The purpose of the internal Either type that you mentioned, is to permit backtracking within the parse, not to force complete evaluation. Thus, you can equally ask for the first four characters provided they are alphanumeric, where alphanumeric is decided by a combination of alternate parsers:
fst $ runParser (exactly 4 (satisfy Char.isAlpha `onFail` satisfy Char.isDigit)) ("abc104"++undefined) "abc1"
This example illustrates that a parse failure is still recoverable when parsing lazily (but only by another parser, not once the failure has escaped the parsing world). Regards, Malcolm

On Fri, 11 May 2007, Malcolm Wallace wrote:
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed
This output is exactly correct. You asked for the first four characters provided that they were alphabetic, but in fact only the first three were alphabetic. Hence, 'satisfy' failed and threw an exception. If you ask for only the first three characters, then the parse succeeds:
The problem is obviously that a later wrong character can make the whole parse fail. Thus successful generated data is not returned until the whole input is parsed and checked. How can I suppress checking the whole input? How can I tell the parser that everything it parsed so far will not be invalidated by further input? How can I rewrite the above example that it returns ("abc*** Exception: Parse.satisfy: failed ? I wondered whether 'commit' helps, but it didn't. (I thought it would convert a global 'fail' to a local 'error'.) *Text.ParserCombinators.PolyLazy> runParser (exactly 4 (commit (satisfy Char.isAlpha))) ("abc104"++undefined) *** Exception: Parse.satisfy: failed

Henning Thielemann
*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed
How can I rewrite the above example that it returns ("abc*** Exception: Parse.satisfy: failed
The problem in your example is that the 'exactly' combinator forces parsing of 'n' items before returning any of them. If you roll your own, then you can return partial results: > let one = return (:) `apply` satisfy (Char.isAlpha) in runParser (one `apply` (one `apply` (one `apply` (one `apply` return [])))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed Equivalently: > let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f in runParser (one (one (one (one (return []))))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed Perhaps I should just rewrite the 'exactly' combinator to have the behaviour you desire? Its current definition is: exactly 0 p = return [] exactly n p = do x <- p xs <- exactly (n-1) p return (x:xs) and a lazier definition would be: exactly 0 p = return [] exactly n p = return (:) `apply` p `apply` exactly (n-1) p
How can I tell the parser that everything it parsed so far will not be invalidated by further input?
Essentially, you need to return a constructor as soon as you know that the initial portion of parsed data is correct. Often the only sensible way to do that is to use the 'apply' combinator (as shown in the examples above), returning a constructor _function_ which is lazily applied to the remainder of the parsing task.
I wondered whether 'commit' helps, but it didn't. (I thought it would convert a global 'fail' to a local 'error'.)
The 'commit' combinator is intended for early abortion of a parse attempt that it is known can no longer succeed. That's the opposite of what you want. By contrast, the 'apply' combinator causes a parse attempt to succeed early, even though it may turn out to fail later. Regards, Malcolm

On Mon, 14 May 2007, Malcolm Wallace wrote:
Perhaps I should just rewrite the 'exactly' combinator to have the behaviour you desire? Its current definition is:
exactly 0 p = return [] exactly n p = do x <- p xs <- exactly (n-1) p return (x:xs)
Is there a difference between 'exactly' and 'replicateM' ?

Henning Thielemann
exactly 0 p = return [] exactly n p = do x <- p xs <- exactly (n-1) p return (x:xs)
Is there a difference between 'exactly' and 'replicateM' ?
With this definition, clearly not. But when rewritten to use lazy application, there is certainly a pragmatic difference in where the bottoms (if any) are located in the result. Regards, Malcolm

On Mon, 14 May 2007, Malcolm Wallace wrote:
Essentially, you need to return a constructor as soon as you know that the initial portion of parsed data is correct. Often the only sensible way to do that is to use the 'apply' combinator (as shown in the examples above), returning a constructor _function_ which is lazily applied to the remainder of the parsing task.
Great, 'apply' is the solution! I admit that I couldn't derive its power from its documentation which simply states "Apply a parsed function to a parsed value." :-)

On Mon, 14 May 2007, Malcolm Wallace wrote:
Henning Thielemann
wrote: *Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined) ("*** Exception: Parse.satisfy: failed
How can I rewrite the above example that it returns ("abc*** Exception: Parse.satisfy: failed
The problem in your example is that the 'exactly' combinator forces parsing of 'n' items before returning any of them. If you roll your own, then you can return partial results:
> let one = return (:) `apply` satisfy (Char.isAlpha) in runParser (one `apply` (one `apply` (one `apply` (one `apply` return [])))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed
Equivalently:
> let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f in runParser (one (one (one (one (return []))))) ("abc104"++undefined) ("abc*** Exception: Parse.satisfy: failed
I wonder whether 'apply' merges two separate ideas: Applying a generated function to some parser generated value and forcing some parser to always succeed. From the documentation of 'apply' I assumed that 'apply f x' fails if 'f' or 'x' fails. In contrast to that it seems to succeed if only 'f' succeeds. Wouldn't it be better to have an explicit 'force' which declares a parser to never fail - and to return 'undefined' if this assumption is wrong. I have seen this 'force' in the MIDI loader of Haskore: http://darcs.haskell.org/haskore/src/Haskore/General/Parser.hs It would hold: apply f x == do g <- f; fmap g (force x)

Henning Thielemann wrote:
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: [...]
Note that lazy parsing is inherently difficult and most often more or less impossible. The fundamental problem is that one cannot really return a result before deciding whether the given input is syntactically correct or not. In other words, take the malformed HTML <html> <head></head> <body> <h1>Hello Lazy Parser </body> </html> What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>. Furthermore, the fact that data appears in sequence disturbs on-demand processing very much. For instance, take the the following XML-Tree <Node> <Node> <Leaf>1</Leaf> <Leaf>2</Leaf> </Node> <Leaf>3</Leaf> </Node> modeled after the definition data Tree a = Leaf a | Node (Tree a) (Tree a) Assume that processing only needs the left branch of the tree. In this case, we do not need to parse (Leaf 3). But assume that we only need the right branch. Much to our horror, we have to completely parse the left branch (Node (Leaf 1) (Leaf 2)) as well, simply because it appears before the right branch! I think that at least the syntax-correctness problem can be solved by a two separate parse passes: first check that the input is syntactically correct, then lazily build the values. However, monadic parser combinators are unsuitable for that because they allow the syntax correctness to depend on the built values. You need f.i. applicative parser combinators if you want to derive both passes from one parser description. Regards, apfelmus

Hi,
What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>.
But to return the top most <html> you don't have to parse the data until the </html> tag, it is really enough to see it, of course you need to read the whole file for that, but the parsing can be lazy. Zoli

Hi,
Hi,
What results should a lazy parser return before emitting ⊥? At the time you read the <html>-tag, you cannot know whether a syntax error far down in the file makes it invalid. Thus, you may not return the top-most <html>-tag until you see the closing </html>.
But to return the top most <html> you don't have to parse the data until the </html> tag, it is really enough to see it, of course you need to read the whole file for that, but the parsing can be lazy.
I've just found somebody wrote an article on a similar idea. http://citeseer.ist.psu.edu/199634.html Real lazy evaluation fans should code html in a Breadth-First way, with forward pointers. :) Zoli
participants (7)
-
apfelmus
-
Henning Thielemann
-
Jules Bean
-
Malcolm Wallace
-
neez@freemail.hu
-
Neil Mitchell
-
Novák Zoltán