Is XHT a good tool for parsing web pages?

Subject: Is XHT a good tool for parsing web pages?

I looked a little at XHT, and it seems very elegant for writing concise parser definitions, but I read that it fails if the XML isn't strict, and I know a lot of web pages don't use strict XHTML. So I wonder whether it is an appropriate tool for web pages.

On 27 April 2010 16:22, John Creighton wrote:

I looked a little at XHT, and it seems very elegant for writing concise parser definitions, but I read that it fails if the XML isn't strict, and I know a lot of web pages don't use strict XHTML. So I wonder whether it is an appropriate tool for web pages.
I don't know about XHT, but tagsoup [1] does a pretty good job of parsing web pages.
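For instance, a minimal sketch against tagsoup's Text.HTML.TagSoup interface; the HTML fragment is made up and deliberately not well formed:

    import Text.HTML.TagSoup

    -- Collect the href attribute of every opening <a> tag. tagsoup
    -- does not require well-formed input: the fragment below has an
    -- unquoted attribute value and unclosed tags.
    main :: IO ()
    main = do
      let html  = "<p><a href=http://haskell.org>Haskell</a>"
          hrefs = [ fromAttrib "href" t
                  | t@(TagOpen "a" _) <- parseTags html ]
      mapM_ putStrLn hrefs    -- prints http://haskell.org

Peter

[1] http://hackage.haskell.org/package/tagsoup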

Is XHT a good tool for parsing web pages? I read that it fails if the XML isn't strict, and I know a lot of web pages don't use strict XHTML.
Do you mean HXT rather than XHT?

I know that the HaXml library has a separate error-correcting HTML parser that works around most of the common non-well-formedness bugs in HTML: Text.XML.HaXml.Html.Parse

I believe HXT has a similar parser: Text.XML.HXT.Parser.HtmlParsec

Indeed, some of the similarities suggest this parser was originally lifted directly out of HaXml (as permitted by HaXml's licence), although the two modules have now diverged significantly.
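Roughly, HaXml's parser is used like this (a sketch; I'm assuming the htmlParse signature of current HaXml releases, where the first argument is a pseudo file name used only in error messages):

    import Text.XML.HaXml.Html.Parse (htmlParse)
    import Text.XML.HaXml.Posn (Posn)
    import Text.XML.HaXml.Types (Document)

    -- Parse possibly ill-formed HTML; common errors are worked
    -- around rather than reported where possible.
    parseLenientHtml :: String -> Document Posn
    parseLenientHtml = htmlParse "<input>"

Regards,
Malcolm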

Hi John and Malcolm,
I know that the HaXml library has a separate error-correcting HTML parser that works around most of the common non-well-formedness bugs in HTML: Text.XML.HaXml.Html.Parse
I believe HXT has a similar parser: Text.XML.HXT.Parser.HtmlParsec
Indeed, some of the similarities suggest this parser was originally lifted directly out of HaXml (as permitted by HaXml's licence), although the two modules have now diverged significantly.
The HTML parser in HXT is based on tagsoup. It's a lazy parser (it does not use parsec), and it tries to parse everything as HTML. But garbage in, garbage out: there is no attempt to repair illegal HTML as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

The table-driven approach for inserting missing closing tags is indeed taken from HaXml. Malcolm, I hope you don't have a patent on this algorithm.
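A minimal sketch of selecting that parser through the arrow interface (assuming the HXT 9 Text.XML.HXT.Core names, which differ in older releases; "page.html" is just a placeholder):

    import Text.XML.HXT.Core

    -- Read a page with the lenient, tagsoup-based HTML parser and
    -- list the link targets; warnings about bad markup are disabled.
    main :: IO ()
    main = do
      hrefs <- runX $ readDocument [withParseHTML yes, withWarnings no]
                                   "page.html"
                      >>> deep (isElem >>> hasName "a")
                      >>> getAttrValue "href"
      mapM_ putStrLn hrefs

Regards,
Uwe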

Uwe Schmidt writes:

The HTML parser in HXT is based on tagsoup. It's a lazy parser (it does not use parsec), and it tries to parse everything as HTML. But garbage in, garbage out: there is no attempt to repair illegal HTML as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT then?

--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

Hi Ivan,
Uwe Schmidt writes:

The HTML parser in HXT is based on tagsoup. It's a lazy parser (it does not use parsec), and it tries to parse everything as HTML. But garbage in, garbage out: there is no attempt to repair illegal HTML as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT then?
For the XML parser. This XML parser also deals with DTDs. It accepts only well-formed XML; everything else gives an error (not just a warning, as with the HTML parser). tagsoup and the HTML parser do not deal with DTDs, so they can't be used for a full (validating) XML parser.
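To make the contrast concrete, a rough sketch (again assuming the HXT 9 Text.XML.HXT.Core names; getErrStatus reads the status recorded at the document root):

    import Text.XML.HXT.Core

    main :: IO ()
    main = do
      -- Strict, parsec-based XML parser: the unclosed <p> is a
      -- well-formedness error, reflected in the document status.
      [xmlRc] <- runX $ readString [] "<p>unclosed" >>> getErrStatus
      -- Lenient, tagsoup-based HTML parser: same input, no error.
      [htmRc] <- runX $ readString [withParseHTML yes, withWarnings no]
                                   "<p>unclosed"
                        >>> getErrStatus
      print (xmlRc, htmRc)    -- the XML status should be >= c_err

Regards,
Uwe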
participants (5)

- Ivan Lazar Miljenovic
- John Creighton
- Malcolm Wallace
- Peter Robinson
- Uwe Schmidt