Is XHT a good tool for parsing web pages?

Subject: Is XHT a good tool for parsing web pages?

I looked a little at XHT, and it seems very elegant for writing concise parser definitions, but I read that it fails if the XML isn't strict, and I know a lot of web pages don't use strict XHTML. So I wonder whether it is an appropriate tool for web pages.

On 27 April 2010 16:22, John Creighton wrote:

I looked a little at XHT, and it seems very elegant for writing concise parser definitions, but I read that it fails if the XML isn't strict, and I know a lot of web pages don't use strict XHTML. So I wonder whether it is an appropriate tool for web pages.
I don't know about XHT, but tagsoup [1] does a pretty good job of parsing web pages.
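For instance, a minimal sketch against tagsoup's Text.HTML.TagSoup interface; the HTML fragment is made up and deliberately not well formed:

    import Text.HTML.TagSoup

    -- Collect the href attribute of every opening <a> tag. tagsoup
    -- does not require well-formed input: the fragment below has an
    -- unquoted attribute value and unclosed tags.
    main :: IO ()
    main = do
      let html  = "<p><a href=http://haskell.org>Haskell</a>"
          hrefs = [ fromAttrib "href" t
                  | t@(TagOpen "a" _) <- parseTags html ]
      mapM_ putStrLn hrefs    -- prints http://haskell.org

Peter

[1] http://hackage.haskell.org/package/tagsoup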

Is XHT a good tool for parsing web pages? I read that it fails if the XML isn't strict, and I know a lot of web pages don't use strict XHTML.
Do you mean HXT rather than XHT?

I know that the HaXml library has a separate error-correcting HTML parser that works around most of the common non-well-formedness bugs in HTML: Text.XML.HaXml.Html.Parse

I believe HXT has a similar parser: Text.XML.HXT.Parser.HtmlParsec

Indeed, some of the similarities suggest this parser was originally lifted directly out of HaXml (as permitted by HaXml's licence), although the two modules have now diverged significantly.
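Roughly, HaXml's parser is used like this (a sketch; I'm assuming the htmlParse signature of current HaXml releases, where the first argument is a pseudo file name used only in error messages):

    import Text.XML.HaXml.Html.Parse (htmlParse)
    import Text.XML.HaXml.Posn (Posn)
    import Text.XML.HaXml.Types (Document)

    -- Parse possibly ill-formed HTML; common errors are worked
    -- around rather than reported where possible.
    parseLenientHtml :: String -> Document Posn
    parseLenientHtml = htmlParse "<input>"

Regards,
Malcolm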

Hi John and Malcolm,
I know that the HaXml library has a separate error-correcting HTML parser that works around most of the common non-well-formedness bugs in HTML: Text.XML.HaXml.Html.Parse
I believe HXT has a similar parser: Text.XML.HXT.Parser.HtmlParsec
Indeed, some of the similarities suggest this parser was originally lifted directly out of HaXml (as permitted by HaXml's licence), although the two modules have now diverged significantly.
The HTML parser in HXT is based on tagsoup. It's a lazy parser (it does not use parsec), and it tries to parse everything as HTML. But garbage in, garbage out: there is no attempt to repair illegal HTML as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

The table-driven approach for inserting missing closing tags is indeed taken from HaXml. Malcolm, I hope you don't have a patent on this algorithm.
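A minimal sketch of selecting that parser through the arrow interface (assuming the HXT 9 Text.XML.HXT.Core names, which differ in older releases; "page.html" is just a placeholder):

    import Text.XML.HXT.Core

    -- Read a page with the lenient, tagsoup-based HTML parser and
    -- list the link targets; warnings about bad markup are disabled.
    main :: IO ()
    main = do
      hrefs <- runX $ readDocument [withParseHTML yes, withWarnings no]
                                   "page.html"
                      >>> deep (isElem >>> hasName "a")
                      >>> getAttrValue "href"
      mapM_ putStrLn hrefs

Regards,
Uwe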

Uwe Schmidt writes:

The HTML parser in HXT is based on tagsoup. It's a lazy parser (it does not use parsec), and it tries to parse everything as HTML. But garbage in, garbage out: there is no attempt to repair illegal HTML as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT then?

--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

Hi Ivan,
Uwe Schmidt writes:

The HTML parser in HXT is based on tagsoup. It's a lazy parser (it does not use parsec), and it tries to parse everything as HTML. But garbage in, garbage out: there is no attempt to repair illegal HTML as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT then?
For the XML parser. This XML parser also deals with DTDs. It accepts only well-formed XML; everything else gives an error (not just a warning, as with the HTML parser). tagsoup and the HTML parser do not deal with DTDs, so they can't be used for a full (validating) XML parser.
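To make the contrast concrete, a rough sketch (again assuming the HXT 9 Text.XML.HXT.Core names; getErrStatus reads the status recorded at the document root):

    import Text.XML.HXT.Core

    main :: IO ()
    main = do
      -- Strict, parsec-based XML parser: the unclosed <p> is a
      -- well-formedness error, reflected in the document status.
      [xmlRc] <- runX $ readString [] "<p>unclosed" >>> getErrStatus
      -- Lenient, tagsoup-based HTML parser: same input, no error.
      [htmRc] <- runX $ readString [withParseHTML yes, withWarnings no]
                                   "<p>unclosed"
                        >>> getErrStatus
      print (xmlRc, htmRc)    -- the XML status should be >= c_err

Regards,
Uwe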
participants (5)

- Ivan Lazar Miljenovic
- John Creighton
- Malcolm Wallace
- Peter Robinson
- Uwe Schmidt