HTML library with DOM?

Hi all, is there an HTML parsing library that creates a DOM from a page? Günther

is there an HTML parsing library that creates a DOM from a page?
tagsoup produces trees ( http://hackage.haskell.org/package/tagsoup ). I use it with hxt ( http://hackage.haskell.org/package/hxt ) to tree-walk HTML pages. J.W.

Günther Schmidt
Hi all,
is there an HTML parsing library that creates a DOM from a page?
I've got the month of October off, and one of the things I've been
planning on working on is a compliant HTML5 parser for Haskell --
something which is sorely needed! I will ping the list back if/when I
get it finished.
G
--
Gregory Collins

Excerpts from Gregory Collins's message of Wed Oct 06 19:44:44 -0400 2010:
I've got the month of October off, and one of the things I've been planning on working on is a compliant HTML5 parser for Haskell -- something which is sorely needed! I will ping the list back if/when I get it finished.
I've heard that some of the existing HTML parsers in Haskell were already HTML5 compliant (this topic came up when I was complaining that there were some algorithms that you absolutely had to have state for, because that was how they were specified). I never verified this assertion, though. Edward

"Edward Z. Yang"
I've heard that some of the existing HTML parsers in Haskell were already HTML5 compliant (this topic came up when I was complaining that there were some algorithms that you absolutely had to have state for, because that was how they were specified). I never verified this assertion, though.
If there's already a library which *correctly* parses html5 documents
into DOM trees, could someone please let me know so I can use it instead
of wasting a bunch of time writing one?
Thanks,
G
--
Gregory Collins

2010/10/7 Gregory Collins
If there's already a library which *correctly* parses html5 documents into DOM trees, could someone please let me know so I can use it instead of wasting a bunch of time writing one?
As far as I know, Neil Mitchell's tagsoup[1] parses according to the HTML 5 parsing rules, but it just generates a list of Tags[2], so you'd have to build the DOM tree up from there. I personally have had great experience with tagsoup. It's even the core of HTML-scraping technology powering searchonce[3].
Michael
[1] http://hackage.haskell.org/package/tagsoup
[2] http://hackage.haskell.org/packages/archive/tagsoup/0.11.1/doc/html/Text-HTM...
[3] http://www.search-once.com/
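To illustrate the step of building a tree up from a flat list of tags, here is a minimal sketch in plain Haskell. It does not use tagsoup itself: the Tag type below is a simplified, hypothetical stand-in for tagsoup's, and Node, voidElements, siblings and buildForest are names invented for this example. It is a naive nesting pass, not the HTML5 tree-construction algorithm, so misnested or implicitly closed tags on real pages will need more care.

```haskell
-- Simplified, hypothetical stand-in for a tag-list parser's output.
-- (Assumption: this is NOT tagsoup's actual Tag type.)
data Tag
  = TagOpen String [(String, String)]  -- element name and attributes
  | TagClose String                    -- element name
  | TagText String
  deriving (Eq, Show)

-- A very small DOM-like tree (hypothetical type for this sketch).
data Node
  = Element String [(String, String)] [Node]
  | Text String
  deriving (Eq, Show)

-- A few elements that never have children or a closing tag.
voidElements :: [String]
voidElements = ["br", "hr", "img", "input", "meta", "link"]

-- Parse sibling nodes until a closing tag or the end of input.
-- Returns the nodes plus the unconsumed tags; a non-empty remainder
-- always starts with a TagClose.
siblings :: [Tag] -> ([Node], [Tag])
siblings [] = ([], [])
siblings ts@(TagClose _ : _) = ([], ts)
siblings (TagText s : rest) =
  let (ns, rest') = siblings rest
  in (Text s : ns, rest')
siblings (TagOpen name attrs : rest)
  | name `elem` voidElements =
      let (ns, rest') = siblings rest
      in (Element name attrs [] : ns, rest')
  | otherwise =
      let (children, afterChildren) = siblings rest
          -- Consume the closing tag only if it matches this element;
          -- otherwise treat the element as implicitly closed.
          afterClose = case afterChildren of
            (TagClose n : more) | n == name -> more
            other                           -> other
          (ns, rest') = siblings afterClose
      in (Element name attrs children : ns, rest')

-- Build a forest from a flat tag list, dropping stray closing tags
-- that appear at the top level.
buildForest :: [Tag] -> [Node]
buildForest ts = case siblings ts of
  (ns, [])        -> ns
  (ns, _ : rest') -> ns ++ buildForest rest'
```

Loading this in GHCi and evaluating buildForest on a small list such as [TagOpen "p" [], TagText "hi", TagClose "p"] should give a single Element node with one Text child. tagsoup also ships a Text.HTML.TagSoup.Tree module, which may be worth checking before writing something like this by hand.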

Michael Snoyman
As far as I know, Neil Mitchell's tagsoup[1] parses according to the HTML 5 parsing rules, but it just generates a list of Tags[2], so you'd have to build the DOM tree up from there. I personally have had great experience with tagsoup. It's even the core of HTML-scraping technology powering searchonce[3].
Yep, someone else wrote me privately to say this (that tagsoup respects
the html5 lexing rules). So I'll be using this as the basis of an html5
DOM parser. Stay tuned!
G
--
Gregory Collins

Yes, I don't think I've officially announced a version of TagSoup with
HTML 5 parsing, but it has been standard for the last few
releases. The HTML 5 spec is still changing, so it's entirely possible
something is incorrect in a corner case; if so, please let me know and
I'll fix it.
Thanks, Neil
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (6)
- Edward Z. Yang
- Gregory Collins
- Günther Schmidt
- Johannes Waldmann
- Michael Snoyman
- Neil Mitchell