Re: [Haskell-cafe] Re: Re: hxt memory useage

29 Jan 2008

Hi Uwe,
...
BTW: I've taken the tagsoup lib and written
a small parser to build a tree out of the stream
of tags. It's about 100 lines of code.
This DOM parser does not need to read until
the closing tag to build an element node,
so it should be as lazy as possible.
A first version for HTML
already runs on my box,
but it still needs a bit of testing.
Please send a patch with whatever you come up with, so others can make use
of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs
version, which does as well as it can with tag matching, but is
entirely strict. Having a lazy version would be great.
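
To make "lazy" concrete, here is a rough sketch of the kind of tree builder
being discussed. It is not Uwe's parser and not the code in darcs; the Tag
and Tree types and the names forest and tagTree are made up for illustration
and are much simpler than the real tagsoup types. It assumes well-nested
input, does not check that a close tag's name matches the open tag, and stops
at a stray top-level close tag. The point is only that an element node is
produced as soon as its open tag is seen, before its closing tag has been read.

```haskell
-- Hypothetical, simplified tag type (not the real tagsoup Tag type).
data Tag = Open String | Close String | Text String
  deriving (Eq, Show)

-- Hypothetical tree type: an element with children, or a text leaf.
data Tree = Node String [Tree] | Leaf String
  deriving (Eq, Show)

-- Parse sibling nodes up to the next close tag (or end of input).
-- Returns the siblings and the input remaining after that close tag.
-- The let-bound tuple patterns are lazy, so the head of the result
-- list is available before the rest of the input is examined.
forest :: [Tag] -> ([Tree], [Tag])
forest [] = ([], [])
forest (Close _ : rest) = ([], rest)
forest (Text s : rest) =
  let (sibs, rest') = forest rest
  in (Leaf s : sibs, rest')
forest (Open n : rest) =
  let (kids, afterKids) = forest rest       -- children of this element
      (sibs, rest')     = forest afterKids  -- nodes following its close tag
  in (Node n kids : sibs, rest')

-- Build a forest from a whole tag stream.
tagTree :: [Tag] -> [Tree]
tagTree = fst . forest
```

With this shape, the name of the first element of
tagTree (Open "html" : undefined) can be inspected without touching the
undefined tail. The sketch says nothing about memory use, which is a separate
question from how early nodes are produced.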

I've been talking to the author of the Java TagSoup (http://tagsoup.info),
which does very clever processing of HTML to make it as structured and
normalised as possible. He said:
...
The schema that describes HTML can be found at
src/definitions/html.tssl in the source archive; I'll be glad to explain
any obscurities in it.
There are also some slides on his website (at the bottom) which detail
the Java TagSoup approach to reconstructing HTML, and have obviously
had a lot of thought put into them!

Thanks

Neil

Neil Mitchell