
Hi Uwe,
BTW: I've taken the tagsoup lib and written a small parser to build a tree out of the stream of tags. It's about 100 lines of code. This DOM parser does not need to read until the closing tag to build an element node, so it should be as lazy as possible. A first version for HTML already runs on my box, but it still needs a bit of testing.
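To illustrate the idea of emitting an element node before its closing tag has been read, here is a minimal sketch. It is not Uwe's parser and does not use the real tagsoup types: `Tag`, `Tree`, `forest` and `go` are hypothetical names made up for this example, and the tag type is deliberately simplified (no attributes, no checking that closing tag names match).

```haskell
-- Hypothetical, simplified tag type (the real tagsoup library has its own).
data Tag = Open String | Close String | Text String
  deriving (Eq, Show)

-- Hypothetical tree type: an element with children, or a text leaf.
data Tree = Node String [Tree] | Leaf String
  deriving (Eq, Show)

-- Build a forest from a tag stream. Only the tags actually demanded by the
-- consumer are inspected.
forest :: [Tag] -> [Tree]
forest = fst . go

-- 'go' returns the trees at the current nesting level together with the
-- tags left over after that level's closing tag. The let-bound tuple
-- patterns are lazy, so 'Node n kids' is available as soon as the opening
-- tag has been seen; 'kids' is only computed on demand.
-- Simplification: any closing tag ends the current level, whatever its name.
go :: [Tag] -> ([Tree], [Tag])
go []               = ([], [])
go (Close _ : rest) = ([], rest)
go (Text s  : rest) =
  let (sibs, rest') = go rest
  in  (Leaf s : sibs, rest')
go (Open n  : rest) =
  let (kids, afterKids) = go rest       -- children of this element
      (sibs, afterSibs) = go afterKids  -- siblings following it
  in  (Node n kids : sibs, afterSibs)

main :: IO ()
main = do
  -- A complete, well-nested stream.
  print (forest [Open "p", Text "hi", Open "b", Text "x", Close "b", Close "p"])
  -- The root element's name is available although the rest of the stream
  -- is undefined, i.e. nothing after the opening tag is ever read.
  case forest (Open "html" : undefined) of
    (Node name _ : _) -> putStrLn name
    _                 -> putStrLn "no element"
```

The second case in `main` is the interesting one: a strict tree builder would have to consume the stream up to the matching closing tag before it could return the root node.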
Please send a patch with whatever you come up with, so others can make use of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs version, which does as well as it can with tag matching, but is entirely strict. Having a lazy version would be great. I've been talking to the author of the Java TagSoup (http://tagsoup.info), which does very clever processing of HTML to make it as structured and normalised as possible. He said:
The schema that describes HTML can be found at src/definitions/html.tssl in the source archive; I'll be glad to explain any obscurities in it.
There are also some slides on his website (at the bottom) which detail the Java TagSoup approach to reconstructing HTML, and have obviously had a lot of thought put into them!

Thanks

Neil