
Hi

One of the problems with XML parsing is nesting. Consider this fragment:

    <foo>lots of text</foo>

The parser will naturally want to track all the way down to the closing </foo> in order to check the document is well formed, so it can put it in a tree. The problem is that means keeping "lots of text" in memory - often the entire document.

TagSoup works in a lazy streaming manner, so would parse the above as:

    [TagOpen "foo" [], TagText "lots of text", TagClose "foo"]

i.e. it hasn't matched the foo's, and can return the TagOpen before even looking at the text.
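
As a rough illustration of the streaming idea, here is a toy tokenizer using only the Prelude. It is a sketch, not TagSoup's actual code: the Token type and the tokens function are hypothetical names made up for this example, attributes are ignored, and no attempt is made to handle comments, entities or malformed input. The point is only that each token is produced from its own characters, with no matching of open and close tags.

```haskell
module Main where

-- A toy token type mirroring the shape of the tags shown above.
-- This is NOT TagSoup's own Tag type; attributes are always left empty here.
data Token
  = TagOpen String [(String, String)]
  | TagText String
  | TagClose String
  deriving (Eq, Show)

-- Lazily split the input into tokens. Each token depends only on its own
-- characters; nothing checks that open and close tags are balanced.
tokens :: String -> [Token]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in TagClose name : tokens (drop 1 rest')
tokens ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in TagOpen name [] : tokens (drop 1 rest')
tokens s =
  let (txt, rest) = break (== '<') s
  in TagText txt : tokens rest

main :: IO ()
main = do
  print (tokens "<foo>lots of text</foo>")
  -- The text after <foo> never ends, yet the first token is still available.
  print (take 1 (tokens ("<foo>" ++ repeat 'x')))
```

The second print terminates even though the text is infinite, because producing the TagOpen only requires reading as far as the first '>'. A tree-building parser could not return anything for that input.
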
XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap.
I'd love TagSoup to go faster, while retaining its laziness. Basic profiling doesn't suggest anything obvious, but I may have missed something. It's more likely that it would be necessary to prod at the Core level, or move to supporting both (Lazy)ByteString and [Char].

Thanks

Neil