
This solution seem to provide a practical alternative to pusing datatypes for streaming XML. http://gemo.futurs.inria.fr/events/PLANX2008/papers/p10.pdf Lev Walkin wrote:
Marc A. Ziegert wrote:
We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though. Simon,
is there a way, perhaps, to rewrite this expression to avoid leaks? An ad-hoc will do, perhaps split in two modules to avoid intramodular optimizations?
-- Lev Walkin
finally... there is a way! :D
hmm... this was a nice puzzle ;)
i've tried several times (and hours!) to implement a Continuation (not monad) based solution, but finally i developed this tricky but elegant foldr solution... i built the parser around this type: type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree String)]
it is based on the following thought: the tuple (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces the list ps to be processed entirely before rs. is equipollent to (fmap Right ps++[Left rs])::[Either [Rest] Processed] , but the latter is easier to handle ...at least if you can't trust the GC.
Marc, you are my hero of the month!
I can't say I understood this solution before applying it back to HXML-0.2, but it surely worked and made quite observable 20% difference in performance:
9.8 seconds on my 45 megabyte XML test, running in half the space (4m) compared to my parallel version based on Ketil Malde's suggestion (which was 12 seconds on two cores (though, one core was almost idling, `par` was used purely for its side-effect)).
To those who wants to parse XML in constant space, attached find a patch to HXML-0.2 which fixes a space leak.
However, I am still a bit surprized to discover there is not an order of magnitude difference between `par`-based version and your builder.
While the foldr-based builder is clearly superior, one can't help but wonder whether is it `par` that is so efficient compared to crunching through Eithers, or there's some other bottleneck in the code. Will profile a bit later.
The XML parsing space leak was declared in HXML back in 2000 and lingered in the code for 8 years. Good riddance!
------------------------------------------------------------------------
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe