XML (HXML) parsing :: GHC 6.8.3 space leak from 2000

Recently I had to process some multi-megabyte XML files. Tried a few Haskell XML parsers (HaXML, HXT, HXML) but all of them were exhibiting very pronounced space leaks, and all but HXML were too strict for my input. Judging by the code and stated objectives, Joe English's HXML (0.2, circa 2003) looked more promising for hacking around so I tried to figure out the space leak problem. It wasn't too long to find out the source of a problem, the buildTree function in TreeBuild.hs. In fact, the very annotation to that function reads as follows: -- %%% There is apparently a space leak here, but I can't find it. -- %%% Update 28 Feb 2000: There is a leak, but it's fixed -- %%% by a well-known GC implementation technique. Hugs 98 happens -- %%% not to implement this technique, but STG Hugs (and most other -- %%% Haskell systems) do implement it. -- %%% Thanks to Simon Peyton-Jones, Malcolm Wallace, Colin Runcinman -- %%% Mark Jones, and others for investigating this. And there's some more in the accompanying documentation: + Under Hugs 98 only, suffers a serious space fault. I wondered why would a contemporary GHC 6.8.3 exhibit such a leak? After all, the technique was known in 2000 (and afir by Wadler in '87) and one would assume Joe English's reference to "most other Haskell systems" ought to mean GHC. But here we are, in 2008 I still can't get HXML to not to leak like a hose while lazily parsing my file. In fact, I can't get my 45-megabyte file parsed on my 1GB RAM system without swapping. So I went ahead and extracted the code and stripped all XML related junk to reproduce the problem with a minimal test case. Attached please find a single tree.hs module which is just sufficient to demonstrate a memory leak. Here's a culprit function: data TreeEvent = Start String -- Branch off a new subtree | Stop -- Stop branching and return 1 level | Leaf String -- A simple leaf without children deriving Show -- Lazily build a tree out of a sequence of tree-building events build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], []) In fact, the attached module implements almost verbatim the code from an old Joe's request (circa 2000): http://www.cse.unsw.edu.au/~dons/haskell-1990-2000/msg06086.html but my version is a bit more self-sufficient for the new folks who'd like to quickly test it on their system. Am I really ignorant of some important GHC optimization options (tried -O2/-O3), or is this indeed a serious problem to tackle? -- Lev Walkin vlm@lionet.info module Main where -- A simple tree to hold hierarchial XML-like data structure data Tree a = Tree a [Tree a] deriving Show -- A member of the event sequence which attempts to build a tree. -- For example, the following sequence -- [Start "top", Leaf "leaf", Start "sub", Stop, Stop] -- should correspond to the following tree: -- Tree "top" [Tree "leaf" [], Tree "sub" []] data TreeEvent = Start String -- Branch off a new subtree | Stop -- Stop branching and return 1 level | Leaf String -- A simple leaf without children deriving Show -- Lazy printing of an infinite tree building process main = print . snd . build $ Start "top" : cycle [Leaf "sub"] -- Convert a stream of tree building events -- into a list of unconsumed events and a constructed tree body. type UnconsumedEvent = TreeEvent -- Alias for program documentation build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], []) -- Stricter version of build, never terminates on infinite input, -- but exhibits no space leaks whatsoever. build' f (Start str : es) = let (es', subnodes) = build' id es in build' ((Tree str subnodes) :) es' build' f (Leaf str : es) = build' ((Tree str []) :) es build' f (Stop : es) = (es, f []) build' f [] = ([], f [])

Lev Walkin
Recently I had to process some multi-megabyte XML files.
Join the club! FWIW, I ended up using tagsoup.
-- %%% There is apparently a space leak here, but I can't find it. -- %%% Update 28 Feb 2000: There is a leak, but it's fixed -- %%% by a well-known GC implementation technique.
I couldn't get this to work either. In particular, I think the GC trick should allow this without leakage: breaks p = groupBy (const (not.p)) But instead I implemented it as: breaks :: (a -> Bool) -> [a] -> [[a]] breaks p (x:xs) = let first = x : takeWhile (not.p) xs rest = dropWhile (not.p) xs in rest `par` first : if null rest then [] else breaks p rest breaks _ [] = [] With -smp, this doesn't leak. It's kind of annoying to have to rely on -smp in a library as the library cannot control how the applications get linked, but I've found no other solution. -k -- If I haven't seen further, it is by standing in the footprints of giants

Lev Walkin wrote:
I wondered why would a contemporary GHC 6.8.3 exhibit such a leak? After all, the technique was known in 2000 (and afir by Wadler in '87) and one would assume Joe English's reference to "most other Haskell systems" ought to mean GHC.
Thanks for this nice example - Don Stewart pointed me to it, and Simon PJ and I just spent some time this morning diagnosing it. Incedentally, with GHC 6.8 you can just run the program with "+RTS -hT" to get a basic space profile, there's no need to compile it for profiling - this is tremendously useful for quick profiling jobs. And in this case we see the the heap is filling up with (:) and Tree constructors, no thunks. Here's the short story: GHC does have the space leak optimisation you refer to, and it is working correctly, but it doesn't cover all the cases you might want it to cover. In particular, optimisations sometimes interact badly with the space leak avoidance, and that's what is happening here. We've known about the problem for some time, but this is the first time I've seen a nice small example that demonstrates it.
-- Lazily build a tree out of a sequence of tree-building events build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], [])
So here's the long story. Look at the first equation for build:
build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings))
this turns into x = build es es' = fst x subnodes = snd x y = build es' spill = fst y siblings = snd y now, it's the "siblings" binding we're interested in, because this one is never demanded - in this example, "subnodes" ends up being an infinite list of trees, and we never get to evaluate "siblings". So anything referred to by siblings will remain in the heap. The space-leak avoidance optimisation works on all those "fst" and "snd" bindings: in a binding like "siblings = snd y", when y is evaluated to a pair, the GC will automatically reduce "snd y", so releasing the first component of the pair. This all works fine. But the optimiser sees the above code and spots that es' only occurs once, in the right hand side of the binding for y, and so it inlines it. Now we have x = build es subnodes = snd x y = build (fst x) spill = fst y siblings = snd y Now, usually this is a good idea, but in this case we lost the special space-leak avoidance on the "fst x" expression, because it is now embedded in an expression. In fact in this case the thunk goes away entirely, because build is strict. But now, when the program runs, the thunk for siblings retains y, which retains x, which evaluates to a pair, the second component of which evaluates to an infintely growing list of Trees (the first components is a chain of "fst y" expressions that constantly get reduced by the GC and don't take up any space). We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though. Cheers, Simon

Simon Marlow wrote:
Lev Walkin wrote:
I wondered why would a contemporary GHC 6.8.3 exhibit such a leak? After all, the technique was known in 2000 (and afir by Wadler in '87) and one would assume Joe English's reference to "most other Haskell systems" ought to mean GHC.
Thanks for this nice example - Don Stewart pointed me to it, and Simon PJ and I just spent some time this morning diagnosing it.
Incedentally, with GHC 6.8 you can just run the program with "+RTS -hT" to get a basic space profile, there's no need to compile it for profiling - this is tremendously useful for quick profiling jobs. And in this case we see the the heap is filling up with (:) and Tree constructors, no thunks.
Here's the short story: GHC does have the space leak optimisation you refer to, and it is working correctly, but it doesn't cover all the cases you might want it to cover. In particular, optimisations sometimes interact badly with the space leak avoidance, and that's what is happening here. We've known about the problem for some time, but this is the first time I've seen a nice small example that demonstrates it.
-- Lazily build a tree out of a sequence of tree-building events build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], [])
[skip]
We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though.
Simon, is there a way, perhaps, to rewrite this expression to avoid leaks? An ad-hoc will do, perhaps split in two modules to avoid intramodular optimizations? -- Lev Walkin vlm@lionet.info

Lev Walkin wrote:
Simon Marlow wrote:
Lev Walkin wrote:
I wondered why would a contemporary GHC 6.8.3 exhibit such a leak? After all, the technique was known in 2000 (and afir by Wadler in '87) and one would assume Joe English's reference to "most other Haskell systems" ought to mean GHC.
Thanks for this nice example - Don Stewart pointed me to it, and Simon PJ and I just spent some time this morning diagnosing it.
Incedentally, with GHC 6.8 you can just run the program with "+RTS -hT" to get a basic space profile, there's no need to compile it for profiling - this is tremendously useful for quick profiling jobs. And in this case we see the the heap is filling up with (:) and Tree constructors, no thunks.
Here's the short story: GHC does have the space leak optimisation you refer to, and it is working correctly, but it doesn't cover all the cases you might want it to cover. In particular, optimisations sometimes interact badly with the space leak avoidance, and that's what is happening here. We've known about the problem for some time, but this is the first time I've seen a nice small example that demonstrates it.
-- Lazily build a tree out of a sequence of tree-building events build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], [])
[skip]
We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though.
Simon,
is there a way, perhaps, to rewrite this expression to avoid leaks? An ad-hoc will do, perhaps split in two modules to avoid intramodular optimizations?
Tried to avoid this misoptimization by using explicit fst, and it worked on my synthesized input (probably benefiting of CSE): build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (_, subnodes) = build es (spill, siblings) = build . fst . build $ es in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], []) However, while this solution works on a synthesized input (cycle [...]), it still has memory leak when taken into HXML environment which operates on files (why?). Only when I also added Ketil Malde's `par` based hack I finally was able to parse the big XML file without a space leak. Here's the diff to HXML 0.2: ====================================================================== --- TreeBuild.hs.old 2008-09-19 17:01:30.000000000 -0700 +++ TreeBuild.hs 2008-09-19 17:04:15.000000000 -0700 @@ -20,6 +20,7 @@ import XMLParse import XML import Tree +import Control.Parallel -- -- TODO: add basic error-checks: matching end-tags, ensure input exhausted @@ -43,8 +44,9 @@ addTree t es = let (s,es') = build es in pair (cons t s) es' build [] = pair nil [] build (e:es) = case e of - StartEvent gi atts -> let (c,es') = build es - in addNode (ELNode gi atts) c es' + StartEvent gi atts -> let (c, es') = build es + sbl = build . snd . build $ es + in sbl `par` (cons (tree (ELNode gi atts) c) (fst sbl), snd sbl) EndEvent _ -> pair nil es EmptyEvent gi atts -> addLeaf (ELNode gi atts) es TextEvent s -> addLeaf (TXNode s) es ======================================================================= With that, a 45 mb XML is parsed in constant space in G4 1.5GHz: 1 minute 48 seconds, taking 16 mb RAM Pentium D 2x3.0GHz: 12 seconds, taking 9 mb RAM Compared to 0.2s `wc -l`. If you * remove `par` from there or * replace (build . snd . build $ es) with just (es') or * forget to specify -threaded (-smp) during ghc compilation then the space leak will exhibit itself again. However, removing -threaded will still make this code run without leak on synthesized input (StartEvent "" [] : cycle [TextEvent ""]). I believe there's a way to get rid of `par`, perhaps by wrapping this tree building thing into a optimization-unfriendly monad? But I don't know how to approach this. Any help? -- Lev Walkin vlm@lionet.info

-- Lazily build a tree out of a sequence of tree-building events build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], [])
[skip]
We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though.
Simon,
is there a way, perhaps, to rewrite this expression to avoid leaks? An ad-hoc will do, perhaps split in two modules to avoid intramodular optimizations?
-- Lev Walkin
finally... there is a way! :D hmm... this was a nice puzzle ;) i've tried several times (and hours!) to implement a Continuation (not monad) based solution, but finally i developed this tricky but elegant foldr solution... i built the parser around this type: type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree String)] it is based on the following thought: the tuple (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces the list ps to be processed entirely before rs. is equipollent to (fmap Right ps++[Left rs])::[Either [Rest] Processed] , but the latter is easier to handle ...at least if you can't trust the GC. - marc ---------------example_context_free_grammar_parser.hs-------------------------- module Main where import Data.List data Tree a = Tree a [Tree a] deriving Show data TreeEvent = Start String -- Branch off a new subtree | Stop -- Stop branching and return 1 level | Leaf String -- A simple leaf without children deriving Show main = print . snd . build $ Start "top" : cycle [Leaf "sub"] --main = print . snd . build $ [Leaf "bla",Leaf "bla",Start "S(",Leaf "bli",Start "T(",Leaf "blu",Stop,Stop,Leaf "bla"] type UnconsumedEvent = TreeEvent -- Alias for program documentation build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build tes = let (ts_,ue_,_) = splitAtLeftDefault [] $ foldr builder [] [(te,ue)|ue@(te:_)<-tails tes] in (ue_,ts_) -- ^^^^^^^^^ -- a little change (bugfix?) to the space leaking solution... -- [Stop,Leaf "x"] now evaluates to ([Stop,Leaf "x"],[]) instead of ([Leaf "x"],[]) -- like this: build ue@(Stop:_) = (ue,[]) -- instead of: build (Stop : es) = (es,[]) type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree String)] builder :: FoldR_Builder builder (Stop,ue) euts = (Left ue:euts) builder (Leaf str,_) euts = (Right (Tree str []):euts) builder (Start str,_) euts = let (sub,_,euts') = splitAtLeftDefault [] euts in (Right (Tree str sub):euts') -- default value is needed iff the list is finite and contains no (Left _). splitAtLeftDefault :: a -> [Either a b] -> ([b],a,[Either a b]) splitAtLeftDefault a0 [] = ([],a0,[]) splitAtLeftDefault a0 (Right b:xs) = let (bs,a,es) = splitAtLeftDefault a0 xs in (b:bs,a,es) splitAtLeftDefault _ (Left a:xs) = ([],a,xs)

Marc A. Ziegert wrote:
We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though. Simon,
is there a way, perhaps, to rewrite this expression to avoid leaks? An ad-hoc will do, perhaps split in two modules to avoid intramodular optimizations?
-- Lev Walkin
finally... there is a way! :D
hmm... this was a nice puzzle ;)
i've tried several times (and hours!) to implement a Continuation (not monad) based solution, but finally i developed this tricky but elegant foldr solution... i built the parser around this type: type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree String)]
it is based on the following thought: the tuple (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces the list ps to be processed entirely before rs. is equipollent to (fmap Right ps++[Left rs])::[Either [Rest] Processed] , but the latter is easier to handle ...at least if you can't trust the GC.
Marc, you are my hero of the month! I can't say I understood this solution before applying it back to HXML-0.2, but it surely worked and made quite observable 20% difference in performance: 9.8 seconds on my 45 megabyte XML test, running in half the space (4m) compared to my parallel version based on Ketil Malde's suggestion (which was 12 seconds on two cores (though, one core was almost idling, `par` was used purely for its side-effect)). To those who wants to parse XML in constant space, attached find a patch to HXML-0.2 which fixes a space leak. However, I am still a bit surprized to discover there is not an order of magnitude difference between `par`-based version and your builder. While the foldr-based builder is clearly superior, one can't help but wonder whether is it `par` that is so efficient compared to crunching through Eithers, or there's some other bottleneck in the code. Will profile a bit later. The XML parsing space leak was declared in HXML back in 2000 and lingered in the code for 8 years. Good riddance! -- Lev Walkin vlm@lionet.info --- TreeBuild.hs.old 2008-09-23 05:48:50.000000000 -0700 +++ TreeBuild.hs 2008-09-23 05:49:37.000000000 -0700 @@ -20,6 +20,7 @@ import XMLParse import XML import Tree +import Data.List (tails) -- -- TODO: add basic error-checks: matching end-tags, ensure input exhausted @@ -31,28 +32,29 @@ -- %%% Haskell systems) do implement it. -- %%% Thanks to Simon Peyton-Jones, Malcolm Wallace, Colin Runcinman -- %%% Mark Jones, and others for investigating this. +-- %%% Update 23 Sep 2008: Leak-free solution is provided by Marc A. Ziegert buildTree :: [XMLEvent] -> Tree XMLNode buildTree = constructTree Tree (:) [] constructTree :: (XMLNode -> f -> t) -> (t -> f -> f) -> f -> [XMLEvent] -> t constructTree tree cons nil events = let - pair x y = (x,y) - addNode nd children es = addTree (tree nd children) es - addLeaf nd es = addTree (tree nd nil) es - addTree t es = let (s,es') = build es in pair (cons t s) es' - build [] = pair nil [] - build (e:es) = case e of - StartEvent gi atts -> let (c,es') = build es - in addNode (ELNode gi atts) c es' - EndEvent _ -> pair nil es - EmptyEvent gi atts -> addLeaf (ELNode gi atts) es - TextEvent s -> addLeaf (TXNode s) es - PIEvent tgt val -> addLeaf (PINode tgt val) es - CommentEvent txt -> addLeaf (CXNode txt) es - GERefEvent name -> addLeaf (ENNode name) es - ErrorEvent s -> error s -- %%% deal with this - in tree RTNode (fst (build events)) + -- Marc A. Ziegert has provided a leek-free solution + build tes = let (ts_,ue_,_) = splitAtLeftDefault [] $ foldr builder [] [(te,ue)|ue@(te:_)<-tails tes] in ts_ + builder (EndEvent _,ue) euts = (Left ue:euts) + builder (EmptyEvent gi atts,_) euts = (Right (tree (ELNode gi atts) nil):euts) + builder (TextEvent str,_) euts = (Right (tree (TXNode str) nil):euts) + builder (PIEvent tgt val,_) euts = (Right (tree (PINode tgt val) nil):euts) + builder (CommentEvent txt,_) euts = (Right (tree (CXNode txt) nil):euts) + builder (GERefEvent name,_) euts = (Right (tree (ENNode name) nil):euts) + builder (ErrorEvent s,_) euts = error s -- %%% deal with this + builder (StartEvent gi atts,_) euts = let (sub,_,euts') = splitAtLeftDefault [] euts + in (Right (tree (ELNode gi atts) sub):euts') + splitAtLeftDefault a0 [] = (nil,a0,[]) + splitAtLeftDefault a0 (Right b:xs) = + let (bs,a,es) = splitAtLeftDefault a0 xs in (cons b bs,a,es) + splitAtLeftDefault _ (Left a:xs) = (nil,a,xs) + in tree RTNode (build events) serializeTree :: Tree XMLNode -> [XMLEvent] serializeTree tree = sn tree [] where

This solution seem to provide a practical alternative to pusing datatypes for streaming XML. http://gemo.futurs.inria.fr/events/PLANX2008/papers/p10.pdf Lev Walkin wrote:
Marc A. Ziegert wrote:
We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though. Simon,
is there a way, perhaps, to rewrite this expression to avoid leaks? An ad-hoc will do, perhaps split in two modules to avoid intramodular optimizations?
-- Lev Walkin
finally... there is a way! :D
hmm... this was a nice puzzle ;)
i've tried several times (and hours!) to implement a Continuation (not monad) based solution, but finally i developed this tricky but elegant foldr solution... i built the parser around this type: type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree String)]
it is based on the following thought: the tuple (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces the list ps to be processed entirely before rs. is equipollent to (fmap Right ps++[Left rs])::[Either [Rest] Processed] , but the latter is easier to handle ...at least if you can't trust the GC.
Marc, you are my hero of the month!
I can't say I understood this solution before applying it back to HXML-0.2, but it surely worked and made quite observable 20% difference in performance:
9.8 seconds on my 45 megabyte XML test, running in half the space (4m) compared to my parallel version based on Ketil Malde's suggestion (which was 12 seconds on two cores (though, one core was almost idling, `par` was used purely for its side-effect)).
To those who wants to parse XML in constant space, attached find a patch to HXML-0.2 which fixes a space leak.
However, I am still a bit surprized to discover there is not an order of magnitude difference between `par`-based version and your builder.
While the foldr-based builder is clearly superior, one can't help but wonder whether is it `par` that is so efficient compared to crunching through Eithers, or there's some other bottleneck in the code. Will profile a bit later.
The XML parsing space leak was declared in HXML back in 2000 and lingered in the code for 8 years. Good riddance!
------------------------------------------------------------------------
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (4)
-
Ketil Malde
-
Lev Walkin
-
Marc A. Ziegert
-
Simon Marlow