
Recently I had to process some multi-megabyte XML files. Tried a few Haskell XML parsers (HaXML, HXT, HXML) but all of them were exhibiting very pronounced space leaks, and all but HXML were too strict for my input. Judging by the code and stated objectives, Joe English's HXML (0.2, circa 2003) looked more promising for hacking around so I tried to figure out the space leak problem. It wasn't too long to find out the source of a problem, the buildTree function in TreeBuild.hs. In fact, the very annotation to that function reads as follows: -- %%% There is apparently a space leak here, but I can't find it. -- %%% Update 28 Feb 2000: There is a leak, but it's fixed -- %%% by a well-known GC implementation technique. Hugs 98 happens -- %%% not to implement this technique, but STG Hugs (and most other -- %%% Haskell systems) do implement it. -- %%% Thanks to Simon Peyton-Jones, Malcolm Wallace, Colin Runcinman -- %%% Mark Jones, and others for investigating this. And there's some more in the accompanying documentation: + Under Hugs 98 only, suffers a serious space fault. I wondered why would a contemporary GHC 6.8.3 exhibit such a leak? After all, the technique was known in 2000 (and afir by Wadler in '87) and one would assume Joe English's reference to "most other Haskell systems" ought to mean GHC. But here we are, in 2008 I still can't get HXML to not to leak like a hose while lazily parsing my file. In fact, I can't get my 45-megabyte file parsed on my 1GB RAM system without swapping. So I went ahead and extracted the code and stripped all XML related junk to reproduce the problem with a minimal test case. Attached please find a single tree.hs module which is just sufficient to demonstrate a memory leak. Here's a culprit function: data TreeEvent = Start String -- Branch off a new subtree | Stop -- Stop branching and return 1 level | Leaf String -- A simple leaf without children deriving Show -- Lazily build a tree out of a sequence of tree-building events build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], []) In fact, the attached module implements almost verbatim the code from an old Joe's request (circa 2000): http://www.cse.unsw.edu.au/~dons/haskell-1990-2000/msg06086.html but my version is a bit more self-sufficient for the new folks who'd like to quickly test it on their system. Am I really ignorant of some important GHC optimization options (tried -O2/-O3), or is this indeed a serious problem to tackle? -- Lev Walkin vlm@lionet.info module Main where -- A simple tree to hold hierarchial XML-like data structure data Tree a = Tree a [Tree a] deriving Show -- A member of the event sequence which attempts to build a tree. -- For example, the following sequence -- [Start "top", Leaf "leaf", Start "sub", Stop, Stop] -- should correspond to the following tree: -- Tree "top" [Tree "leaf" [], Tree "sub" []] data TreeEvent = Start String -- Branch off a new subtree | Stop -- Stop branching and return 1 level | Leaf String -- A simple leaf without children deriving Show -- Lazy printing of an infinite tree building process main = print . snd . build $ Start "top" : cycle [Leaf "sub"] -- Convert a stream of tree building events -- into a list of unconsumed events and a constructed tree body. type UnconsumedEvent = TreeEvent -- Alias for program documentation build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String]) build (Start str : es) = let (es', subnodes) = build es (spill, siblings) = build es' in (spill, (Tree str subnodes : siblings)) build (Leaf str : es) = let (spill, siblings) = build es in (spill, Tree str [] : siblings) build (Stop : es) = (es, []) build [] = ([], []) -- Stricter version of build, never terminates on infinite input, -- but exhibits no space leaks whatsoever. build' f (Start str : es) = let (es', subnodes) = build' id es in build' ((Tree str subnodes) :) es' build' f (Leaf str : es) = build' ((Tree str []) :) es build' f (Stop : es) = (es, f []) build' f [] = ([], f [])