
Dear members, I am experiencing a space leak, which I suspect to be an instance of the problem addressed by Wadler before. I'd appreciate if someone here would take a look. Given the following datatype: data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show where the three constructors represent the start tag (<a> where a is a string), end tag (</a>), and text data, respectively, of an XML stream. (They are actually from the library HXML). The following function simply returns the same stream while doing a minimal amount of validation (ignoring the closing tag). idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm') The function idX returns a pair, where the first component is the processed stream, while the second component is the rest of the input. The intention is to thread the input and release processed events. If the function is used in a pipelined manner: print . fst . idX . parseInstance $ input where parseInstance :: String -> [XMLEvent] is a lexical analyser, I would expect that the input stream will be consumed one by one as the output is produced, and the program would use a constant amount of heap space (while keeping a program stack proportional to the depth of the XML tree). This was not the case, however. For some reason the heap grew linearly. The GHC profiler showed that most of the thunks are created by parseInstance, which implies that the entire input stream somehow still resides in memory. I was wondering where the space leak came from and suspected that it's the leak described in one of Philip Wadler's early paper Fixing Some Space Leaks With a Garbage Collector (1987). Consider idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') The body of the let clause might have actually been treated as (StartEvent a [] : ts ++ EndEvent a : fst (idX (snd strm)), snd (idX (snd strm))) Therefore strm will not be freed until the whole evaluation finishes. But since Wadler has addressed this problem a long time ago, I think the fix he proposed should have been implemented in GHC already. Was that really the source of the space leak? If so is there a way to fix it? Or is there another reason for the leak? * * * The function idX above actually resulted from fusing two functions: buildF parses a stream into a forest, while idF flattens the tree. data ETree = Element String [ETree] | Text String buildF :: [XMLEvent] -> ([ETree], [XMLEvent]) buildF [] = ([],[]) buildF (StartEvent a : strm) = let (ts, strm') = buildF strm (us, strm'') = buildF strm' in (Element a ts : us, strm'') buildF (EndEvent _ : strm) = ([], strm) buildF (TextEvent s : strm) = let (ts, strm') = buildF strm in (Text s : ts, strm') idF :: [ETree] -> [XMLEvent] idF [] = [] idF (Element a ts : us) = StartEvent a : idF ts ++ EndEvent a : idF us idF (Text s : ts) = TextEvent s : idF ts My original program was like print . idF . fst . buildF . parseInstance $ input and it had the same space leak. By accident (assuming that the input is always a single tree), I discovered that if I added a "wrap . head" into the pipe: print . idF . wrap . head . fst . buildF . parseInstance $ input where wrap a = [a], the heap residency would reduce by half! The output remains the same. My explanation is that applying a "head" means that (in the definition of buildF): buildF (StartEvent a : strm) = let (ts, strm') = buildF strm (us, strm'') = buildF strm' in (Element a ts : us, strm'') the "us" part , which contains a reference to strm, need not be kept. Thus the reduced heap residency. This seems to support that the space leak resulted from the same problem Wadler addressed before. But isn't that solved already in GHC? * * * I'd appreciate if someone could look into it. The actual program is available at http://www.psdlab.org/~scm/hx.tar.gz It actually uses HXML, a library by Joe (Joe English?) to do the parsing. The main program is hxpc.hs. There is a 1 MB sized sample input file size1m.xml. There are two simple scripts "recompile" and "runhxpc" for compiling and running the program. Thank you! sincerely, Shin-Cheng Mu

On Fri, 19 May 2006, Shin-Cheng Mu wrote:
idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm')
let ~(ts, strm') = idX strm ~(us, strm'') = idX strm' ?

Dear Henning, On May 19, 2006, at 6:16 PM, Henning Thielemann wrote:
On Fri, 19 May 2006, Shin-Cheng Mu wrote:
idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent]) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') let ~(ts, strm') = idX strm ~(us, strm'') = idX strm'
Oh, I just tried using lazy patterns for the case for StartEvent and TextEvent. But the space behaviour remained the same. :( sincerely, Shin-Cheng Mu

On May 19, 2006, at 6:16 PM, Henning Thielemann wrote:
let ~(ts, strm') = idX strm ~(us, strm'') = idX strm'
I seem to have found a partial solution to the problem. It's rather ugly, however, and I think there should be a better way. The original definition for one of the clauses was like this: idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') The intention was to thread strm through the recursive calls and free the items in strm as soon as they are processed. However, with this definition I needed about 8Mb of heap for a 1Mb input file. Since I suspected that delayed evaluation of us and strm'' keeps the pointer to strm around for too long, the natural solution seems to be forcing their evaluation. So I tried: idX (StartEvent a : strm) = case idX strm of (ts, strm') -> case idX strm' of (us, strm'') -> (StartEvent a [] : ts ++ EndEvent a : us, strm'') which made things worse. The heap size increased a bit more and I get a typical "bell" shaped heap profile suggesting idX cannot output anything before processing the whole input. So I have to allow idX to output something while consuming the input. What eventually worked was this messy mixture of let and case: idX (StartEvent a : strm) = let (xs, rest) = case idX strm of (ts, strm') -> let (ws, rest) = case idX strm' of (us, strm'') -> (us, strm'') in (ts ++ EndEvent a : ws, rest) in (StartEvent a [] : xs, rest) The heap residency reduced from about 8MB to around 80Kb. This is rather tricky, however. The following "nicer looking" version has a heap profile as bad as the worse case. idX (StartEvent a : strm) = let (ts,us,strm'') = case idXsp strm of (ts, strm') -> case idXsp strm' of (us, strm'') -> (ts, us, strm'') in (StartEvent a [] : ts ++ EndEvent a : us, strm'') And I cannot quite explain why (Anyone?). Is there a more structured solution? Can I leave the hard work to the compiler? sincerely, Shin-Cheng Mu

Shin-Cheng Mu wrote:
Dear members,
I am experiencing a space leak, which I suspect to be an instance of the problem addressed by Wadler before. I'd appreciate if someone here would take a look.
Given the following datatype:
data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show
where the three constructors represent the start tag (<a> where a is a string), end tag (</a>), and text data, respectively, of an XML stream. (They are actually from the library HXML). The following function simply returns the same stream while doing a minimal amount of validation (ignoring the closing tag).
idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm')
The function idX returns a pair, where the first component is the processed stream, while the second component is the rest of the input. The intention is to thread the input and release processed events.
If the function is used in a pipelined manner:
print . fst . idX . parseInstance $ input
I would not have written idX in this manner. My version is
data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show
idX' :: [XMLEvent] -> [XMLEvent] idX' = untilTags []
untilTags :: [String] -> [XMLEvent] -> [XMLEvent] untilTags [] [] = [] untilTags tags [] = error ("Did not find closing EndEvents " ++ show tags) untilTags tags (x:xs) = case x of StartEvent tag' -> x : untilTags (tag':tags) xs EndEvent tag' -> if null tags then error ("Unexpected EndEvent with tag " ++ show tag') else let (tag:tags') = tags in if tag == tag' then x : untilTags tags' xs else error ("Expected EndEvents " ++ show tags ++ " but found EndEvent " ++ show tag') TextEvent _ -> x : untilTags tags xs
Here the flow of the input "x : ..." to the output "x : ..." is more obvious, and the open tags are kept in a stack and used to check the closing tags. The memory usage will be only slightly higher than your idX due to keeping the list of open tag names. If you want to remove that overhead and assume the ending closing tags have correct strings, then you can keep a mere count of open tags:
countTags :: Int -> [XMLEvent] -> [XMLEvent] countTags 0 [] = [] countTags n [] = error ("Did not find the closing EndEvents " ++ show n) countTags n (x:xs) = case x of StartEvent tag' -> x : countTags (succ n) xs EndEvent tag' -> if 0 == n then error ("Unexpected EndEvent with tag " ++ show tag') else x : countTags (pred n) xs TextEvent _ -> x : countTags n xs

Shin-Cheng Mu
I was wondering where the space leak came from and suspected that it's the leak described in one of Philip Wadler's early paper Fixing Some Space Leaks With a Garbage Collector (1987).
But since Wadler has addressed this problem a long time ago, I think the fix he proposed should have been implemented in GHC already. Was that really the source of the space leak? If so is there a way to fix it? Or is there another reason for the leak?
I compiled and profiled your code using nhc98 instead of ghc. There was no space leak visible. Since nhc98 does implement the Wadler GC fix (with refinements by Sparud), I can only assume that this does indeed account for the space improvement over ghc. Regards, Malcolm
participants (4)
-
Chris Kuklewicz
-
Henning Thielemann
-
Malcolm Wallace
-
Shin-Cheng Mu