
Dear members, I am experiencing a space leak, which I suspect to be an instance of the problem addressed by Wadler before. I'd appreciate if someone here would take a look. Given the following datatype: data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show where the three constructors represent the start tag (<a> where a is a string), end tag (</a>), and text data, respectively, of an XML stream. (They are actually from the library HXML). The following function simply returns the same stream while doing a minimal amount of validation (ignoring the closing tag). idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm') The function idX returns a pair, where the first component is the processed stream, while the second component is the rest of the input. The intention is to thread the input and release processed events. If the function is used in a pipelined manner: print . fst . idX . parseInstance $ input where parseInstance :: String -> [XMLEvent] is a lexical analyser, I would expect that the input stream will be consumed one by one as the output is produced, and the program would use a constant amount of heap space (while keeping a program stack proportional to the depth of the XML tree). This was not the case, however. For some reason the heap grew linearly. The GHC profiler showed that most of the thunks are created by parseInstance, which implies that the entire input stream somehow still resides in memory. I was wondering where the space leak came from and suspected that it's the leak described in one of Philip Wadler's early paper Fixing Some Space Leaks With a Garbage Collector (1987). Consider idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') The body of the let clause might have actually been treated as (StartEvent a [] : ts ++ EndEvent a : fst (idX (snd strm)), snd (idX (snd strm))) Therefore strm will not be freed until the whole evaluation finishes. But since Wadler has addressed this problem a long time ago, I think the fix he proposed should have been implemented in GHC already. Was that really the source of the space leak? If so is there a way to fix it? Or is there another reason for the leak? * * * The function idX above actually resulted from fusing two functions: buildF parses a stream into a forest, while idF flattens the tree. data ETree = Element String [ETree] | Text String buildF :: [XMLEvent] -> ([ETree], [XMLEvent]) buildF [] = ([],[]) buildF (StartEvent a : strm) = let (ts, strm') = buildF strm (us, strm'') = buildF strm' in (Element a ts : us, strm'') buildF (EndEvent _ : strm) = ([], strm) buildF (TextEvent s : strm) = let (ts, strm') = buildF strm in (Text s : ts, strm') idF :: [ETree] -> [XMLEvent] idF [] = [] idF (Element a ts : us) = StartEvent a : idF ts ++ EndEvent a : idF us idF (Text s : ts) = TextEvent s : idF ts My original program was like print . idF . fst . buildF . parseInstance $ input and it had the same space leak. By accident (assuming that the input is always a single tree), I discovered that if I added a "wrap . head" into the pipe: print . idF . wrap . head . fst . buildF . parseInstance $ input where wrap a = [a], the heap residency would reduce by half! The output remains the same. My explanation is that applying a "head" means that (in the definition of buildF): buildF (StartEvent a : strm) = let (ts, strm') = buildF strm (us, strm'') = buildF strm' in (Element a ts : us, strm'') the "us" part , which contains a reference to strm, need not be kept. Thus the reduced heap residency. This seems to support that the space leak resulted from the same problem Wadler addressed before. But isn't that solved already in GHC? * * * I'd appreciate if someone could look into it. The actual program is available at http://www.psdlab.org/~scm/hx.tar.gz It actually uses HXML, a library by Joe (Joe English?) to do the parsing. The main program is hxpc.hs. There is a 1 MB sized sample input file size1m.xml. There are two simple scripts "recompile" and "runhxpc" for compiling and running the program. Thank you! sincerely, Shin-Cheng Mu