Space leak when returning pairs?

19 May 2006

      Dear members,

I am experiencing a space leak, which I suspect to be
an instance of the problem addressed by Wadler before.
I'd appreciate if someone here would take a look.

Given the following datatype:

  data XMLEvent = StartEvent String
                | EndEvent   String
                | TextEvent  String  deriving Show

where the three constructors represent the start tag
(<a> where a is a string), end tag (</a>), and text data,
respectively, of an XML stream. (They are actually from
the library HXML). The following function
simply returns the same stream while doing a minimal
amount of validation (ignoring the closing tag).

   idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
   idX [] = ([], [])
   idX (StartEvent a : strm) =
     let (ts, strm') = idX strm
         (us, strm'') = idX strm'
     in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
   idX (EndEvent _: strm) = ([], strm)
   idX (TextEvent s : strm) =
     let (ts, strm') = idX strm
     in (TextEvent s : ts, strm')

The function idX returns a pair, where the first component
is the processed stream, while the second component is the
rest of the input. The intention is to thread the input
and release processed events.

If the function is used in a pipelined manner:

    print . fst . idX . parseInstance $ input

where parseInstance :: String -> [XMLEvent] is a
lexical analyser, I would expect that the input stream will
be consumed one by one as the output is produced, and the
program would use a constant amount of heap space (while
keeping a program stack proportional to the depth of the
XML tree). This was not the case, however. For some reason
the heap grew linearly. The GHC profiler showed that most
of the thunks are created by parseInstance, which implies
that the entire input stream somehow still resides in memory.

I was wondering where the space leak came from and suspected
that it's the leak described in one of Philip Wadler's
early paper Fixing Some Space Leaks With a Garbage Collector (1987).
Consider

     idX (StartEvent a : strm) =
       let (ts, strm') = idX strm
           (us, strm'') = idX strm'
       in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

The body of the let clause might have actually been treated as

      (StartEvent a [] : ts ++
         EndEvent a : fst (idX (snd strm)),
       snd (idX (snd strm)))

Therefore strm will not be freed until the whole evaluation
finishes.

But since Wadler has addressed this problem a long time ago,
I think the fix he proposed should have been implemented in
GHC already. Was that really the source of the space leak?
If so is there a way to fix it? Or is there another reason
for the leak?

    *   *   *

The function idX above actually resulted from fusing two functions:
buildF parses a stream into a forest, while idF flattens the
tree.

    data ETree = Element String [ETree]
               | Text String

    buildF :: [XMLEvent] -> ([ETree], [XMLEvent])
    buildF [] = ([],[])
    buildF (StartEvent a : strm) =
      let (ts, strm') = buildF strm
          (us, strm'') = buildF strm'
      in (Element a ts : us, strm'')
    buildF (EndEvent _ : strm) = ([], strm)
    buildF (TextEvent s : strm) =
      let (ts, strm') = buildF strm
      in (Text s : ts, strm')

    idF :: [ETree] -> [XMLEvent]
    idF [] = []
    idF (Element a ts : us) =
      StartEvent a : idF ts ++ EndEvent a : idF us
    idF (Text s : ts) = TextEvent s : idF ts

My original program was like

     print . idF . fst . buildF . parseInstance $ input

and it had the same space leak.

By accident (assuming that the input is always a single tree),
I discovered that if I added a "wrap . head" into the pipe:

     print . idF . wrap . head . fst . buildF . parseInstance $ input

where wrap a = [a], the heap residency would reduce by half!
The output remains the same.

My explanation is that applying a "head" means that
(in the definition of buildF):

     buildF (StartEvent a : strm) =
      let (ts, strm') = buildF strm
          (us, strm'') = buildF strm'
      in (Element a ts : us, strm'')

the "us" part , which contains a reference to strm, need not
be kept. Thus the reduced heap residency.

This seems to support that the space leak resulted from
the same problem Wadler addressed before. But isn't
that solved already in GHC?

    *   *   *

I'd appreciate if someone could look into it. The actual program
is available at

    http://www.psdlab.org/~scm/hx.tar.gz

It actually uses HXML, a library by Joe (Joe English?) to
do the parsing. The main program is hxpc.hs. There is a
1 MB sized sample input file size1m.xml. There are two
simple scripts "recompile" and "runhxpc" for compiling
and running the program.

Thank you!

sincerely,
Shin-Cheng Mu

Shin-Cheng Mu

Henning Thielemann

Shin-Cheng Mu

Malcolm Wallace

Shin-Cheng Mu

Chris Kuklewicz

Malcolm Wallace

tags

participants (4)