Re: [Haskell-cafe] Memory leak in streaming parser

2 Apr 2007

      On Mon, 2007-04-02 at 13:54 +0100, Malcolm Wallace wrote:
...
An observation about your state setter functions, ...
You can shorten your code considerably by using the standard named-field
update syntax for exactly this task:
    setDecision :: String -> State -> State
    setDecision decision state = state { sDecision = decision }
If I do that, the run time insists on the state being "more" evaluated
before it changes that specific field. This kills streaming, enforcing
each production (including the top one) to be fully parsed before I
can access its generated tokens. So the GC won't be hanging on to
State objects, but memory still explodes - with unconsumed Token
objects. And there's no output from the program until it dies :-(
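
A minimal sketch of the strictness difference described above, assuming a hypothetical three-field State (only sDecision appears in this thread; sInput and sLine are made up for illustration). Record update is a case on the old state, so the new state cannot reach weak head normal form before the old one does. A field-by-field rebuild applies the constructor at once and leaves the old state behind unevaluated selector thunks:

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Hypothetical state; the real State in the package has more fields.
data State = State
  { sInput    :: String
  , sDecision :: String
  , sLine     :: Int
  }

-- Record update: forcing the result forces the old state first.
setDecisionUpdate :: String -> State -> State
setDecisionUpdate decision state = state { sDecision = decision }

-- Rebuild: the constructor is applied immediately; the old state is
-- only referenced from the unevaluated (sInput state) and (sLine state).
setDecisionRebuild :: String -> State -> State
setDecisionRebuild decision state =
  State (sInput state) decision (sLine state)

main :: IO ()
main = do
  -- 'undefined' stands in for an old state that is not yet computed.
  putStrLn (sDecision (setDecisionRebuild "rebuild: old state untouched" undefined))
  r <- try (evaluate (sDecision (setDecisionUpdate "unreachable" undefined)))
         :: IO (Either SomeException String)
  case r of
    Left _  -> putStrLn "update: old state was forced"
    Right s -> putStrLn s
```

Note that the rebuilt state can keep the old state reachable through those selector thunks until they are evaluated, so this shows the laziness trade-off, not a fix for the leak.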
...
Not only is it shorter, but it will often be much more efficient, since
the entire structured value is copied once, then a single field
updated, rather than being re-built piece-by-piece in 15 steps.
I know! Is there an efficient way to lazily modify just one record field?
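
One possible approach, offered as an assumption and not something proposed in this thread, is an irrefutable (lazy) pattern. The setter below is a sketch over a hypothetical three-field State; it returns a constructor immediately and only looks at the old state when one of the carried-over fields is demanded:

```haskell
-- Hypothetical state; only sDecision is taken from the thread.
data State = State
  { sInput    :: String
  , sDecision :: String
  , sLine     :: Int
  }

-- The ~ makes the match lazy: 'input' and 'line' are bound to thunks
-- that project out of the old state only when they are forced.
setDecisionLazy :: String -> State -> State
setDecisionLazy decision ~(State input _ line) = State input decision line

main :: IO ()
main =
  -- The old state is never inspected, so 'undefined' is harmless here.
  putStrLn (sDecision (setDecisionLazy "set without forcing" undefined))
```

This is no shorter than the hand-written setters and still allocates a thunk per carried-over field, so whether it counts as efficient depends on how many fields are later forced.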
...
You probably want to be strict in the state component, but not in the
output values of your monad.  So as well as replacing
    let ... in (finalState, rightResult)
with
    let ... in finalState  `seq`  (finalState, rightResult)
in the (>>=) method in your Monad instance (and in the separate defn of
For some strange reason, adding this didn't solve the problem - the GC
still refuses to collect the state objects. BTW, forcing the
evaluation of the intermediate states (originalState, leftState,
rightState etc.) doesn't help either.
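
For reference, a minimal sketch of the suggested shape of (>>=), assuming a plain state-passing Parser and an Int standing in for the real State (both hypothetical; the actual monad in the package is more involved). The names originalState, leftState, finalState and rightResult follow the ones used above; leftResult and step are made up:

```haskell
-- Hypothetical stand-in for the real State record.
type State = Int

newtype Parser a = Parser { runParser :: State -> (State, a) }

-- Functor and Applicative instances are required by current GHC.
instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> let (s', a) = p s in (s', f a)

instance Applicative Parser where
  pure a = Parser $ \s -> (s, a)
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad Parser where
  Parser left >>= f = Parser $ \originalState ->
    let (leftState, leftResult)   = left originalState
        (finalState, rightResult) = runParser (f leftResult) leftState
    -- Strict in the state, lazy in the output value.
    in finalState `seq` (finalState, rightResult)

-- Hypothetical primitive: bump the state, return a value untouched.
step :: a -> Parser a
step x = Parser $ \n -> (n + 1, x)

main :: IO ()
main = do
  -- Both output values are undefined, yet the final state is computed.
  let parser = step (undefined :: Int) >>= \_ -> step (undefined :: Int)
  print (fst (runParser parser 0))
```

The seq only brings finalState to weak head normal form; with a record State whose fields are lazy, the old states can still be reachable from unevaluated fields.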

I have tried to ensure that '>>=' and '/' allow the GC to
discard old states "as soon as possible", but I'm obviously missing
something. Is there a way to get more detailed retainer information
than what's available with '-hr'?
...
you might also need to make all the named fields of your State datatype
strict.
If I make any of them strict, streaming goes away :-(
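
A small sketch of why a strict field can block streaming, using two hypothetical records (none of these field names come from the package). With a strict field, the constructor cannot be built until that field is evaluated, so even reading an unrelated lazy field has to wait for it:

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Hypothetical states: identical except for the strictness of the depth.
data LazyState   = LazyState   { lTokens :: [Int], lDepth :: Int }
data StrictState = StrictState { sTokens :: [Int], sDepth :: !Int }

main :: IO ()
main = do
  -- Lazy field: tokens stream out although the depth is not computable yet.
  print (take 3 (lTokens (LazyState [1 ..] undefined)))
  -- Strict field: building the record forces the depth before any
  -- token can be read.
  r <- try (evaluate (take 3 (sTokens (StrictState [1 ..] undefined))))
         :: IO (Either SomeException [Int])
  case r of
    Left _   -> putStrLn "strict field blocked the token stream"
    Right ts -> print ts
```

Here undefined stands in for a value that depends on input that has not been parsed yet.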

Writing a streaming parser in Haskell is turning out to be much harder
than I originally expected. Every fix I tried so far either broke
streaming (memory blows up due to tokens) or had no effect (memory
blows up due to states). I am assuming that there's a magic point in
the middle where tokens are consumed and states are GC-ed... but it
has eluded me so far.

Thanks,

    Oren Ben-Kiki

P.S. I uploaded the package to Hackage. I added a debug-leak
production to make it easier to profile this with even fewer
productions involved:  ``yes '#' | yaml2yeast -p debug-leak''.