
I just created an initial version of a "streaming" parser, intended to serve as a reference parser for the YAML spec. Efficiency isn't the main concern; the priorities were to use the BNF productions from the spec without any changes (ok, with tiny, minor, insignificant changes :-) and to allow it to stream large files.

I'm happy to say I achieved the first goal: the parser .hs file simply #includes a file that is, for all intents and purposes, a BNF file.

I had less luck with the second goal. After putting a lot of effort into it, I have gotten to the point where the parser is "streaming". That is, I can "cat" a large YAML file to the wrapper program and it will "immediately" start spitting out parsed tokens. This took some doing, to ensure the resulting token stream is accessible while it is being generated, in the presence of potential parsing failures and, of course, parsing decision points. However, I must have done "too good a job" converting things to lazy form, because while the parser is streaming, it also hangs on to old state objects for no "obvious" reason. At least, the reason isn't obvious to me after profiling it in every way I could think of and trying to `seq` the "obvious" suspects. Adding a `seq` in the wrong place will, of course, stop it from being a streaming parser...

I'd take it as a kindness if someone with deeper knowledge of the Haskell internals would take a peek at it. It is packaged as a Cabal .tar.gz file at http://www.ben-kiki.org/oren/YamlReference - it includes the wrapper program and a regression-test program.

To watch it consume all available memory using a very small set of BNF productions, run:

    yes '#' | yaml2yeast -p s-l-comments

(This basically matches "l-comment*" - each "#\n" is a comment line.) You'll get a stream of parsed tokens on stdout and ever-climbing memory usage in top (or htop).

Any advice would be appreciated,

Oren Ben-Kiki
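P.S. For anyone who'd like to see the shape of the problem without downloading the package, below is a minimal, self-contained sketch of the leak pattern described above. The names (Token, State, tokens) are made up for illustration; this is not the actual YamlReference code. Tokens are emitted lazily, so the stream can be consumed while it is being generated, and a failure just ends the stream with an Error token; but the state threaded through the loop is never demanded, so each step's thunk retains every previous state.

    module Main where

    -- Hypothetical token and state types, much simpler than the real ones.
    data Token = Comment | LineBreak | Error String
      deriving Show

    newtype State = State { offset :: Int }

    -- Turn a stream of "#\n" comment lines into a lazy list of tokens.
    -- Each token is emitted before the rest of the input is examined, so
    -- the list can be consumed while it is being generated; a parse
    -- failure ends the list with an Error token, leaving everything
    -- parsed up to that point accessible.
    tokens :: State -> String -> [Token]
    tokens _ [] = []
    tokens s ('#' : '\n' : rest) =
      let s' = State (offset s + 2)
          -- Leak: nothing ever demands s', so this thunk retains s,
          -- whose thunk retains the state before it, and so on without
          -- bound. A fix that preserves streaming is to force the new
          -- state before emitting the next tokens:
          --   offset s' `seq` (Comment : LineBreak : tokens s' rest)
          -- whereas seq'ing the token list itself would force the whole
          -- parse and stop it from streaming.
      in Comment : LineBreak : tokens s' rest
    tokens _ _ = [Error "unexpected input"]

    main :: IO ()
    main = mapM_ print (tokens (State 0) (cycle "#\n"))

Running this as written prints tokens forever with ever-climbing memory, just like the "yes '#'" pipeline above; switching to the `seq`'d line in the comment keeps memory flat while still streaming.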