Memory leak in streaming parser

I just created an initial version of a "streaming" parser. This parser is intended to serve as a reference parser for the YAML spec. Efficiency isn't the main concern; the priority was to have it use the BNF productions from the spec without any changes (ok, with tiny, minor, insignificant changes :-) and to allow it to stream large files.

I'm happy to say I achieved the first goal: the parser's .hs file simply #includes a file that is, for all intents and purposes, a BNF file. I had less luck with the second goal. After putting a lot of effort into it, I have gotten to the point where it is "streaming": I can "cat" a large YAML file to the wrapper program and it will "immediately" start spitting out parsed tokens. This took some doing, to ensure the resulting token stream is accessible while it is being generated, in the presence of potential parsing failures and, of course, parsing decision points.

I must have done "too good a job" converting things to lazy form, because while the parser is streaming, it also hangs on to old state objects for no "obvious" reason. At least, it isn't obvious to me after profiling it in every way I could think of and trying to `seq` the "obvious" suspects. Adding a `seq` in the wrong place will of course stop it from being a streaming parser...

I'd take it as a kindness if someone with deeper knowledge of the Haskell internals would take a peek at it. It is packaged as a Cabal .tar.gz file at http://www.ben-kiki.org/oren/YamlReference - it includes the wrapper program and a regression-test program. To watch it consume all available memory using a very small set of BNF productions, run:

    yes '#' | yaml2yeast -p s-l-comments

(This basically matches "l-comment*" - each '#\n' is a comment line.) You'll get a stream of parsed tokens to stdout and ever-climbing memory usage in top (or htop).

Any advice would be appreciated,

Oren Ben-Kiki
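[Editorial note: the classic shape of this kind of leak, as a minimal sketch that is not taken from the parser itself. A lazy accumulator builds a chain of unevaluated thunks, each retaining the previous "state", exactly analogous to a parser state that is threaded lazily; forcing the accumulator at each step (`foldl'`) keeps it in constant space.]

```haskell
import Data.List (foldl')

-- Lazy accumulation: each step builds a thunk (((0+1)+x1)+x2)... that is
-- only forced at the very end, retaining every intermediate closure.
leaky :: [Int] -> Int
leaky = foldl (+) 0

-- Forcing the accumulator at each step releases the old state immediately.
strict :: [Int] -> Int
strict = foldl' (+) 0

main :: IO ()
main = print (strict [1 .. 1000000])
```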

"Oren Ben-Kiki"
I just created an initial version of a "streaming" parser. This parser is intended to serve as a reference parser for the YAML spec.
An observation about your state setter functions, e.g.

```haskell
setDecision :: String -> State -> State
setDecision decision state = State {
    sName      = state|>sName,
    sEncoding  = state|>sEncoding,
    sDecision  = decision,
    sLimit     = state|>sLimit,
    sForbidden = state|>sForbidden,
    sIsPeek    = state|>sIsPeek,
    sTokens    = state|>sTokens,
    sCommits   = state|>sCommits,
    sConsumed  = state|>sConsumed,
    sChars     = state|>sChars,
    sMessage   = state|>sMessage,
    sLine      = state|>sLine,
    sColumn    = state|>sColumn,
    sCode      = state|>sCode,
    sLast      = state|>sLast,
    sInput     = state|>sInput }
```

You can shorten your code considerably by using the standard named-field update syntax for exactly this task:

```haskell
setDecision :: String -> State -> State
setDecision decision state = state { sDecision = decision }
```

Not only is it shorter, but it will often be much more efficient, since the entire structured value is copied once, then a single field updated, rather than being re-built field by field.
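[Editorial note: a self-contained demo of the named-field update syntax above, using a cut-down `State` with only three of the thread's field names; the full record is assumed to behave the same way.]

```haskell
-- A reduced State record, just to demonstrate record-update syntax.
data State = State { sDecision :: String, sLine :: Int, sColumn :: Int }
  deriving (Eq, Show)

-- The one-line setter: copy `state`, replacing only sDecision.
setDecision :: String -> State -> State
setDecision decision state = state { sDecision = decision }

main :: IO ()
main = print (setDecision "try" (State "" 1 1))
```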
> I must have done "too good a job" converting things to lazy form because, while the parser is streaming, it also hangs on to old state objects for no "obvious" reason. At least it isn't obvious to me after profiling it in any way I could think of and trying to `seq` the "obvious" suspects. Adding a `seq` in the wrong place will of course stop it from being a streaming parser...
You probably want to be strict in the state component, but not in the output values of your monad. So as well as replacing

```haskell
let ... in (finalState, rightResult)
```

with

```haskell
let ... in finalState `seq` (finalState, rightResult)
```

in the `(>>=)` method of your Monad instance (and in the separate defn of /), you might also need to make all the named fields of your State datatype strict.

Regards,
Malcolm
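[Editorial note: a minimal sketch of both suggestions combined, using hypothetical names (`Parser`, `PState`, `bumpCol`) rather than the thread's actual code. The bind forces the outgoing state with `seq` while leaving the result value lazy, and the state's fields are declared strict with `!`.]

```haskell
-- Minimal state with strict (!) fields, as Malcolm suggests.
data PState = PState { sLine :: !Int, sColumn :: !Int }
  deriving (Eq, Show)

newtype Parser a = Parser { runParser :: PState -> (PState, a) }

instance Functor Parser where
  fmap f (Parser g) = Parser $ \s -> let (s', a) = g s in (s', f a)

instance Applicative Parser where
  pure a = Parser $ \s -> (s, a)
  Parser pf <*> Parser pa = Parser $ \s ->
    let (s1, f) = pf s
        (s2, a) = pa s1
    in s2 `seq` (s2, f a)

instance Monad Parser where
  Parser p >>= f = Parser $ \s ->
    let (s1, a) = p s
        (s2, b) = runParser (f a) s1
    in s2 `seq` (s2, b)  -- force the state; the result stays lazy

-- A toy action that advances the column.
bumpCol :: Parser ()
bumpCol = Parser $ \(PState l c) -> (PState l (c + 1), ())

main :: IO ()
main = print (fst (runParser (bumpCol >> bumpCol) (PState 1 1)))
```

Because the tuple's second component is untouched by the `seq`, parsed tokens can still be consumed lazily while old states are released as soon as each step completes.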
participants (2)

- Malcolm Wallace
- Oren Ben-Kiki