
I just created an initial version of a "streaming" parser, intended to serve as a reference parser for the YAML spec. Efficiency isn't the main concern; the priorities were to use the BNF productions from the spec without any changes (ok, with tiny, minor, insignificant changes :-) and to allow it to stream large files.

I'm happy to say I achieved the first goal: the parser .hs file simply #includes a file that is, for all intents and purposes, a BNF file.

I had less luck with the second goal. After putting a lot of effort into it, I have gotten to the point where the parser is "streaming". That is, I can "cat" a large YAML file to the wrapper program and it will "immediately" start spitting out parsed tokens. This took some doing, to ensure the resulting token stream is accessible while it is being generated, in the presence of potential parsing failures and, of course, parsing decision points. However, I must have done "too good a job" converting things to lazy form, because while the parser is streaming, it also hangs on to old state objects for no "obvious" reason. At least, the reason isn't obvious to me after profiling it in every way I could think of and trying to `seq` the "obvious" suspects. Adding a `seq` in the wrong place will, of course, stop it from being a streaming parser...

I'd take it as a kindness if someone with deeper knowledge of the Haskell internals would take a peek at it. It is packaged as a Cabal .tar.gz file at http://www.ben-kiki.org/oren/YamlReference - it includes the wrapper program and a regression-test program.

To watch it consume all available memory using a very small set of BNF productions, run:

    yes '#' | yaml2yeast -p s-l-comments

(This basically matches "l-comment*" - each "#\n" is a comment line.) You'll get a stream of parsed tokens on stdout and ever-climbing memory usage in top (or htop).

Any advice would be appreciated,

Oren Ben-Kiki
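P.S. For anyone who'd like to see the shape of the problem without downloading the package, below is a minimal, self-contained sketch of the leak pattern described above. The names (Token, State, tokens) are made up for illustration; this is not the actual YamlReference code. Tokens are emitted lazily, so the stream can be consumed while it is being generated, and a failure just ends the stream with an Error token; but the state threaded through the loop is never demanded, so each step's thunk retains every previous state.

    module Main where

    -- Hypothetical token and state types, much simpler than the real ones.
    data Token = Comment | LineBreak | Error String
      deriving Show

    newtype State = State { offset :: Int }

    -- Turn a stream of "#\n" comment lines into a lazy list of tokens.
    -- Each token is emitted before the rest of the input is examined, so
    -- the list can be consumed while it is being generated; a parse
    -- failure ends the list with an Error token, leaving everything
    -- parsed up to that point accessible.
    tokens :: State -> String -> [Token]
    tokens _ [] = []
    tokens s ('#' : '\n' : rest) =
      let s' = State (offset s + 2)
          -- Leak: nothing ever demands s', so this thunk retains s,
          -- whose thunk retains the state before it, and so on without
          -- bound. A fix that preserves streaming is to force the new
          -- state before emitting the next tokens:
          --   offset s' `seq` (Comment : LineBreak : tokens s' rest)
          -- whereas seq'ing the token list itself would force the whole
          -- parse and stop it from streaming.
      in Comment : LineBreak : tokens s' rest
    tokens _ _ = [Error "unexpected input"]

    main :: IO ()
    main = mapM_ print (tokens (State 0) (cycle "#\n"))

Running this as written prints tokens forever with ever-climbing memory, just like the "yes '#'" pipeline above; switching to the `seq`'d line in the comment keeps memory flat while still streaming.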