
On Tue, 2007-04-10 at 12:14 +0200, apfelmus wrote:
Oren Ben-Kiki wrote:
The code is in http://hpaste.org/1314#a1 if anyone at all is willing to take pity on me and explain what I'm doing wrong.
There is an important point to note about streaming, namely that it conflicts with knowing whether the input is syntactically correct or not.
True, this is the core issue. Think of a Turing machine processing an infinite tape: it writes output while working towards a final result, and it is possible to examine the output tape before the final result is known. Haskell parsers insist on keeping the "output tape" in memory until the final result is known, and then hand it to you as one huge object when done (or discard it and just give you the error message). NOT a good idea if your input tape is a 200 MB file!
You've probably noticed this already and introduced rToken for that reason but I'm not sure.
The idea is that the parser generates a stream of tokens. If/when it hits an error, you get an error token at the end and parsing stops.
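As a minimal sketch of that idea (the Token and scan names below are invented for illustration, not the types from the paste), the error simply becomes the last element of an otherwise lazy token stream:

    -- Hypothetical sketch, not the code from the paste: an error is just
    -- the final element of a lazily produced token stream.
    data Token = Token Char          -- an ordinary token
               | ErrorToken String   -- emitted once, where parsing stops
        deriving Show

    -- Tokenize one character at a time; stop at the first '!'.
    scan :: String -> [Token]
    scan []       = []
    scan ('!':_)  = [ErrorToken "unexpected '!'"]
    scan (c:rest) = Token c : scan rest

    -- The consumer can start printing before knowing if the input is valid.
    main :: IO ()
    main = mapM_ print (scan ("abc" ++ "!" ++ "never reached"))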
Chasing down my memory leak I got into a weird situation where adding a magic manual SCC section and compiling with -prof makes the leak disappear.
That sounds fishy. Note that manual SCC annotations + optimization currently is a bit buggy in GHC, in the sense that optimization disturbs correct cost attribution.
Er... I just added '-prof'. No '-O' flag at all so no optimizations, right?
Does the program eat the output of yes 'a' without the memory explosion shown by "top"? If so, it could perhaps be a problem in the optimization phase, which maybe expands
D.append parsed (case result of ...)
to
case result of { Nothing -> D.append parsed D.empty
               ; Just .. -> D.append parsed (D.singleton ...) }
The latter won't stream because of the reason noted above.
Ah, but it *does* stream! This is the beauty of lazy evaluation. The "free" code is basically:

    reply  = parse ...         -- Lazily evaluated
    tokens = rTokens reply     -- Has some values "immediately"
    list   = D.toList tokens   -- Has some values "immediately"
    mapM_ print list           -- Starts printing "immediately"!

Where what "parse" does is:

    reply = Reply { rTokens = D.append concreteToken lazyTokens  -- "Immediate"
                  , rResult = lazyResult  -- Only available at end of parsing
                  }

So every time the parser calls "D.append" into the tokens, the printing "D.toList" is able to extract, print and GC the token immediately. And this works perfectly, with constant memory consumption.

The problem occurs when I peek at the final "rResult" field. The "leak" code says:

    reply  = parse ...             -- Lazily evaluated
    result = rResult reply         -- Lazy; has a value when parsing is done
    extra  = case result ...       -- Lazy; has a value when parsing is done
    parsed = rTokens reply         -- Has some values "immediately"
    tokens = D.append parsed extra -- Has some values "immediately"
    list   = D.toList tokens       -- Has some values "immediately"
    mapM_ print list               -- Starts printing "immediately"!

This *still* starts printing tokens immediately. However, while in the previous case the GC is smart enough to keep the program at a constant memory size, in the second case it for some reason starts missing more and more PAP objects, so memory usage goes through the roof.
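For comparison, here is a stripped-down, self-contained sketch of the two wirings (the toy parse below only mirrors the shape described here, not the actual code in the paste):

    import qualified Data.DList as D

    data Reply = Reply { rTokens :: D.DList Char
                       , rResult :: Maybe String }

    -- Toy stand-in for the real parser: one token per input character;
    -- the final result only exists once the whole input is consumed.
    parse :: String -> Reply
    parse []     = Reply { rTokens = D.empty, rResult = Just "done" }
    parse (c:cs) = let rest = parse cs
                   in Reply { rTokens = D.cons c (rTokens rest)
                            , rResult = rResult rest }

    -- "Free" wiring: never looks at rResult; tokens stream straight out.
    runFree :: String -> IO ()
    runFree input = mapM_ print (D.toList (rTokens (parse input)))

    -- "Leak" wiring: appends an extra token derived from rResult; extra
    -- stays an unevaluated thunk over result (and hence over reply) until
    -- the stream is exhausted.
    runLeak :: String -> IO ()
    runLeak input =
        let reply  = parse input
            result = rResult reply
            extra  = case result of
                       Nothing -> D.empty
                       Just _  -> D.singleton '$'
            tokens = D.append (rTokens reply) extra
        in mapM_ print (D.toList tokens)

    main :: IO ()
    main = runLeak (replicate 1000000 'a')   -- swap in runFree to compare

Whether this toy version reproduces the real program's blow-up will of course depend on the compiler version and flags, but the structural difference between runFree and runLeak is exactly the one described above.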
But it's highly unlikely that GHC performs a transformation that changes semantics this radically.
You'd think... but the fact of the matter is that while the first version works fine, the second doesn't, UNLESS I add the magic SCC section:

    extra = {-# SCC "magic" #-} case result ...

And compile with '-prof' (no '-O' flags). Then it somehow, finally, "gets the idea" and the program runs perfectly well with constant memory consumption. Which, as you aptly put it, is very "fishy" indeed...
I can achieve the results I want in very short elegant code...
In my opinion, your streaming parser is not as elegant as it could be which makes it hard to read the code.
Well... one point is that the above basically establishes two "threads", one printing tokens and one producing them, communicating through a sort of message queue (rTokens). I think that's pretty elegant compared to the hoops you need to jump through to do this in any other language :-) Another point is that this is a dumbed-down toy example. In the real program the parser does much more, which justifies the awkwardness you point out. Specifically, there are nifty ways to handle decision points (parsing isn't much good unless there are alternatives to be tested at each point :-).
With monad transformers, we almost have
Parser a ~= StateT State (WriterT (D.DList Token) Maybe) a
         ~= State -> (D.DList Token, Maybe (a,State))
Monad transformers are a bit beyond my grasp at this point, but from the little I know about them I don't see how they would help me with the GC/memory problem. They definitely might make the code even more elegant...
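For reference, a rough sketch of what that formulation could look like, assuming the mtl-style StateT/WriterT transformers (State, Token and the helpers below are placeholders, not the types from the paste):

    import Control.Monad.State
    import Control.Monad.Writer
    import qualified Data.DList as D

    type Token       = String
    type ParserState = String   -- e.g. the remaining input

    type Parser a = StateT ParserState (WriterT (D.DList Token) Maybe) a

    -- Append one token to the output stream.
    emit :: Token -> Parser ()
    emit t = tell (D.singleton t)

    -- A toy "syntax" action: consume one character, emit it as a token,
    -- fail at end of input (failure is the Maybe layer at the bottom).
    anyChar :: Parser ()
    anyChar = do
        s <- get
        case s of
          []     -> lift (lift Nothing)
          (c:cs) -> do put cs
                       emit [c]

    -- Unfolding the transformers:
    runParser :: Parser a -> ParserState -> Maybe ((a, ParserState), D.DList Token)
    runParser p s = runWriterT (runStateT p s)

Note that in this particular stacking the Maybe ends up on the outside of the unfolded type, so no tokens are available until success or failure has been decided; getting the token stream out lazily, as in the line quoted above, is exactly the streaming tension mentioned at the top of the thread.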
Also, I'm really missing type signatures, especially for many. When reading the code, I expected
many :: Parser a -> Parser [a]
but had to conclude
many :: Parser a -> Parser ()
by inspecting its code.
Sorry about that. Yes, all the "syntax" functions have this signature. I added them as www.hpaste.org/1314#a2.

Is it possible I have hit on a bug in GHC's implementation, and adding '-prof' somehow works around it?

Thanks,

Oren Ben-Kiki