Profiling makes memory leak go away? Is Haskell a practical language?

On Tue, 2007-04-10 at 12:14 +0200, apfelmus wrote:
Oren Ben-Kiki wrote:
The code is in http://hpaste.org/1314#a1 if anyone at all is willing to take pity on me and explain what I'm doing wrong.
There is an important point to note about streaming, namely that it conflicts with knowing whether the input is syntactically correct or not.
True, this is the core issue. Think of a turing machine processing an infinite tape. It is writing output and producing a final result; it is possible to examine the output tape before knowing the final result. Haskell parsers insist on having the "output tape" live in memory until the final result is known, and then give it to you as one huge object when done (or discard it and just give you the error message). NOT a good idea if your input tape is a 200 MB file!
You've probably noticed this already and introduced rToken for that reason but I'm not sure.
The idea is that the parser generates a stream of tokens. If/when it hits an error, you get an error token at the end and parsing stops.
Chasing down my memory leak I got into a weird situation where adding a magic manual SCC section and compiling with -prof makes the leak disappear.
That sounds fishy. Note that manual SCC annotations + optimization currently is a bit buggy in GHC, in the sense that optimization disturbs correct cost attribution.
Er... I just added '-prof'. No '-O' flag at all so no optimizations, right?
Does the program eat the output of yes 'a' without memory explosion showed by "top"? If so, it could perhaps be a problem of the optimization phase which maybe expands
D.append parsed (case result of ...)
to
case result of { Nothing -> D.append parsed D.empty Just .. -> D.append parsed (D.singleton...) }
The latter won't stream because of the reason noted above.
Ah, but it *does* stream! This is the beauty of lazy evaluation. The "free" code is basically: reply = parse ... -- Lazily evaluated tokens = rTokens reply -- Has some values "immediately" list = D.toList tokens -- Has some values "immediately" mapM_ list print -- Start printing "immediately"! Where what "parse" does is: reply = Reply { rTokens = D.append concreteToken lazyTokens, -- "Immediate" rResult = lazyResult -- Only available at end of parsing } So every time the parser calls "D.append" into the tokens, the printing "D.toList" is able to extract, print and GC the token immediately. And this works perfectly, with constant memory consumption. The problem occurs when I peek at the final "rResult" field. The "leak" code says: reply = parse ... -- Lazily evaluated result = rResult reply -- Lazy; has value when parsing is done extra = case result ... -- Lazy; has value when parsing is done parsed = rTokens reply -- Has some values "immediately" tokens = D.append parsed extra -- Has some values "immediately" list = D.toList tokens -- Has some values "immediately" mapM_ list print -- Starts printing "immediately"! This *still* starts printing tokens immediately. However, while in the previous case the GC is smart enough to keep the program in constant memory size, in the second case it for some reasons starts missing more and more PAP objects so memory usage goes through the roof.
But it's highly unlikely that GHC performs a transformation that changes semantics this radically.
You'd think... but the fact of the matter is that while the first version works fine, the second doesn't, UNLESS I add the magic SCC section: extra = {-# SCC "magic" #-} case result ... And compile with '-prof' (no '-O' flags). Then it somehow, finally, "get the idea" and the program runs perfectly well with constant memory consumption. Which, as you aptly put it, is very "fishy" indeed...
I can achieve the results I want in very short elegant code...
In my opinion, your streaming parser is not as elegant as it could be which makes it hard to read the code.
Well... one point is the above basically establishes two "threads", one printing tokens and one producing them. communicating through a sort of message queue (rTokens). I think that's pretty elegant compared to the hoops you need to jump through to do this in any other language :-) Another point is that this is a dumbed-down toy example. In the real program the parser does much more which justifies the awkwardness you point out. Specifically there are nifty ways to handle decision points (parsing isn't much good unless there are alternatives to be tested at each point :-).
With monad transformers, we almost have
Parser a ~= StateT State (WriterT (D.DList Token) Maybe) a ~= State -> (D.DList Token, Maybe (a,State))
Monad transformers are a bit beyond my grasp at this point, but from the little I know about them I don't see how they would help me with the GC/memory problem. They definitely might make the code even more elegant...
Also, I'm really missing type signatures, especially for many. When reading the code, I expected
many :: Parser a -> Parser [a]
but had to conclude
many :: Parser a -> Parser ()
by inspecting its code.
Sorry about that. Yes, all the "syntax" functions have this signature. I added these as www.hpaste.org/1314#a2. Is it possible I have hit on a bug in GHC's implementation, and adding '-prof' somehow caused it to work around it? Thanks, Oren Ben-Kiki

reply = parse ... -- Lazily evaluated tokens = rTokens reply -- Has some values "immediately" list = D.toList tokens -- Has some values "immediately" mapM_ list print -- Start printing "immediately"! .. reply = parse ... -- Lazily evaluated result = rResult reply -- Lazy; has value when parsing is done extra = case result ... -- Lazy; has value when parsing is done parsed = rTokens reply -- Has some values "immediately" tokens = D.append parsed extra -- Has some values "immediately" list = D.toList tokens -- Has some values "immediately" mapM_ list print -- Starts printing "immediately"!
This *still* starts printing tokens immediately. However, while in the previous case the GC is smart enough to keep the program in constant memory size, in the second case it for some reasons starts missing more and more PAP objects so memory usage goes through the roof.
is that a nail for this hammer, perhaps?-) http://hackage.haskell.org/trac/ghc/ticket/917 if you don't use rResult reply, reply can be used and freed as it is used. if you do use rResult reply, you are going to use it late, and that is going to hang on to reply, which is being expanded by the main thread of activities (rTokens). i'm just guessing here, but if that is indeed the problem, you would need to exert more control over what is evaluated when and shared where: - evaluate rResult synchronously with rTokens, instead of rResult long after rTokens has unfolded the reply - evaluate rResult independent of rTokens, on a separate copy of reply since you want to use parts of the output before you can be sure whether the whole input is correct, you might also want local errors instead of global ones (i've seen a correct chunk of input, here is the corresponding chunk of output; instead of here is a list of output chunks i've produced so far, i'll tell you later whether they are worth anything or whether they were based on invalid input).
You'd think... but the fact of the matter is that while the first version works fine, the second doesn't, UNLESS I add the magic SCC section:
extra = {-# SCC "magic" #-} case result ...
And compile with '-prof' (no '-O' flags). Then it somehow, finally, "get the idea" and the program runs perfectly well with constant memory consumption. Which, as you aptly put it, is very "fishy" indeed...
adding profiling might (another wild guess here..) lose sharing, just as in the ticket, i used \()->[..] to avoid sharing of the list. (although that guess wouldn't necessarily suggest this particular SCC to be useful, so perhaps it is the wrong track..) hth, claus

Claus Reinke wrote:
i'm just guessing here, but if that is indeed the problem, you would need to exert more control over what is evaluated when and shared where:
- evaluate rResult synchronously with rTokens, instead of rResult long after rTokens has unfolded the reply - evaluate rResult independent of rTokens, on a separate copy of reply
Brandon Michael Moore wrote:
It's the same problem you see in
--argument to break sharing input () = 'a' : input ()
main = let text = input() in putStr (text ++ [last text])
Ah, of course. That's what's going on:
result = rResult reply -- Lazy; has value when parsing is done extra = case result ... -- Lazy; has value when parsing is done parsed = rTokens reply -- Has some values "immediately"
On evaluating 'parsed', the DList of tokens gets expanded. But by keeping a reference to 'reply' via 'result', this expanded list has to be kept in memory! I mean, the definition of 'result' formally specifies that it depends on the hole 'reply' and the GC may not throw away parts of it. This is similar to the described space leaks in http://citeseer.ist.psu.edu/sparud93fixing.html I believe that a single deconstruction of the reply as in let tokens = case args of ["free"] -> rTokens reply ["leak"] -> case reply of Reply { rResult = result, rTokens = parsed } -> let extra = maybe (D.singleton $ Token {tText='!'}) (const D.empty) result in D.append parsed extra is a cure for your space leak. Regards, apfelmus

On Tue, Apr 10, 2007 at 11:03:32AM -0700, Oren Ben-Kiki wrote:
On Tue, 2007-04-10 at 12:14 +0200, apfelmus wrote:
Oren Ben-Kiki wrote:
The code is in http://hpaste.org/1314#a1 if anyone at all is willing to take pity on me and explain what I'm doing wrong.
There is an important point to note about streaming, namely that it conflicts with knowing whether the input is syntactically correct or not.
True, this is the core issue. Think of a turing machine processing an infinite tape. It is writing output and producing a final result; it is possible to examine the output tape before knowing the final result. Haskell parsers insist on having the "output tape" live in memory until the final result is known, and then give it to you as one huge object when done (or discard it and just give you the error message). NOT a good idea if your input tape is a 200 MB file!
It's nothing to do with Haskell or memory mangagement, you just can't decide whether the whole input is well-formed until you're done parsing, just like you can't in general decide if a Turing machine is going to terminate until it has. You have to accept not knowing whether the input is well-formed until you get to the end. There are two ways to do this that make it easy to get streaming right. One is to have a data structure that explicitly contains the possiblity of errors, like data ErrList a = Another a (ErrList a) | Done | Failed err Another is to return an ordinary structure containing values that will raise an error when examined, and wrap a catch around the code processing the streaming results. You might return for example a result [1,2,3,error "parse error at 10:3 blah blah blah"] You chose the most difficult way, returning immediately a structure that has a field that when examined blocks until the input is done and tells you if everything is valid. That's tricky becuase it's very easy to make that field be some unevaluated code that hangs onto the complete list of tokens and so on, something like (a thunk of) "first_line_valid && second_line_valid && ..." GHC doesn't just go out and evaluate thunks onces their dependencies arrive, because sometimes that's a bad idea, most obviously it it's something like an unevaluated infinite list, say [1..], which has no free parameters. It's the same problem you see in --argument to break sharing input () = 'a' : input () main = let text = input() in putStr (text ++ [last text]) As the infinite list is unfolded the thunk for "last text" is still hanging onto the beginning, so it can't be garbage collected. It happens that you can incrementally compute length as the list is unfolded, but it's somewhat beyond the compiler to figure that out for itself. But, you can fix it by writing a function that does both operations together: list_followed_by_length l = rec l 0 where rec (x:xs) len = len `seq` (x:rec xs (len + 1)) rec [] = show len Another option, if you're determined to be fancy, is to use the one case where GHC actually does decide to evaluate something a little bit during garbage collection. It's called a "selector thunk" - if a piece of unevaluated code is *exactly* (after optimization) case x of (_, .. , projected, ... _) -> projected, or an equivalent pattern match on another data type with just a single constructor it will be replaced by a direct reference to x as evaluation proceeds. If you want to go this way, add the -ddump-simpl flag to GHC and inspect the output, and see what adding -O or -O2 does to it. Brandon
participants (4)
-
apfelmus
-
Brandon Michael Moore
-
Claus Reinke
-
Oren Ben-Kiki