
On Tue, Apr 10, 2007 at 11:03:32AM -0700, Oren Ben-Kiki wrote:
On Tue, 2007-04-10 at 12:14 +0200, apfelmus wrote:
Oren Ben-Kiki wrote:
The code is in http://hpaste.org/1314#a1 if anyone at all is willing to take pity on me and explain what I'm doing wrong.
There is an important point to note about streaming, namely that it conflicts with knowing whether the input is syntactically correct or not.
True, this is the core issue. Think of a turing machine processing an infinite tape. It is writing output and producing a final result; it is possible to examine the output tape before knowing the final result. Haskell parsers insist on having the "output tape" live in memory until the final result is known, and then give it to you as one huge object when done (or discard it and just give you the error message). NOT a good idea if your input tape is a 200 MB file!
It's nothing to do with Haskell or memory mangagement, you just can't decide whether the whole input is well-formed until you're done parsing, just like you can't in general decide if a Turing machine is going to terminate until it has. You have to accept not knowing whether the input is well-formed until you get to the end. There are two ways to do this that make it easy to get streaming right. One is to have a data structure that explicitly contains the possiblity of errors, like data ErrList a = Another a (ErrList a) | Done | Failed err Another is to return an ordinary structure containing values that will raise an error when examined, and wrap a catch around the code processing the streaming results. You might return for example a result [1,2,3,error "parse error at 10:3 blah blah blah"] You chose the most difficult way, returning immediately a structure that has a field that when examined blocks until the input is done and tells you if everything is valid. That's tricky becuase it's very easy to make that field be some unevaluated code that hangs onto the complete list of tokens and so on, something like (a thunk of) "first_line_valid && second_line_valid && ..." GHC doesn't just go out and evaluate thunks onces their dependencies arrive, because sometimes that's a bad idea, most obviously it it's something like an unevaluated infinite list, say [1..], which has no free parameters. It's the same problem you see in --argument to break sharing input () = 'a' : input () main = let text = input() in putStr (text ++ [last text]) As the infinite list is unfolded the thunk for "last text" is still hanging onto the beginning, so it can't be garbage collected. It happens that you can incrementally compute length as the list is unfolded, but it's somewhat beyond the compiler to figure that out for itself. But, you can fix it by writing a function that does both operations together: list_followed_by_length l = rec l 0 where rec (x:xs) len = len `seq` (x:rec xs (len + 1)) rec [] = show len Another option, if you're determined to be fancy, is to use the one case where GHC actually does decide to evaluate something a little bit during garbage collection. It's called a "selector thunk" - if a piece of unevaluated code is *exactly* (after optimization) case x of (_, .. , projected, ... _) -> projected, or an equivalent pattern match on another data type with just a single constructor it will be replaced by a direct reference to x as evaluation proceeds. If you want to go this way, add the -ddump-simpl flag to GHC and inspect the output, and see what adding -O or -O2 does to it. Brandon