
2009/2/18 Tyson Whitehead
On February 17, 2009 19:24:44 Max Bolingbroke wrote:
2009/2/17 Tyson Whitehead
: (compiled with ghc 6.10 with options -O2 -ddump-simpl)
That should have been -ddump-stranal instead of -ddump-simpl.
Right. Mystery solved. In case you were wondering, the reason the other two lambdas in "digit" are not worker/wrappered is because GHC only performs strictness analysis on lambdas that occur /immediately/ in the binding. So increasing the arity of digit is CRUCIAL to getting your program to compile well, as that will cause: * Strictness analysis on the inner lambdas * Inlining of lvl_s1mF * Fewer updateable thunks to be created by callers (as we won't try to cache the results of a partial application)
I was wondering why lvl_s1mF is not being inlined into a_s1Gv in the core at the bottom of this email as that is the only place it is ever referenced.
The relevant GHC code is SimplUtils.preInlineUnconditionally. It looks like it dosen't get inlined for two reasons: 1) It's not a manifest lambda (it's an application) so inlining inside another lambda would change the number of times the FVs of lvl_s1mF might occur
I have to confess my ignorance here as my google fu failed and so I still don't know what a manifest lambda is (other than not a application). : )
Sorry :-). Manifest lambdas are just lambdas that occur manifestly! So e.g.: f = \x -> g x HAS a manifest lambda, whereas: f' = g Does NOT have one. No lambda immediately starts the binding. Lambdas, manifest or otherwise, are important in this context because any value with a leading lambda is certainly very cheap and so can be inlined inside other lambdas - which is what you are trying to achieve with lvl_s1mF. (snip code and explanation)
I just finished adding the Parse q Int type to help with email line wrapping. As I alluded to in my original email, if I don't have the Int overflow check in digit, it is not chosen as the loop breaker, all the StateT stuff is compiled away, and you get a really nice efficient assembler loop (which is important because the final FSM has to actually chew through GBs of data).
The part of the code under the first lambda in digit is as follows (I didn't keep the original dump, so the uniques have changed here). It's the second part of the Int overflow bounds check (i.e., y <= (maxBound-x)`div`10), and, indeed, something you don't want to compute unless the easy check fails.
Yes - GHC wants to share the work of (maxBound-x)`div`10 between several partial applications of "digit". This is usually a good idea, but in this case it sucks because it's resulted in a massively increased arity. IMHO GHC should fix this by: * Marking divInt# INLINE in the base library. This would result in your code would just containing uses of quotInt# * Making some operations cheap even if they may fail (PrimOp.primpOpIsCheap should change). Though this might mean that we turn non-terminating programs into terminating ones (such operations get pushed inside lambdas) but this is consistent with our treatment of lambdas generally. Actually, your divInt# call wouldn't even usually be floated out to between two lambdas, but at the time FloatOut runs there is something in between the \x lambda and the lambdas from the state monad - the monadic bind operator! So FloatOut feels free to move the computation for "x" up even though that >>= will go away as soon as we run the simplifier. What a disaster! For me, making digit INLINE fixes all this, but that's probably because the context of my code is not the same as yours (I had to invent parts of a program to bind the bs, q and n variables). For your immediate problem, I would suggest this bit of GHC witchdoctory: where digit :: Int -> ParseInt Int Int digit !x = do !y <- lift get ( if y <= (maxBound-9)`quot`10 || y <= ghc_hacks then let !y' = y*10+x in (lift $ put y') >> s2 else throwError "integer overflow" ) where {-# INLINE ghc_hacks #-} ghc_hacks = (maxBound-x)`div`10 With luck, this should make the loop-invariant cheap in GHC's eyes, preventing it from trying to share it. It works for me - but like I say things may differ in your program.
In general, the -ddump-inlinings flag is useful for working out why something wasn't inlined - but it wouldn't have helped you in this case, because it only dumps information about inlining at call sites, and you actually want an unconditional inlining to occur.
I also tried that, and didn't have much luck with it. I didn't understand the output, which there was 48k lines worth of, and the uniques kept changing which made it hard to grep for names from previous -ddump-simpl runs.
Right. I tend to use -ddump-inilnings and -dverbose-core2core (or -ddump-simpl) together, and I have a little script that partitions the output up by the "=====" seperators. Then I can look at the output of a phase (which is split into it's own file) and if I see something that should have been inlined I can go to the output of the previous simplifier run (also in it's own file) and grep for it's name in the -ddump-inlinings output of that phase. This means you don't have to match uniques between two entierly seperate runs, although annoyingly the output format for identifier in -ddump-inlinings is slightly different from that in Core output so grepping is not as easy as it could be. If you are interested in trying my script, it's available via Git at http://github.com/batterseapower/scripts/blob/da1f24ba16c27e3994aa66f9db352e... and is called as "ghc-dump-split ghc-dump-file", where ghc-dump-file results from redirection of GHC stdout+stderr to a file. Hope all that helps, Max