Performance: Faster to define a function writing out all arguments?

Hi there,

I am trying to write my first serious Haskell program, but I have problems understanding 'strange' performance results. It seems to be a GHC-specific question, so I am asking here.

In a Happy parser I have this code 1):

%monad { Parsed } { thenP } { returnP }

type Parsed = State Env.Env

returnP :: a -> Parsed a
returnP a = return a

thenP :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP x k = x >>= k

An alternative was this 2) (yes, redundant in Happy):

returnP :: a -> Parsed a
returnP = return

thenP :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP = (>>=)

And strangely enough, on my machine 1) is faster than 2) by a few percent, using GHC 6.8.2 with -O2.

A few percent might seem unimportant, but I am currently developing my Haskell style, and I want to write efficient code by default, as long as that doesn't lead to obfuscation. Now, first, I dislike using option 1) because it seems pointlessly verbose. Second, I have no clue why option 1) should be faster.

Compiling with ghc -O2 -C I get two code blocks for 1) (appended at the end of this mail), whereas I get nothing for 2). Unfortunately, that doesn't help me, as I have no clue what the generated code means. There are also various strictness annotations throughout the code to make parsing less memory hungry (and getting an intuition about what is appropriate here drives me crazy). Random guess: perhaps GHC's strictness analysis somehow benefits from the code in 1)? Or does GHC create specialized versions of return and >>= for State instantiated to Parsed, i.e. State Env.Env?

Anyhow, I am puzzled and grateful for any explanations and recommendations about what is going on and when to avoid style 2).

Thanks,
Alexander

C-- code blocks:

-------------------------------------------------------
EI_(DarwinziTptpziParser_a106_info);
StgWord DarwinziTptpziParser_a106_closure[] = { (W_)&DarwinziTptpziParser_a106_info };
static StgWord s8Ey_info[] = { 0x42U, 0x22U };
IF_(s8Ey_ret) {
  W_ _c8EL;
  FB_
  Sp[2] = *((P_)(R1.w+7));
  _c8EL = *((P_)(R1.w+3));
  R1.w = Sp[1];
  Sp[1] = _c8EL;
  Sp=Sp+1;
  JMP_((W_)&stg_ap_pp_fast);
  FE_
}
StgWord DarwinziTptpziParser_a106_info[] = { 0x30014U, 0x0, 0xfU };
EI_(DarwinziTptpziParser_a106_closure);
II_(s8Ey_info);
FN_(DarwinziTptpziParser_a106_entry) {
  FB_
  if ((W_)(((W_)Sp - 0x4U) < (W_)SpLim)) goto _c8EO;
  R1.w = *Sp;
  Sp[-1] = Sp[2];
  *Sp = (W_)&s8Ey_info;
  Sp=Sp-1;
  JMP_((W_)&stg_ap_p_fast);
_c8EO:
  R1.w = (W_)&DarwinziTptpziParser_a106_closure;
  JMP_(stg_gc_fun);
  FE_
}
-------------------------------------------------------

-------------------------------------------------------
II_(r6AT_info);
static StgWord r6AT_closure[] = { (W_)&r6AT_info, 0x0, 0x0, 0x0 };
static char c97W_str[] = "Internal Happy error\012";
static StgWord r6AT_info[] = { 0x0, 0x16U };
EI_(base_GHCziBase_unpackCStringzh_info);
IF_(r6AT_entry) {
  FB_
  if ((W_)(((W_)Sp - 0xcU) < (W_)SpLim)) goto _c97Z;
  Hp=Hp+2;
  if ((W_)((W_)Hp > (W_)HpLim)) goto _c97Z;
  Hp[-1] = (W_)&stg_CAF_BLACKHOLE_info;
  newCAF((void *)R1.w);
  R1.p[1] = (W_)Hp-4;
  *R1.p = (W_)&stg_IND_STATIC_info;
  Sp[-2] = (W_)&stg_upd_frame_info;
  Sp[-1] = (W_)Hp-4;
  Sp[-3] = (W_)&c97W_str;
  Sp=Sp-3;
  JMP_((W_)&base_GHCziBase_unpackCStringzh_info);
_c97Z:
  HpAlloc = 0x8U;
  JMP_(stg_gc_enter_1);
  FE_
}
EI_(base_GHCziErr_error_closure);
II_(r6AT_closure);
StgWord DarwinziTptpziParser_notHappyAtAll_srt[] = { (W_)&base_GHCziErr_error_closure, (W_)&r6AT_closure };
EI_(DarwinziTptpziParser_notHappyAtAll_info);
StgWord DarwinziTptpziParser_notHappyAtAll_closure[] = { (W_)&DarwinziTptpziParser_notHappyAtAll_info, 0x0, 0x0, 0x0 };
StgWord DarwinziTptpziParser_notHappyAtAll_info[] = { ((W_)&DarwinziTptpziParser_notHappyAtAll_srt+0), 0x0, 0x30016U };
EI_(base_GHCziErr_error_closure);
II_(r6AT_closure);
FN_(DarwinziTptpziParser_notHappyAtAll_entry) {
  FB_
  if ((W_)(((W_)Sp - 0xcU) < (W_)SpLim)) goto _c989;
  Hp=Hp+2;
  if ((W_)((W_)Hp > (W_)HpLim)) goto _c989;
  Hp[-1] = (W_)&stg_CAF_BLACKHOLE_info;
  newCAF((void *)R1.w);
  R1.p[1] = (W_)Hp-4;
  *R1.p = (W_)&stg_IND_STATIC_info;
  Sp[-2] = (W_)&stg_upd_frame_info;
  Sp[-1] = (W_)Hp-4;
  R1.w = (W_)&base_GHCziErr_error_closure;
  Sp[-3] = (W_)&r6AT_closure;
  Sp=Sp-3;
  JMP_((W_)&stg_ap_p_fast);
_c989:
  HpAlloc = 0x8U;
  JMP_(stg_gc_enter_1);
  FE_
}
-------------------------------------------------------
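(For experimenting with the question outside the generated parser, a minimal self-contained sketch of the two styles might look like the module below. Env here is a placeholder, not the real Env.Env from the Darwin sources; compiling the module with ghc -O2 -ddump-simpl lets you compare the Core GHC produces for each variant.)

-- A minimal sketch of the two definition styles, independent of the
-- Happy-generated parser.  Env is a stand-in for the real Env.Env.
module ParsedDemo where

import Control.Monad.State (State)

type Env = [(String, Int)]   -- placeholder environment
type Parsed = State Env

-- Style 1): arguments written out
returnP1 :: a -> Parsed a
returnP1 a = return a

thenP1 :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP1 x k = x >>= k

-- Style 2): point-free
returnP2 :: a -> Parsed a
returnP2 = return

thenP2 :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP2 = (>>=)

Comparing the -ddump-simpl output of the two variants shows whether GHC eta-expands the point-free definitions by itself or compiles them differently.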

Hi
And strangely enough on my machine 1) is faster by a few percent than 2).
Consider a few percent to be "noise". It may not really be a faster result, and it may not have anything to do with what you wrote.
A few percent might seem unimportant, but I am currently developing my Haskell style. And I want to write efficient code by default, if that doesn't lead to obfuscation.
Write clear code by default - GHC can usually make the leap between clear and efficient. You may want to read http://haskell.org/haskellwiki/Performance - but if you are still developing your Haskell style I'd not bother yet.

Thanks,
Neil

alexander.fuchs:
Hi there,
I am trying to write my first serious Haskell program, but have problems understanding 'strange' performance results. It seems to be a ghc specific question, so I am asking here.
In a happy parser I have this code 1):
%monad { Parsed } { thenP } { returnP }
type Parsed = State Env.Env
returnP :: a -> Parsed a
returnP a = return a

thenP :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP x k = x >>= k
An alternative was this 2) (yes, redundant in happy):
returnP :: a -> Parsed a
returnP = return

thenP :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP = (>>=)
Could you submit a minimal, complete grammar file, so we can look at the generated code? -- Don

Hi Don.

Don Stewart wrote:
alexander.fuchs:
In a happy parser I have this code 1):
%monad { Parsed } { thenP } { returnP }
type Parsed = State Env.Env
returnP :: a -> Parsed a
returnP a = return a

thenP :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP x k = x >>= k
An alternative was this 2) (yes, redundant in happy):
returnP :: a -> Parsed a
returnP = return

thenP :: Parsed a -> (a -> Parsed b) -> Parsed b
thenP = (>>=)
Could you submit a minimal, complete grammar file, so we can look at the generated code?
After reducing and simplifying the grammar file to the minimum for my sample input I can't see any difference in performance anymore. Due to other code changes the difference was almost non-existent even before the simplification. I blame laziness for my lack of intuition about what is going on ;-)

Anyway, as I am still wondering why GHC creates different code for

returnP a = return a
returnP = return

I put the grammar file online, in case you still want to have a look at it:

http://cs.uiowa.edu/~fuchs/DarwinCS/DarwinCS/Darwin/Tptp/Parser.y

The complete source (compiles with make opt) is here:

http://cs.uiowa.edu/~fuchs/DarwinCS/DarwinCS.tar.gz

And I use it on this input:

http://cs.uiowa.edu/~fuchs/DarwinCS/SYN854-1.p

with the command line:

time Darwin/Main +RTS -sstderr -RTS 600 SYN854-1.p

Sorry about the noise,
Alexander

alexander.fuchs:
Hi Don.
Could you submit a minimal, complete grammar file, so we can look at the generated code?
After reducing and simplifying the grammar file to the minimum for my sample input I can't see any difference in performance anymore. Due to other code changes the difference was almost non-existent even before the simplification. I blame laziness for my lack of intuition about what is going on ;-)
Anyway, as I am still wondering why GHC creates different code for

returnP a = return a
returnP = return
Ah, now I remember this coming up before.

Simon, is this a CAF issue? Or did it trigger the -fno-state-hack case? I've definitely run into the odd other case where point-freeing a bit of code messes with the complexity.

-- Don

| > Anyway, as I am still wondering why ghc creates different code for
| >
| > returnP a = return a
| > returnP = return
|
| Ah, now I remember this coming up before.
|
| Simon, is this a CAF issue? Or did it trigger the -fno-state-hack case?

I'm not sure. A small example would be good.

| I've definitely run into the odd other case where point-freeing
| a bit of code messes with the complexity.

That should not happen -- except for the state-hack. Which is this. Consider

f1 :: Char -> IO ()
f1 = \c. let i = ord c in \s. print i s

Here s :: State World. Is this equivalent to

f2 = \c. \s. print (ord c) s

The latter is very much more efficient than 'f1', because it doesn't build an intermediate lambda. This matters a lot in I/O-intensive programs. But if 'f' is called thus:

forever (f 'w')

then (ord 'w') is computed once for each call of f2, but is shared among all calls to f1. And that can make a big difference too.

I have no idea whether this is playing a role in the example you are discussing -- my guess is not, and there may be another performance bug lurking. So eliciting a small demo would be great.

Simon
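(Simon's f1/f2 example can be turned into a small stand-alone sketch; the module below is only illustrative and is not the demo from the Darwin parser. Debug.Trace makes the evaluation of the shared computation visible, but how often "computing" is printed depends on the optimisation level and on whether the state hack fires, so it is meant for experimenting with -O2 and -fno-state-hack rather than having one fixed expected output.)

-- Sketch of the sharing difference described above.  'trace' marks each
-- evaluation of the ord call so it can be counted in the output.
module StateHackDemo where

import Control.Monad (replicateM_)
import Data.Char (ord)
import Debug.Trace (trace)

-- f1 binds the result of 'ord' outside the IO action, so the thunk can be
-- shared between repeated runs of the same action ...
f1 :: Char -> IO ()
f1 c = let i = trace "computing" (ord c) in print i

-- ... whereas in f2 the computation sits inside the action.  With the
-- state hack GHC may rewrite f1 into this shape, trading sharing for a
-- cheaper closure.
f2 :: Char -> IO ()
f2 c = print (trace "computing" (ord c))

main :: IO ()
main = do
  putStrLn "f1:"
  replicateM_ 3 (f1 'w')
  putStrLn "f2:"
  replicateM_ 3 (f2 'w')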

Hi.

Simon Peyton-Jones wrote:
I've definitely run into the odd other case where point-freeing a bit of code messes with the complexity.
That should not happen -- except for the state-hack. Which is this:
Consider

f1 :: Char -> IO ()
f1 = \c. let i = ord c in \s. print i s
Here s :: State World. Is this equivalent to

f2 = \c. \s. print (ord c) s
The latter is very much more efficient than 'f1', because it doesn't build an intermediate lambda. This matters a lot in I/O-intensive programs. But if 'f' is called thus:
forever (f 'w')
then (ord 'w') is computed once for each call of f2, but is shared among all calls to f1. And that can make a big difference too.
Thanks for the explanation. State World here really means the IO and ST monads, not the State monad, right?

Compiling with -fno-state-hack does actually lead to both versions (the point-free and the explicit one) being equally fast (both take 14.4s, instead of 15.2s and 14.0s respectively), though I am not sure why. Running in profiled mode makes the difference vanish entirely.

I tested this on an Intel(R) Pentium(R) 4 CPU 3.00GHz and got the reported speed difference there. On an AMD Athlon(tm) 64 Processor 3800+ I got no difference even without -fno-state-hack.
I have no idea whether this is playing a role in the example you are discussing -- my guess is not, and there may be another performance bug lurking. So eliciting a small demo would be great.
I am sorry, I completely failed to do that. The parser builds up a data structure from several data types using memoization. Any significant simplification of the code base reduced the performance gap until it quickly disappeared.

Alexander
participants (4):
- Alexander Fuchs
- Don Stewart
- Neil Mitchell
- Simon Peyton-Jones