jhc vs ghc and the surprising result involving ghc generated assembly.

(I apologize in advance if this message seems self congradulatory, but after a long time of being disheartened by jhc only having marginal gains over ghc, I am finally seeing some substantial benefits, many of which are the result of optimizations that can actually be ported back to ghc) So, what started out as a simple pat on my own back for getting strictness and CPR finally working turned into an adventure in assembly language. An incidental comparason between jhc and ghc's output completly surprised me, for tight mathematical inner loops jhc is producing code that runs 3-7x faster than ghc. This surprised me because I expected them to be identical, they determine the same strictness and produce the same worker/wrapper split, but due to a quirk of how ghc generates C code, it ends up producing an ultimatly slower result. fortunatly, I think the problem is easy to mitigate. (and jhc loses its lead once again :) ) (all the following snippets are unedited/unreformatted output of the programs specified except for the removal of some profiling tags) the motivating example is the plain old basic accumulating factorial function: fac :: Int -> Int -> Int fac 1 r = r fac n r = fac (n - 1) (n*r) the strictness info is correctly infered that fac is strict in all its arguments and returns a CAF so it is translated into a nice and speedy version that takes unboxed arguments and returns unboxed results. here is the C code that jhc generates. (As an aside, I am very proud of how readable and how much structure the jhc generated C code preserves of the original haskell. it's a small thing, perhaps only implementors appreciate it, but I am glad I spent the time needed to do so.) /* fW@.fMain.fac */ static int fWXAXDfMainXDfac(int v94, int v95) { int v96; int v97; int v98; int v99; switch (v94) { case 1: return v95; break; default: v96 = v94; v97 = (int)(v94 - 1); v98 = (int)(v94 * v95); v99 = fWXAXDfMainXDfac(v97, v98); return v99; break; } } notice that besides being a bit verbose and using a tailcall, this is exactly what A C programmer would write. In fact, the generated assembly is quite nice and I think perhaps provably optimal even compared to hand-coded assembly :) jhc assembly output: fWXAXDfMainXDfac: .L108: cmpl $1, %edi je .L110 imull %edi, %esi decl %edi jmp .L108 .L110: movl %esi, %eax ret notice, a tiny inner loop, 4 instructions and one conditional jump. no memory acceses whatsoever. now, lets look at what ghc does with the same function: the generated C code: EI_(Main_zdwfac_info); FN_(Main_zdwfac_entry) { W_ _s27g; W_ _s27i; W_ _s27l; FB_ _s27g = *Sp; if (_s27g != 0x1UL) goto _c282; R1.p = (P_)(Sp[1]); Sp=Sp+2; JMP_(*Sp); _c282: _s27l = _s27g * (Sp[1]); _s27i = _s27g - 0x1UL; Sp[1] = _s27l; *Sp = _s27i; JMP_((W_)&Main_zdwfac_info); FE_ } it looks complicated, but what it effectivly does is pop the arguments off the stack, run the code but with an explicit jump rather than recursion and push the result back onto the stack. other than the stack stuff at the beginnig and end, we would expect this to get compiled to roughly the same assembly as the jhc version with a nice tight inner loop and just some stack futzing boilerplate. and now the generated assembly. Main_zdwfac_info: .text .align 8 .text movq (%rbp), %rdx cmpq $1, %rdx jne .L2 movq 8(%rbp), %r13 leaq 16(%rbp), %rbp movq (%rbp), %rax .L4: jmp *%rax .L2: movq %rdx, %rax imulq 8(%rbp), %rax movq %rax, 8(%rbp) leaq -1(%rdx), %rax movq %rax, (%rbp) movl $Main_zdwfac_info, %eax jmp .L4 ack! lets count what happens on each iteration of the loop: we have 5 (!) memory acceses and two jumps, one of them being indirect! in fact, each time through the loop, it loads the same values into the same registers. this is really bad compared to the jhc assembly. the basic issue is an interaction between ghc's use of global registers, a stack, and indirect calls. an indirect jump is very expensive. modern processors are pipelined, having many instructions in the queue at once, looking ahead and beginning to evaluate what is coming up. when an indirect jump is encountered, the CPU has no choice but to flush the whole pipeline because it has no idea where it goes, even with conditional branches cpus can predict with good accuracy which branch will be taken and at worst, it discards the pipeline, but with an indirect jump, there is no chance of it having a full pipeline. However the major issue is the following. %rbp is the global stack pointer pointing to the STG stack, since it is global it can be modified from anywhere, since gcc can't know if the function it jumps to modified said register it has no choice but to load all values relative to it every time through the loop because it may have changed. furthermore gotos and labels are very problematic for gcc to optimize around. for various tiresome reasons gcc cannot perform (most) code motion optimizations across explicit labels and gotos, especially when they deal with the global register variables and memory stores and loads. since these are arguably some of the most important optimizations, this is quite bad for the generated assembly. loop: if () goto loop; is not equivalent to a do-while loop, loop invarients cannot be hoisted out of the above for instance (except in some cases... it is all quite tricky and we want gcc to have as much freedom as possible). all in all, this makes the code generated by gcc compiling something generated by ghc not very good at all. there are a couple of things we can do to mitigate these problems: get rid of indirect jumps whenever possible. use C control constructs rather than gotos. A couple simple rules seem to help greatly. * turn anything of the form JMP_((W_)&self) where self is oneself into a goto that gotos a label at the beginning of the function. * do simple pattern matthing on the basic blocks to recognize where C control constructs can be placed. for instance turn if (x) { goto y; } blah.. baz.. JMP_(foo) into if (x) { goto y; } else { blah.. baz.. JMP_(foo) } extending the else to after the next jump or goto. * getting stack dereferences out of your loops. manually performing the first two optimizations mentioned above we get: EI_(Main_zdwfac_info); FN_(Main_zdwfac_entry) { W_ _s27g; W_ _s27i; W_ _s27l; FB_ fac_entry: _s27g = *Sp; if (_s27g != 0x1UL) { goto _c282; } else { R1.p = (P_)(Sp[1]); Sp=Sp+2; JMP_(*Sp); } _c282: _s27l = _s27g * (Sp[1]); _s27i = _s27g - 0x1UL; Sp[1] = _s27l; *Sp = _s27i; goto fac_entry; FE_ } and this produces the assembly: Main_zdwfac_info: .text .align 8 .text movq %rbp, %rdx movq (%rbp), %rcx cmpq $1, %rcx je .L3 .L6: movq %rcx, %rax imulq 8(%rdx), %rax movq %rax, 8(%rdx) leaq -1(%rcx), %rax movq %rax, (%rbp) .L4: movq %rbp, %rdx movq %rax, %rcx cmpq $1, %rax jne .L6 .L3: movq 8(%rdx), %r13 leaq 16(%rdx), %rbp jmp *(%rbp) we still have some unnecesarry memory accesses, but the indirect jump and the spurious jump are gone and we have less instructions in the main loop making this code noticably faster. in order to get rid of the unessesary memory accesess, we need to either 1. fully convert it to use C control constructs, so gcc will do it for us. (code motion and loop invarient being inhibited again by gotos) or 2. do it ourselves by analyzing when the consumer of what we are putting on the stack is ourselves. the first is greatly preferable but not always possible. These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them. to show they are actually optimizing the same function, here are both of their core representaions, other than syntax, they are identical: jhc core: (uses unicode, utf8 formatted) W@.fMain.fac∷int → int → int = λx9216∷int.λx9222∷int.(case x9216 of 1 → x9222 ; x9238∷case <(int)-(int,int) x9216 1∷int> of x9252∷int → case <(int)*(int,int) x9216 x9222∷int> of x34∷int → case W@.fMain.fac x9252 x34 of x9208∷int → x9208;;;;;) ghc core: %rec {zdwfac :: GHCziPrim.Intzh -> GHCziPrim.Intzh -> GHCziPrim.Intzh = \ (ww::GHCziPrim.Intzh) (ww1::GHCziPrim.Intzh) -> %case (GHCziPrim.Intzh) ww %of (ds::GHCziPrim.Intzh) {%_ -> Main.zdwfac (GHCziPrim.zmzh ds (1::GHCziPrim.Intzh)) (GHCziPrim.ztzh ds ww1); (1::GHCziPrim.Intzh) -> ww1}}; some random notes: the 3x-7x factor was tested on an i386, on x86_64 the margin is much much greater for reasons that are still unclear. while testing this I noticed that jhc and ghc have dramatically different results on x86_64 pretty much across the board, if their programs take comparable amounts of time on i386, the jhc one will run twice as fast as the ghc one on x86_64. I think ghc must be tickling the x86_64 in a particularly bad way for this dramatic of an effect to be observed. I will poke around the generated assembly of ghc some more and see if I can find the culprit. (this is also online at: http://repetae.net/jhc-vs-ghc-assembly.txt ) John -- John Meacham - ⑆repetae.net⑆john⑈

Nice analysis. I indeed found with phc that shadow stack references absolutely killed performance, and I aggressively cached stack locations in locals, spilling to stack only when GC information needed to be accurate. [There was a giant infrastructure to save only live data to stack, but we won't go into that now as it was the source of almost all the codegen bugs...] On Oct 26, 2005, at 5:43 AM, John Meacham wrote:
here is the C code that jhc generates. (As an aside, I am very proud of how readable and how much structure the jhc generated C code preserves of the original haskell. it's a small thing, perhaps only implementors appreciate it, but I am glad I spent the time needed to do so.)
This makes a big difference. The phc compiler even put comments in the code so that I could figure out what came from where.
v99 = fWXAXDfMainXDfac(v97, v98); return v99; ... notice that besides being a bit verbose and using a tailcall,
I'm impressed that gcc found this. It's definitely living a bit dangerously, and your suggestions below for self tail call handling are the ones I found most effective. (They also allowed me to bypass some prologue garbage, since phc used a one-C-function-per-Haskell- function model with internal resumption points.) Non-self tail calls I was careful to compile to: return f(...); I expect from the above that gcc does better at spotting tail calls now.
furthermore gotos and labels are very problematic for gcc to optimize around. for various tiresome reasons gcc cannot perform (most) code motion optimizations across explicit labels and gotos, especially when they deal with the global register variables and memory stores and loads. ...
there are a couple of things we can do to mitigate these problems:
get rid of indirect jumps whenever possible.
use C control constructs rather than gotos.
"for" loop introduction would be especially nice, but a bit tricky in practice I fear (requiring a game of "spot the induction variable").
A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself into a goto that gotos a label at the beginning of the function.
Or better yet, wrap the whole function in do { } while (1); and replace "JMP_" by "continue". This avoids the troubles with goto which John mentioned above. It made a difference for phc, at least. Of course, if you can introduce loops elsewhere you might get yourself into trouble with this solution.
* do simple pattern matthing on the basic blocks to recognize where C control constructs can be placed.
for instance turn
if (x) { goto y; } blah.. baz.. JMP_(foo)
into
if (x) { goto y; } else { blah.. baz.. JMP_(foo) }
extending the else to after the next jump or goto.
I'm surprised this actually helps, I must admit.
* getting stack dereferences out of your loops.
I recommend caching stack references in C locals where possible, but it's tricky to get this right if loop bodies include embedded function calls. Last I checked this wasn't an issue for GHC, since function calls were CPS-converted and only tight call-free loops ended up in a single function anyway.
in order to get rid of the unessesary memory accesess, we need to either
1. fully convert it to use C control constructs, so gcc will do it for us. (code motion and loop invarient being inhibited again by gotos)
As I recall, the "right" solution here is to compute dominator trees, and coalesce functions which are only tail called from their dominator into a single function. Alas, I've forgotten where I saw this written up, but there are indeed papers on it. Because it takes a bunch of effort on the part of the implementor, it'd be nice to see if its benefits are quantified.
These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them.
Making sure things are marked "static" and occur in an appropriate dependency order helps a bit here. It might even be worth marking some stuff "inline" in the code generator, though that's shaky ground. I actually considered making everything static and putting outwardly- visible functionality in an extern wrapper---effectively carrying worker-wrapper down to the C level.
some random notes:
the 3x-7x factor was tested on an i386, on x86_64 the margin is much much greater for reasons that are still unclear.
Does x86-64 use a register-based calling convention by default? If you compiled the i386 code using __regparm(2), would you see the same speed difference? -Jan-Willem Maessen

On Wed, Oct 26, 2005 at 12:24:14PM -0400, Jan-Willem Maessen wrote:
Nice analysis. I indeed found with phc that shadow stack references absolutely killed performance, and I aggressively cached stack locations in locals, spilling to stack only when GC information needed to be accurate. [There was a giant infrastructure to save only live data to stack, but we won't go into that now as it was the source of almost all the codegen bugs...]
phc? there is another haskell compiler out there?
This makes a big difference. The phc compiler even put comments in the code so that I could figure out what came from where.
yeah, that is something I would like to add. Unfortunatly I wasn't forward thinking enough to put annotation points everywhere I might eventually need them so I will have to go through and do that at some point :)
I'm impressed that gcc found this. It's definitely living a bit dangerously, and your suggestions below for self tail call handling are the ones I found most effective. (They also allowed me to bypass some prologue garbage, since phc used a one-C-function-per-Haskell- function model with internal resumption points.) Non-self tail calls I was careful to compile to: return f(...); I expect from the above that gcc does better at spotting tail calls now.
indeed. it is actually quite good at spotting tail calls, I thought it would be an issue but it has caught the important ones in generated code, and even done invarient hoisting out of the loop! it is not evident from the x86-64 assembly I posted since stuff comes in registers to begin with, but in the i386 it sets up everything outside the main loop and has the same memory access free little inner loop. little things like storing stuff in temporary variables doesn't seem to confuse gcc. I think this describes the new algorithm.. but am unsure. http://home.in.tum.de/~baueran/thesis/ it is interesting that they do it to support ghc, but ghc uses explicit continuations so compiler based tail calls arn't as important?
furthermore gotos and labels are very problematic for gcc to optimize around. for various tiresome reasons gcc cannot perform (most) code motion optimizations across explicit labels and gotos, especially when they deal with the global register variables and memory stores and loads. ...
there are a couple of things we can do to mitigate these problems:
get rid of indirect jumps whenever possible.
use C control constructs rather than gotos.
"for" loop introduction would be especially nice, but a bit tricky in practice I fear (requiring a game of "spot the induction variable").
yeah, but do and while loops should be easy enough. just look for basic blocks that point back to somewhere above them and have a single other path out of them... and if could be used whenever there are exactly two code paths out of a given block.
A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself into a goto that gotos a label at the beginning of the function.
Or better yet, wrap the whole function in
do { } while (1);
and replace "JMP_" by "continue". This avoids the troubles with goto which John mentioned above. It made a difference for phc, at least. Of course, if you can introduce loops elsewhere you might get yourself into trouble with this solution.
* do simple pattern matthing on the basic blocks to recognize where C control constructs can be placed.
for instance turn
if (x) { goto y; } blah.. baz.. JMP_(foo)
into
if (x) { goto y; } else { blah.. baz.. JMP_(foo) }
extending the else to after the next jump or goto.
I'm surprised this actually helps, I must admit.
yeah, me too. but it seems to. I think while gcc is very good at compiling idiomatic C, when you start doing things tricky like using goto's it just shuts off a lot of its optimizations rather than figuring out how to work around them since it is not really vital for most any C programs.
* getting stack dereferences out of your loops.
I recommend caching stack references in C locals where possible, but it's tricky to get this right if loop bodies include embedded function calls. Last I checked this wasn't an issue for GHC, since function calls were CPS-converted and only tight call-free loops ended up in a single function anyway.
I have a sneaking suspicion gcc might be treating global register variables as if they were volatile. meaning it must treat them as if they could be modified by external code at any time. now, usually that doesn't matter for a register since it doesn't need to be loaded to and from memory. however it means that every memory dereference relative to it will have to be done anew each time. and since a whole lot of ghc's goings on is dereferencing things relative to the stack, this is a major issue. I think it actually might actually be better to just pass the global register variables as standard arguments to every (non-leaf) function, This would have a few advantages i think. on machines with a register passing arch (like x86_64) it will be effectivly the same as having a global register, except the compiler will be free to spill them to the stack like normal and not inhibit any normal optimizations. in addition the register would be available for reuse in leaf-functions automatically since they need not keep track of the stack. I think.
in order to get rid of the unessesary memory accesess, we need to either
1. fully convert it to use C control constructs, so gcc will do it for us. (code motion and loop invarient being inhibited again by gotos)
As I recall, the "right" solution here is to compute dominator trees, and coalesce functions which are only tail called from their dominator into a single function. Alas, I've forgotten where I saw this written up, but there are indeed papers on it. Because it takes a bunch of effort on the part of the implementor, it'd be nice to see if its benefits are quantified.
yeah, I think that is the right way to. I don't think it would be a whole lot of effort. it is a fairly localized optimization and these sort of algorithms are usually quite easy to express in haskell.
These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them.
Making sure things are marked "static" and occur in an appropriate dependency order helps a bit here. It might even be worth marking some stuff "inline" in the code generator, though that's shaky ground.
I actually considered making everything static and putting outwardly- visible functionality in an extern wrapper---effectively carrying worker-wrapper down to the C level.
This is what I do in jhc if I understand you. everything but main (and FFI exported functions) are static. gcc actually has a -funit-at-a-time now which will read in the entire source file and perform whole-module analysis and optimization, so there isn't a need to get the ordering of functions just right and it can be cleverer when it comes to mutually recursive functions.
some random notes:
the 3x-7x factor was tested on an i386, on x86_64 the margin is much much greater for reasons that are still unclear.
Does x86-64 use a register-based calling convention by default? If you compiled the i386 code using __regparm(2), would you see the same speed difference?
that is a good question. not sure. I will test it out. the x86-64 does use register passing. The books are available for free from AMD. they are a good read. John -- John Meacham - ⑆repetae.net⑆john⑈

On Wed, 2005-10-26 at 12:24 -0400, Jan-Willem Maessen wrote:
Does x86-64 use a register-based calling convention by default?
Yes. And the algorithm is NASTY :) See 3.2.3 of 'AMD64 ABI Draft 0.96 June 14 2005' (which can be downloaded from somewhere on the AMD website). For integers, rdi, rsi,rdx,rcx,r8,r9 are used (in that order), for floats, xmm0-xmm7 -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

* John Meacham:
loop:
if () goto loop;
is not equivalent to a do-while loop, loop invarients cannot be hoisted out of the above for instance (except in some cases... it is all quite tricky and we want gcc to have as much freedom as possible).
do-while loops are converted to this form by the compiler (as of version 4, use the -fdump-tree-* to see the IL), so the problems you are observing must be caused by something different.
use C control constructs rather than gotos.
With GCC version 4, this will have no effect because the gimplifier converts everything to goto-style anyway.

On Tue, 2005-11-01 at 17:30 +0100, Florian Weimer wrote:
use C control constructs rather than gotos.
With GCC version 4, this will have no effect because the gimplifier converts everything to goto-style anyway.
Felix generates C with gotos. The result is FASTER than native C using gcc 4.0 on x86_64. http://felix.sourceforge.net/current/speed/en_flx_perf_0005.html C code: 4: int Ack(int M, int N) { 5: if (M==0) return N +1; 6: else if(N==0) return Ack(M-1,1); 7: else return Ack(M-1, Ack(M,N-1)); 8: } 9: It is known gcc is correctly optimising tail calls here. Felix: 5: fun ack(x:int,y:int):int => 6: if x == 0 then y + 1 7: elif y == 0 then ack(x-1, 1) 8: else ack(x-1, ack(x, y-1)) 9: endif 10: ; Felix generated C(++) code -- compiled with same options: int FLX_REGPARM _i1860_f1301_ack( int _i1864_v1303_x, int _i1865_v1304_y) { _us2 _i1867_v1799_ack_mv_74; _us2 _i1868_v1821_ack_mv_84; start_1828:; _i1867_v1799_ack_mv_74 = _i1864_v1303_x==0 ; if(!(_i1867_v1799_ack_mv_74==1))goto _1797; return _i1865_v1304_y+1 ; _1797:; _i1868_v1821_ack_mv_84 = _i1865_v1304_y==0 ; if(!(_i1868_v1821_ack_mv_84==1))goto _1798; _i1865_v1304_y = 1; _i1864_v1303_x = _i1864_v1303_x-1 ; goto start_1828; _1798:; _i1865_v1304_y = _i1860_f1301_ack(_i1864_v1303_x, _i1865_v1304_y-1 ); _i1864_v1303_x = _i1864_v1303_x-1 ; goto start_1828; } [The FLX_REGPARM says __attribute__((regparm(3)) when gcc is the compiler for i386 .. but it has no effect on x86_64] AFAICS gcc 4.x generates much better code if you just use gotos everywhere instead of C control structures. I have no real idea why the Felix generated C is faster. Two guesses: (a) the two 'mv' variables declared at the top are optimised away, so the Felix version is only using 3 words of stack. (b) the "parallel assigment in tail calls optimisation" is saving one word on the stack (evaluating y before x saves a temporary across the non-tail recursion). but I don't really know. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

* > On Tue, 2005-11-01 at 17:30 +0100, Florian Weimer wrote:
use C control constructs rather than gotos.
With GCC version 4, this will have no effect because the gimplifier converts everything to goto-style anyway.
Felix generates C with gotos. The result is FASTER than native C using gcc 4.0 on x86_64.
Coincidence. 8-)
Felix generated C(++) code -- compiled with same options:
int FLX_REGPARM _i1860_f1301_ack( int _i1864_v1303_x, int _i1865_v1304_y) { _us2 _i1867_v1799_ack_mv_74; _us2 _i1868_v1821_ack_mv_84;
_us2 is unsigned, correct? BTW, you shouldn't generate identifiers with leading underscores because they are reserved for the implementation.
I have no real idea why the Felix generated C is faster. Two guesses:
(a) the two 'mv' variables declared at the top are optimised away, so the Felix version is only using 3 words of stack.
(b) the "parallel assigment in tail calls optimisation" is saving one word on the stack (evaluating y before x saves a temporary across the non-tail recursion).
Both variants do not use the stack. It seems that the goto-based version leads to different static branch prediction results, which happen to be favorable. If you want, try adding __builtin_expect to the if statement in both versions, and measure again.

On Tue, 2005-11-01 at 19:03 +0100, Florian Weimer wrote:
Felix generates C with gotos. The result is FASTER than native C using gcc 4.0 on x86_64.
Coincidence. 8-)
Possibly :)
Felix generated C(++) code -- compiled with same options:
int FLX_REGPARM _i1860_f1301_ack( int _i1864_v1303_x, int _i1865_v1304_y) { _us2 _i1867_v1799_ack_mv_74; _us2 _i1868_v1821_ack_mv_84;
_us2 is unsigned, correct?
No, actually it means 'unit sum', in this case the sum of two units (hence us2). Felix has a special representation for unit sums .. it uses an int. And of course, 1 + 1 = 2 is just another name for 'bool'. These two variables are 'mv's or 'match variables' -- the arguments of the two matches which are generated by the 'if then elif else endif' sugar. Match expressions are stuck into variables so they can be destructed by pattern matching (even though in this case there is none :)
BTW, you shouldn't generate identifiers with leading underscores because they are reserved for the implementation.
I AM the implementation :) Generated Identifiers start with underscores, so they don't clash with arbitrary C code.
I have no real idea why the Felix generated C is faster. Two guesses:
(a) the two 'mv' variables declared at the top are optimised away, so the Felix version is only using 3 words of stack.
(b) the "parallel assigment in tail calls optimisation" is saving one word on the stack (evaluating y before x saves a temporary across the non-tail recursion).
Both variants do not use the stack.
This code evaluates y first, then x: _i1865_v1304_y = _i1860_f1301_ack(_i1864_v1303_x, _i1865_v1304_y-1 ); _i1864_v1303_x = _i1864_v1303_x-1 ; If you change the order you get this: { int tmp = _i1864_v1303_x; _i1864_v1303_x = _i1864_v1303_x-1 ; _i1865_v1304_y = _i1860_f1301_ack(tmp, _i1865_v1304_y-1 ); } which as written uses one extra word on the stack for 'tmp' up until the closing } .. which means an extra word during the recursive call.
It seems that the goto-based version leads to different static branch prediction results, which happen to be favorable.
It has nothing to do with branch prediction. I know it is determined ENTIRELY by stack use. I rewrote the assembler version many ways to try to find out how to improve my compiler.. nothing made the slightest difference except stack usage. The performance relativities of various generated codes depend only on the difference in stack pointers at the same place in the code, as a function of recursion depth. This is obvious when you think about it: the CPU operations (add, subtract, branch, etc) are orders of magnitude faster than memory accesses -- including cache accesses -- and the calculations in ackermann are trivial (add 1, subtract 1.. :) Accessing the cache costs. Accessing RAM costs even more. And I even drove this function so hard on my old i386 I filled all of RAM and pushed it into disk paging. hehe .. just before publishing the latest Felix I had a look at the performance graph and .. shock horror .. gcc and ocaml were creaming it. I had accidentally disabled the optimisation which replaced the creation of an C++ class object with a C function. The latter not only has no 'this' pointer, it also takes arguments 'one by one' instead of a reference to a tuple which is unpacked in the apply method of the class .. thats a lot of extra words on the stack. The graph you see, of course, has that little problem fixed :) Ackermann eats memory .. and doesn't do any other serious work. That's the point of this function as a microbenchmark: everything depends on efficient stack usage (and nothing else). I actually know how to optimise this function MUCH more: all the way down to ONE word of stack. The first optimisation eliminates x by unrolling -- x starts at 3 and is never incremented (in the Shootout test, initial x is fixed at 3). That eliminates one word. The second optimisation is secret <g> but also eliminates one word at the cost of one register. I guess the bottom line is: CPU performance is hard to predict, and what optimisations gcc will do are also hard to predict. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

On Wed, 2 Nov 2005, skaller wrote:
On Tue, 2005-11-01 at 19:03 +0100, Florian Weimer wrote:
BTW, you shouldn't generate identifiers with leading underscores because they are reserved for the implementation.
I AM the implementation :)
You are not the C implementation.
Generated Identifiers start with underscores, so they don't clash with arbitrary C code.
You should prefix them with something else, e.g. felix_.
Tony.
--
f.a.n.finch

It seems that the goto-based version leads to different static branch prediction results, which happen to be favorable.
It has nothing to do with branch prediction. I know it is determined ENTIRELY by stack use.
In both cases, The C compiler emits code which doesn't use the stack.

On Wed, 2005-11-02 at 19:47 +0100, Florian Weimer wrote:
It seems that the goto-based version leads to different static branch prediction results, which happen to be favorable.
It has nothing to do with branch prediction. I know it is determined ENTIRELY by stack use.
In both cases, The C compiler emits code which doesn't use the stack.
huh? how can a recursive call not use the stack?? -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net
participants (5)
-
Florian Weimer
-
Jan-Willem Maessen
-
John Meacham
-
skaller
-
Tony Finch