
On Wed, Oct 26, 2005 at 12:24:14PM -0400, Jan-Willem Maessen wrote:
Nice analysis. I indeed found with phc that shadow stack references absolutely killed performance, and I aggressively cached stack locations in locals, spilling to stack only when GC information needed to be accurate. [There was a giant infrastructure to save only live data to stack, but we won't go into that now as it was the source of almost all the codegen bugs...]
phc? there is another haskell compiler out there?
This makes a big difference. The phc compiler even put comments in the code so that I could figure out what came from where.
yeah, that is something I would like to add. Unfortunatly I wasn't forward thinking enough to put annotation points everywhere I might eventually need them so I will have to go through and do that at some point :)
I'm impressed that gcc found this. It's definitely living a bit dangerously, and your suggestions below for self tail call handling are the ones I found most effective. (They also allowed me to bypass some prologue garbage, since phc used a one-C-function-per-Haskell- function model with internal resumption points.) Non-self tail calls I was careful to compile to: return f(...); I expect from the above that gcc does better at spotting tail calls now.
indeed. it is actually quite good at spotting tail calls, I thought it would be an issue but it has caught the important ones in generated code, and even done invarient hoisting out of the loop! it is not evident from the x86-64 assembly I posted since stuff comes in registers to begin with, but in the i386 it sets up everything outside the main loop and has the same memory access free little inner loop. little things like storing stuff in temporary variables doesn't seem to confuse gcc. I think this describes the new algorithm.. but am unsure. http://home.in.tum.de/~baueran/thesis/ it is interesting that they do it to support ghc, but ghc uses explicit continuations so compiler based tail calls arn't as important?
furthermore gotos and labels are very problematic for gcc to optimize around. for various tiresome reasons gcc cannot perform (most) code motion optimizations across explicit labels and gotos, especially when they deal with the global register variables and memory stores and loads. ...
there are a couple of things we can do to mitigate these problems:
get rid of indirect jumps whenever possible.
use C control constructs rather than gotos.
"for" loop introduction would be especially nice, but a bit tricky in practice I fear (requiring a game of "spot the induction variable").
yeah, but do and while loops should be easy enough. just look for basic blocks that point back to somewhere above them and have a single other path out of them... and if could be used whenever there are exactly two code paths out of a given block.
A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself into a goto that gotos a label at the beginning of the function.
Or better yet, wrap the whole function in
do { } while (1);
and replace "JMP_" by "continue". This avoids the troubles with goto which John mentioned above. It made a difference for phc, at least. Of course, if you can introduce loops elsewhere you might get yourself into trouble with this solution.
* do simple pattern matthing on the basic blocks to recognize where C control constructs can be placed.
for instance turn
if (x) { goto y; } blah.. baz.. JMP_(foo)
into
if (x) { goto y; } else { blah.. baz.. JMP_(foo) }
extending the else to after the next jump or goto.
I'm surprised this actually helps, I must admit.
yeah, me too. but it seems to. I think while gcc is very good at compiling idiomatic C, when you start doing things tricky like using goto's it just shuts off a lot of its optimizations rather than figuring out how to work around them since it is not really vital for most any C programs.
* getting stack dereferences out of your loops.
I recommend caching stack references in C locals where possible, but it's tricky to get this right if loop bodies include embedded function calls. Last I checked this wasn't an issue for GHC, since function calls were CPS-converted and only tight call-free loops ended up in a single function anyway.
I have a sneaking suspicion gcc might be treating global register variables as if they were volatile. meaning it must treat them as if they could be modified by external code at any time. now, usually that doesn't matter for a register since it doesn't need to be loaded to and from memory. however it means that every memory dereference relative to it will have to be done anew each time. and since a whole lot of ghc's goings on is dereferencing things relative to the stack, this is a major issue. I think it actually might actually be better to just pass the global register variables as standard arguments to every (non-leaf) function, This would have a few advantages i think. on machines with a register passing arch (like x86_64) it will be effectivly the same as having a global register, except the compiler will be free to spill them to the stack like normal and not inhibit any normal optimizations. in addition the register would be available for reuse in leaf-functions automatically since they need not keep track of the stack. I think.
in order to get rid of the unessesary memory accesess, we need to either
1. fully convert it to use C control constructs, so gcc will do it for us. (code motion and loop invarient being inhibited again by gotos)
As I recall, the "right" solution here is to compute dominator trees, and coalesce functions which are only tail called from their dominator into a single function. Alas, I've forgotten where I saw this written up, but there are indeed papers on it. Because it takes a bunch of effort on the part of the implementor, it'd be nice to see if its benefits are quantified.
yeah, I think that is the right way to. I don't think it would be a whole lot of effort. it is a fairly localized optimization and these sort of algorithms are usually quite easy to express in haskell.
These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them.
Making sure things are marked "static" and occur in an appropriate dependency order helps a bit here. It might even be worth marking some stuff "inline" in the code generator, though that's shaky ground.
I actually considered making everything static and putting outwardly- visible functionality in an extern wrapper---effectively carrying worker-wrapper down to the C level.
This is what I do in jhc if I understand you. everything but main (and FFI exported functions) are static. gcc actually has a -funit-at-a-time now which will read in the entire source file and perform whole-module analysis and optimization, so there isn't a need to get the ordering of functions just right and it can be cleverer when it comes to mutually recursive functions.
some random notes:
the 3x-7x factor was tested on an i386, on x86_64 the margin is much much greater for reasons that are still unclear.
Does x86-64 use a register-based calling convention by default? If you compiled the i386 code using __regparm(2), would you see the same speed difference?
that is a good question. not sure. I will test it out. the x86-64 does use register passing. The books are available for free from AMD. they are a good read. John -- John Meacham - ⑆repetae.net⑆john⑈