Re: jhc vs ghc and the surprising result involving ghc generated assembly.

26 Oct 2005

      On Wed, Oct 26, 2005 at 12:24:14PM -0400, Jan-Willem Maessen wrote:
...
Nice analysis.  I indeed found with phc that shadow stack references  
absolutely killed performance, and I aggressively cached stack  
locations in locals, spilling to stack only when GC information  
needed to be accurate.  [There was a giant infrastructure to save  
only live data to stack, but we won't go into that now as it was the  
source of almost all the codegen bugs...]
phc? there is another haskell compiler out there?
...
This makes a big difference.  The phc compiler even put comments in  
the code so that I could figure out what came from where.
yeah, that is something I would like to add. Unfortunatly I wasn't
forward thinking enough to put annotation points everywhere I might
eventually need them so I will have to go through and do that at some
point :)
...
I'm impressed that gcc found this.  It's definitely living a bit  
dangerously, and your suggestions below for self tail call handling  
are the ones I found most effective.  (They also allowed me to bypass  
some prologue garbage, since phc used a one-C-function-per-Haskell- 
function model with internal resumption points.)  Non-self tail calls  
I was careful to compile to:
  return f(...);
I expect from the above that gcc does better at spotting tail calls now.
indeed. it is actually quite good at spotting tail calls, I thought it
would be an issue but it has caught the important ones in generated
code, and even done invarient hoisting out of the loop! it is not
evident from the x86-64 assembly I posted since stuff comes in
registers to begin with, but in the i386 it sets up everything outside
the main loop and has the same memory access free little inner loop.

little things like storing stuff in temporary variables doesn't seem to
confuse gcc.

I think this describes the new algorithm.. but am unsure.
 http://home.in.tum.de/~baueran/thesis/

it is interesting that they do it to support ghc, but ghc uses explicit
continuations so compiler based tail calls arn't as important?
...
...
furthermore gotos and labels are very problematic for gcc to  
optimize around.
for various tiresome reasons gcc cannot perform (most) code motion
optimizations across explicit labels and gotos, especially when  
they deal with
the global register variables and memory stores and loads. ...
there are a couple of things we can do to mitigate these problems:
get rid of indirect jumps whenever possible.
use C control constructs rather than gotos.
"for" loop introduction would be especially nice, but a bit tricky in  
practice I fear (requiring a game of "spot the induction variable").
yeah, but do and while loops should be easy enough. just look for basic
blocks that point back to somewhere above them and have a single other
path out of them... and if could be used whenever there are exactly two
code paths out of a given block.
...
...
A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself  
into a goto
that gotos a label at the beginning of the function.
Or better yet, wrap the whole function in
do {
} while (1);
and replace "JMP_" by "continue".  This avoids the troubles with goto  
which John mentioned above.  It made a difference for phc, at least.   
Of course, if you can introduce loops elsewhere you might get  
yourself into trouble with this solution.
...
* do simple pattern matthing on the basic blocks to recognize where  
C control
constructs can be placed.
for instance turn
if (x) { goto  y; }
blah..
baz..
JMP_(foo)
into
if (x) { goto  y; } else {
blah..
baz..
JMP_(foo)
}
extending the else to after the next jump or goto.
I'm surprised this actually helps, I must admit.
yeah, me too. but it seems to. I think while gcc is very good at
compiling idiomatic C, when you start doing things tricky like using
goto's it just shuts off a lot of its optimizations rather than
figuring out how to work around them since it is not really vital for
most any C programs.
...
...
* getting stack dereferences out of your loops.
I recommend caching stack references in C locals where possible, but  
it's tricky to get this right if loop bodies include embedded  
function calls.  Last I checked this wasn't an issue for GHC, since  
function calls were CPS-converted and only tight call-free loops  
ended up in a single function anyway.
I have a sneaking suspicion gcc might be treating global register
variables as if they were volatile. meaning it must treat them as if
they could be modified by external code at any time. now, usually that
doesn't matter for a register since it doesn't need to be loaded to and
from memory. however it means that every memory dereference relative to
it will have to be done anew each time. and since a whole lot of ghc's
goings on is dereferencing things relative to the stack, this is a
major issue. 

I think it actually might actually be better to just pass the global
register variables as standard arguments to every (non-leaf) function,

This would have a few advantages i think. on machines with a register
passing arch (like x86_64) it will be effectivly the same as having a
global register, except the compiler will be free to spill them to the
stack like normal and not inhibit any normal optimizations. in addition
the register would be available for reuse in leaf-functions
automatically since they need not keep track of the stack. I think.
...
...
in order to get rid of the unessesary memory accesess, we need to  
either
1. fully convert it to use C control constructs, so gcc will do it  
for us.
(code motion and loop invarient being inhibited again by gotos)
As I recall, the "right" solution here is to compute dominator trees,  
and coalesce functions which are only tail called from their  
dominator into a single function.  Alas, I've forgotten where I saw  
this written up, but there are indeed papers on it.  Because it takes  
a bunch of effort on the part of the implementor, it'd be nice to see  
if its benefits are quantified.
yeah, I think that is the right way to. I don't think it would be a
whole lot of effort. it is a fairly localized optimization and these
sort of algorithms are usually quite easy to express in haskell.
...
...
These should be straightforward to implement in the C code  
generator. it also
suggests we might want to try to use the native C calling  
convention on leaf
nodes that deal with unboxed values (so we get register passing and  
return
values for free) or taking groups of mutually recursive functions  
and turning
them all into one function with explicit jumps between them.
Making sure things are marked "static" and occur in an appropriate  
dependency order helps a bit here.  It might even be worth marking  
some stuff "inline" in the code generator, though that's shaky ground.
I actually considered making everything static and putting outwardly- 
visible functionality in an extern wrapper---effectively carrying  
worker-wrapper down to the C level.
This is what I do in jhc if I understand you. everything but main (and
FFI exported functions) are static. 

gcc actually has a -funit-at-a-time now which will read in the entire
source file and perform whole-module analysis and optimization, so there
isn't a need to get the ordering of functions just right and it can be
cleverer when it comes to mutually recursive functions.
...
...
some random notes:
the 3x-7x factor was tested on an i386, on x86_64 the margin is  
much much
greater for reasons that are still unclear.
Does x86-64 use a register-based calling convention by default?  If  
you compiled the i386 code using __regparm(2), would you see the same  
speed difference?
that is a good question. not sure. I will test it out. the x86-64 does
use register passing. The books are available for free from AMD. they
are a good read.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈