
On Thu, Oct 27, 2005 at 10:38:20AM +0100, Simon Marlow wrote:
gcc started generating this rubbish around version 3.4, if I recall correctly. I've tried disabling various optimisations, but can't seem to convince gcc not to generate the extra jump. You don't get this from the native code generator, BTW.
Actually, our NCG beats gcc in most cases on x86_64, I believe.
This doesn't surprise me. gcc seems to completely drop the ball on GHC-generated C.
Sounds like a good idea. In fact, we should do all this as a C-- -> C-- optimisation, as Simon PJ pointed out (and I mentioned on IRC yesterday).
Sounds good to me :) That should help all the back ends.
I don't actually understand why this helps, but it's easy enough to do.
I think it is just that many of the optimizations gcc does expect idiomatic C, and they are simply disabled when presented with 'odd' C. I believe the use of global registers also inhibits some optimizations. Even with -O3 it doesn't recognize that the indirect jump is unneeded; if they think indirect jumps are rare enough not to be worth optimizing, I wonder what else they don't bother to optimize in their presence.
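For illustration, a rough sketch of the style of C in question; the names and register choices here are invented for the example, not GHC's actual ones:

    /* pin the simulated machine registers to real registers, globally */
    register long *Sp asm ("r13");     /* simulated stack pointer    */
    register long  R1 asm ("rbx");     /* simulated argument/result  */

    typedef void *(*StgFun)(void);

    void *g_entry(void);               /* the "next" code block */

    void *f_entry(void)
    {
        R1 = Sp[0];                    /* all work goes through the pinned registers */
        Sp = Sp + 1;
        return (void *)g_entry;        /* a tail call is returned as a code pointer... */
    }

    void run(StgFun f)                 /* ...and made via an indirect jump here, which
                                          gcc largely declines to optimize around     */
    {
        while (f != 0)
            f = (StgFun)f();
    }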
* getting stack dereferences out of your loops.
A bit more tricky, but could still be done as C-- to C--.
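A contrived before/after sketch of what hoisting the stack dereferences would buy (the stack layout and names are made up for the example):

    /* before: every iteration goes back through the simulated stack */
    void loop_before(long *Sp)
    {
        while (Sp[1] != 0) {
            Sp[2] = Sp[2] + Sp[1];
            Sp[1] = Sp[1] - 1;
        }
    }

    /* after: the stack slots are cached in locals across the loop,
       so gcc can keep them in machine registers */
    void loop_after(long *Sp)
    {
        long n   = Sp[1];
        long acc = Sp[2];
        while (n != 0) {
            acc = acc + n;
            n   = n - 1;
        }
        Sp[1] = n;
        Sp[2] = acc;
    }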
Note that GHC's back end is really aimed at producing good code when there are registers available for passing arguments - this isn't true on x86 or x86_64 at the moment, though.
Hrm? Why are registers not available on x86_64? I thought it had a plethora (compared to the i386).
These should be straightforward to implement in the C code generator. It also suggests we might want to try using the native C calling convention on leaf functions that deal only with unboxed values (so we get register passing and return values for free), or turning groups of mutually recursive functions into one function with explicit jumps between them, as sketched below.
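A minimal sketch of the second idea, with functions invented purely for illustration:

    /* two mutually recursive leaf functions... */
    int is_odd(unsigned n);
    int is_even(unsigned n) { return n == 0 ? 1 : is_odd(n - 1); }
    int is_odd (unsigned n) { return n == 0 ? 0 : is_even(n - 1); }

    /* ...merged into a single C function with explicit jumps, so the calls
       between them disappear and gcc can keep n in a register throughout */
    int is_even_merged(unsigned n)
    {
    even:
        if (n == 0) return 1;
        n = n - 1;
        goto odd;
    odd:
        if (n == 0) return 0;
        n = n - 1;
        goto even;
    }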
Using the real C calling convention brings some problems - we'd have to use setjmp/longjmp to pass control back to the scheduler or garbage collector.
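Presumably something along these lines; a rough sketch, where run_thread and handle_event stand in for whatever the RTS actually provides:

    #include <setjmp.h>

    extern void run_thread(void);      /* hypothetical: enter compiled code      */
    extern void handle_event(int why); /* hypothetical: garbage collect, etc.    */

    static jmp_buf sched_jmp;

    /* compiled code calls this (with a non-zero reason) instead of
       returning normally when it needs the scheduler or the GC */
    void yield_to_scheduler(int reason)
    {
        longjmp(sched_jmp, reason);
    }

    void scheduler(void)
    {
        for (;;) {
            int why = setjmp(sched_jmp);
            if (why == 0)
                run_thread();          /* direct return from setjmp: run some code */
            else
                handle_event(why);     /* came back via longjmp: handle the event  */
        }
    }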
I was thinking of something like the worker/wrapper split: GHC would recognize when a function takes only unboxed arguments and returns an unboxed result (these conditions can probably be relaxed; no evals is the key thing). So in the case of fac, it would create

    int fac(int n, int r) {
        if (n == 1) return r;
        return fac(n - 1, n * r);
    }

and (something like)

    void fac_wrapper(void) {
        continuation = pop();   /* I might be mixing up the order of these */
        n = pop();
        r = pop();
        x = fac(n, r);
        push(x);
        jump(continuation);
    }

fac_wrapper will be used when fac is passed as a higher-order function, but direct calls to fac (such as in a case scrutinee) will call fac directly. I am not sure how much sense this makes, though; I am no expert on the spineless tagless G-machine (which would make an excellent name for a band, BTW).

Perusing the assembly, I think I see another place for improvement when it comes to updates. The issue involves the way the stack is unwound: a bunch of recursive calls build up, then the functions return, unwinding the stack and hitting updates. Now, what normally happens on a memory write is that the cache line is read in full, the change is made, and the line is marked 'modified', so when it is eventually evicted it gets written back to RAM rather than being discarded. In the case of updating thunks this is bad for a number of reasons. Unwinding the stack is going to touch a lot of thunks, each one evicting a needed cache line. This is particularly bad because the thunk most likely will not need to be read again any time soon: if a function needed the value again, it would have kept a pointer to the WHNF around. Moreover, chances are said thunk will _never_ be read again, since most thunks are not re-entered; and even if it is entered, chances are even higher that it will never be written again (except for indirection short-circuiting). So having a modified cache line taking up space is really bad.

Fortunately, modern CPUs anticipate this conundrum and provide 'write-combining' forms of their memory access instructions, which write a value directly to RAM without touching the cache at all. This should always be a win when updating thunks, for the reasons mentioned above, and is potentially a big benefit; selective write-combining is listed among the top performance-enhancing techniques in the CPU optimization manuals. I think the easiest way to do this would be to have a macro defined to an appropriate bit of assembly, or a simple C assignment if the write-combining movs aren't available.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈
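Something like this for the macro, perhaps; a sketch assuming SSE2's non-temporal store intrinsics are available, with the memory-ordering details (sfence) glossed over, and UPD_WRITE being an invented name:

    #include <stdint.h>

    #if defined(__SSE2__) && defined(__x86_64__)
    #include <emmintrin.h>
    /* non-temporal store: the word goes straight to memory, bypassing the cache */
    #define UPD_WRITE(p, v)  _mm_stream_si64((long long *)(p), (long long)(v))
    #else
    /* plain assignment when a write-combining mov isn't available */
    #define UPD_WRITE(p, v)  (*(uintptr_t *)(p) = (uintptr_t)(v))
    #endif

The update code would then store the new header and payload words of the thunk through UPD_WRITE instead of ordinary assignments.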