RE: jhc vs ghc and the surprising result involving ghc generated assembly.

27 Oct 2005


On 27 October 2005 12:12, John Meacham wrote:
...
...
Note that GHC's back end is really aimed at producing good code when
there are registers available for passing arguments - this isn't
true on x86 or x86_64 at the moment, though.
Hrm? Why are registers not available on x86_64? I thought it had a
plethora (compared to the i386).
mutter mutter... a bunch of the registers are reserved for argument
passing in the C calling convention, and when I tried to steal them I
ran into trouble around foreign calls.  It should/might be possible to
work around this, I need to have another go.  It works fine with the
NCG, of course.
...
I was thinking of something like the worker/wrapper split: GHC would
recognize when a function takes only unboxed arguments and returns an
unboxed result (these conditions can probably be relaxed; no evals is
the key thing).
So in the case of fac, it would create
int fac(int n, int r) {
        if (n == 1) return r;
        return fac (n - 1,n*r);
}
and (something like)
void fac_wrapper(void) {
        continuation = pop()   // I might be mixing up the order of these
        n = pop()
        r = pop()
        x = fac(n,r)
        push(x)
        jump(continuation)
}
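
A minimal, compilable sketch of the worker/wrapper idea above, assuming a
simple array-backed stack. The names slot, push_int, push_code, pop_int,
pop_code and store_result, and the push order (r, then n, then the
continuation), are assumptions made for illustration and are not GHC's
actual calling convention. The jump is modelled as an ordinary call
through a function pointer.

```c
#include <assert.h>

/* Hypothetical explicit evaluation stack; size and layout are assumptions. */
enum { STACK_SIZE = 64 };

typedef void (*code_ptr)(void);
typedef union { int i; code_ptr code; } slot;

static slot stack[STACK_SIZE];
static int sp = 0;

static void push_int(int v)       { assert(sp < STACK_SIZE); stack[sp++].i = v; }
static void push_code(code_ptr c) { assert(sp < STACK_SIZE); stack[sp++].code = c; }
static int pop_int(void)          { assert(sp > 0); return stack[--sp].i; }
static code_ptr pop_code(void)    { assert(sp > 0); return stack[--sp].code; }

/* Worker: a plain C function with unboxed arguments and result, no evals.
 * r is the accumulator; n <= 1 is used so that n = 0 also terminates. */
static int fac(int n, int r) {
    if (n <= 1) return r;
    return fac(n - 1, n * r);
}

/* Wrapper: adapts the stack-based convention to the C worker. */
static void fac_wrapper(void) {
    code_ptr continuation = pop_code();
    int n = pop_int();
    int r = pop_int();
    push_int(fac(n, r));
    continuation();            /* stands in for jump(continuation) */
}

/* Hypothetical continuation that just records the result. */
static int last_result;
static void store_result(void) { last_result = pop_int(); }
```

Pushing 1, then 5, then store_result, and calling fac_wrapper() leaves
120 in last_result and the stack empty.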
Well yes, but if the worker needs to return to the scheduler (i.e. if it
does a heap check or stack check) then the C stack is all messed up and
we need a setjmp/longjmp to get back to the scheduler.  You can do it in
the case where there are no heap/stack checks, but I think that's very
rare.
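
To illustrate the problem, here is a hedged sketch of a worker that has
to bail out to the scheduler from inside nested C frames. The names
scheduler_ctx, heap_words_left, fac_checked and run_worker are
hypothetical, and the "heap check" is just a counter standing in for a
real limit check; this is not how the GHC runtime is written.

```c
#include <setjmp.h>

static jmp_buf scheduler_ctx;   /* hypothetical: where the scheduler resumes */
static int heap_words_left;     /* hypothetical stand-in for the heap limit  */

/* Worker with a check on every call. If the check fails, the only way
 * out of the nested C frames is a longjmp back to the scheduler. */
static int fac_checked(int n, int r) {
    if (heap_words_left-- <= 0)
        longjmp(scheduler_ctx, 1);   /* abandons every C frame above the setjmp */
    if (n <= 1) return r;
    return fac_checked(n - 1, n * r);
}

/* Returns 1 and stores the result if the worker finished,
 * or 0 if it had to return to the scheduler. */
static int run_worker(int n, int budget, int *result) {
    heap_words_left = budget;
    if (setjmp(scheduler_ctx) != 0)
        return 0;                    /* arrived here via longjmp */
    *result = fac_checked(n, 1);
    return 1;
}
```

Note that in this sketch the worker's progress is simply thrown away
when it bails out; a real runtime would also have to save enough state
to resume the computation, which is part of what makes the approach
awkward.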
...
I am not sure how much sense this makes, though. I am no expert on the
Spineless Tagless G-machine (which would make an excellent name for a
band, BTW).
:-D
...
Fortunately, modern CPUs anticipate this conundrum and provide
'write-combining' forms of their memory access instructions; these will
write a value directly to RAM without touching the cache at all. This
will always be a win when updating thunks due to the reasons mentioned
above and is potentially a big benefit. Selective write-combining is
in the top 3 performance-enhancing things according to the CPU
optimization manuals.
I think the easiest way to do this would be to have a macro defined as
an appropriate bit of assembly, or as a simple C assignment if the
write-combining movs aren't available.
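
One possible shape for such a macro, sketched under the assumption that
the instructions meant here are the SSE2 non-temporal stores (MOVNTI),
reached through the _mm_stream_si32 compiler intrinsic rather than
inline assembly. The names UPDATE_WORD32, UPDATE_FENCE and update_thunk
are hypothetical, and whether this actually helps thunk updates would
need measuring.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32; also pulls in _mm_sfence */
/* Non-temporal 32-bit store: asks the CPU to write around the cache. */
#define UPDATE_WORD32(p, v) _mm_stream_si32((int *)(p), (int)(v))
/* Orders the non-temporal store before later stores. */
#define UPDATE_FENCE()      _mm_sfence()
#else
/* Fallback: a simple C assignment when the instruction is unavailable. */
#define UPDATE_WORD32(p, v) (*(p) = (v))
#define UPDATE_FENCE()      ((void)0)
#endif

/* Hypothetical helper: overwrite one 32-bit word of a thunk with its value. */
static void update_thunk(int32_t *slot, int32_t value) {
    UPDATE_WORD32(slot, value);
    UPDATE_FENCE();
}
```

Either branch leaves the same value in memory, so callers do not need
to know which one was compiled in.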
Very good idea, I must try that.  Any more progress on why our x86_64
code is slow?

Cheers,
	Simon

Simon Marlow
