RE: jhc vs ghc and the surprising result involving ghc generated assembly.

On 27 October 2005 12:12, John Meacham wrote:
Note that GHC's back end is really aimed at producing good code when there are registers available for passing arguments - this isn't true on x86 or x86_64 at the moment, though.
Hrm? why are registers not available on x86_64? I thought it had a plethora. (compared to the i386)
mutter mutter... a bunch of the registers are reserved for argument passing in the C calling convention, and when I tried to steal them I ran into trouble around foreign calls. It should/might be possible to work around this, I need to have another go. It works fine with the NCG, of course.
I was thinking something like the worker/wrapper split, ghc would recognize when a function takes only unboxed arguments and returns an unboxed result (these can probably be relaxed, no evals is the key thing)
so in the case of fac, it would create
int fac(int n, int r) {
        if (n == 1)
                return r;
        return fac(n - 1, n * r);
}
and (something like)
void fac_wrapper(void) {
        continuation = pop() // I might be mixing up the order of these
        n = pop()
        r = pop()
        x = fac(n,r)
        push(x)
        jump(continuation)
}
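For readers who want to try the idea, here is a minimal, self-contained sketch of the worker/wrapper split described above. It is not GHC's actual generated code: the stack, the `code` type, `halt`, `run_fac` and the trampoline loop are all hypothetical stand-ins. In particular, jump(continuation) is modelled by returning the next code pointer to a driver loop, and the worker returns the accumulator r in its base case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical code pointer: a function that returns the next code to run. */
typedef struct code { struct code (*fn)(void); } code;

/* Hypothetical stand-in for the STG stack: a slot is an unboxed int
   or a code pointer. */
typedef union { int i; code k; } slot;

static slot stack[64];
static int sp = 0;

static void push_int(int v)   { assert(sp < 64); stack[sp++].i = v; }
static void push_code(code k) { assert(sp < 64); stack[sp++].k = k; }
static int  pop_int(void)     { assert(sp > 0);  return stack[--sp].i; }
static code pop_code(void)    { assert(sp > 0);  return stack[--sp].k; }

/* Worker: ordinary C calling convention, unboxed arguments and result. */
static int fac(int n, int r) {
    if (n <= 1)
        return r;
    return fac(n - 1, n * r);
}

/* Wrapper: moves arguments from the explicit stack to the C call and back. */
static code fac_wrapper(void) {
    code continuation = pop_code();
    int n = pop_int();
    int r = pop_int();
    push_int(fac(n, r));
    return continuation;        /* the "jump": the driver loop calls it next */
}

/* Final continuation: tells the driver loop to stop. */
static code halt(void) {
    code stop = { NULL };
    return stop;
}

/* Driver: sets up the stack as the wrapper expects, then trampolines. */
static int run_fac(int n) {
    code k = { halt };
    code next = { fac_wrapper };
    push_int(1);                /* r, popped last */
    push_int(n);                /* n */
    push_code(k);               /* continuation, popped first */
    while (next.fn != NULL)
        next = next.fn();
    return pop_int();
}
```

Calling run_fac(5) pushes the arguments, runs the wrapper once, and leaves the result on the explicit stack for the continuation to pick up.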
Well yes, but if the worker needs to return to the scheduler (i.e. if it does a heap check or stack check) then the C stack is all messed up and we need a setjmp/longjmp to get back to the scheduler. You can do it in the case where there are no heap/stack checks, but I think that's very rare.
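To make the objection concrete, here is a small sketch of what "setjmp/longjmp to get back to the scheduler" could look like. Everything in it is hypothetical (`scheduler_ctx`, `heap_words_left`, `fac_alloc`, `try_fac`); it only illustrates that a failed heap check deep inside the worker has to abandon every C frame above it, which is why the plain C calling convention is awkward here.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf scheduler_ctx;   /* hypothetical: where the scheduler resumes */
static int heap_words_left;     /* hypothetical allocation budget */

/* Worker that "allocates" one word per call. When the budget is exhausted
   it cannot simply return, so it unwinds straight to the scheduler. */
static int fac_alloc(int n, int r) {
    if (heap_words_left == 0)
        longjmp(scheduler_ctx, 1);      /* drops all pending C frames */
    heap_words_left--;
    if (n <= 1)
        return r;
    return fac_alloc(n - 1, n * r);
}

/* Scheduler side: returns the result, or -1 if the worker bailed out on a
   failed heap check. The worker's partial state is lost in that case. */
static int try_fac(int n, int heap_words) {
    assert(heap_words >= 0);
    heap_words_left = heap_words;
    if (setjmp(scheduler_ctx) != 0)
        return -1;                      /* arrived here via longjmp */
    return fac_alloc(n, 1);
}
```

With enough budget, try_fac(5, 10) completes normally; with try_fac(5, 3) the worker runs out of "heap" partway down and the scheduler regains control with nothing to resume from.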
I am not sure how much sense this makes, though. I am no expert on the Spineless Tagless G-machine (which would make an excellent name for a band, BTW)
:-D
fortunately, modern CPUs anticipate this conundrum and provide 'write-combining' forms of their memory access instructions; these will write a value directly to RAM without touching the cache at all. This will always be a win when updating thunks due to the reasons mentioned above and is potentially a big benefit. selective write-combining is in the top 3 performance-enhancing things according to the CPU optimization manuals.
I think the easiest way to do this would be to have a macro defined to an appropriate bit of assembly, or to a simple C assignment if the write-combining movs aren't available.
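A rough sketch of such a macro follows. It assumes the stores meant above are the SSE2 non-temporal store MOVNTI, reached here through the `_mm_stream_si32` intrinsic rather than hand-written assembly. The names `UPDATE_WORD_NT`, `UPDATE_FENCE`, `thunk_t`, `update_thunk` and `update_and_read` are made up for illustration and are not part of GHC's runtime; whether this is actually a win would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
/* MOVNTI: 32-bit store with a non-temporal (cache-bypassing) hint. */
#define UPDATE_WORD_NT(p, v) _mm_stream_si32((int *)(p), (int)(v))
/* SFENCE: makes earlier non-temporal stores globally visible. */
#define UPDATE_FENCE()       _mm_sfence()
#else
/* Fallback: a simple C assignment when the instruction is unavailable. */
#define UPDATE_WORD_NT(p, v) (*(p) = (v))
#define UPDATE_FENCE()       ((void)0)
#endif

/* Hypothetical, much simplified thunk: a flag plus an unboxed result. */
typedef struct {
    int32_t evaluated;
    int32_t value;
} thunk_t;

/* Overwrite the thunk with its result without pulling it into the cache. */
static void update_thunk(thunk_t *t, int32_t result) {
    assert(t != NULL);
    UPDATE_WORD_NT(&t->value, result);
    UPDATE_FENCE();             /* payload visible before the flag is set */
    t->evaluated = 1;
}

/* Small helper for trying it out: update a fresh thunk and read it back. */
static int32_t update_and_read(int32_t result) {
    thunk_t t = { 0, 0 };
    update_thunk(&t, result);
    return t.evaluated ? t.value : -1;
}
```

Keeping the choice behind one macro means the call sites stay the same whether or not the target supports the instruction.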
very good idea, I must try that. Any more progress on why our x86_64 code is slow?

Cheers,
Simon