
On Thu, Oct 27, 2005 at 10:38:20AM +0100, Simon Marlow wrote:
gcc started generating this rubbish around version 3.4, if I recall correctly. I've tried disabling various optimisations, but can't seem to convince gcc not to generate the extra jump. You don't get this from the native code generator, BTW.
Actually, our NCG beats gcc in most cases on x86_64, I believe.
This doesn't surprise me. gcc seems to completely drop the ball on GHC-generated C.
Sounds like a good idea. In fact, we should do all this as a C-- -> C-- optimisation, as Simon PJ pointed out (and I mentioned on IRC yesterday).
Sounds good to me :) That should help all the back ends.
I don't actually understand why this helps, but it's easy enough to do.
I think it is just that many of the optimizations gcc does expect idiomatic C, and they are simply disabled when presented with 'odd' C. I believe the use of global registers also inhibits some optimizations. Even with -O3 it doesn't recognize that the indirect jump is unneeded; if they think indirect jumps are rare enough not to be worth optimizing, I wonder what else they don't bother to optimize in their presence.
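For illustration, a rough sketch of the style of C in question; the names and register choices here are invented for the example, not GHC's actual ones:

    /* pin the simulated machine registers to real registers, globally */
    register long *Sp asm ("r13");     /* simulated stack pointer    */
    register long  R1 asm ("rbx");     /* simulated argument/result  */

    typedef void *(*StgFun)(void);

    void *g_entry(void);               /* the "next" code block */

    void *f_entry(void)
    {
        R1 = Sp[0];                    /* all work goes through the pinned registers */
        Sp = Sp + 1;
        return (void *)g_entry;        /* a tail call is returned as a code pointer... */
    }

    void run(StgFun f)                 /* ...and made via an indirect jump here, which
                                          gcc largely declines to optimize around     */
    {
        while (f != 0)
            f = (StgFun)f();
    }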
* getting stack dereferences out of your loops.
A bit more tricky, but could still be done as C-- to C--.
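A contrived before/after sketch of what hoisting the stack dereferences would buy (the stack layout and names are made up for the example):

    /* before: every iteration goes back through the simulated stack */
    void loop_before(long *Sp)
    {
        while (Sp[1] != 0) {
            Sp[2] = Sp[2] + Sp[1];
            Sp[1] = Sp[1] - 1;
        }
    }

    /* after: the stack slots are cached in locals across the loop,
       so gcc can keep them in machine registers */
    void loop_after(long *Sp)
    {
        long n   = Sp[1];
        long acc = Sp[2];
        while (n != 0) {
            acc = acc + n;
            n   = n - 1;
        }
        Sp[1] = n;
        Sp[2] = acc;
    }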
Note that GHC's back end is really aimed at producing good code when there are registers available for passing arguments - this isn't true on x86 or x86_64 at the moment, though.
Hrm? Why are registers not available on x86_64? I thought it had a plethora (compared to the i386).
These should be straightforward to implement in the C code generator. It also suggests we might want to try using the native C calling convention on leaf functions that deal only with unboxed values (so we get register passing and return values for free), or turning groups of mutually recursive functions into one function with explicit jumps between them, as sketched below.
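A minimal sketch of the second idea, with functions invented purely for illustration:

    /* two mutually recursive leaf functions... */
    int is_odd(unsigned n);
    int is_even(unsigned n) { return n == 0 ? 1 : is_odd(n - 1); }
    int is_odd (unsigned n) { return n == 0 ? 0 : is_even(n - 1); }

    /* ...merged into a single C function with explicit jumps, so the calls
       between them disappear and gcc can keep n in a register throughout */
    int is_even_merged(unsigned n)
    {
    even:
        if (n == 0) return 1;
        n = n - 1;
        goto odd;
    odd:
        if (n == 0) return 0;
        n = n - 1;
        goto even;
    }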
Using the real C calling convention brings some problems - we'd have to use setjmp/longjmp to pass control back to the scheduler or garbage collector.
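Presumably something along these lines; a rough sketch, where run_thread and handle_event stand in for whatever the RTS actually provides:

    #include <setjmp.h>

    extern void run_thread(void);      /* hypothetical: enter compiled code      */
    extern void handle_event(int why); /* hypothetical: garbage collect, etc.    */

    static jmp_buf sched_jmp;

    /* compiled code calls this (with a non-zero reason) instead of
       returning normally when it needs the scheduler or the GC */
    void yield_to_scheduler(int reason)
    {
        longjmp(sched_jmp, reason);
    }

    void scheduler(void)
    {
        for (;;) {
            int why = setjmp(sched_jmp);
            if (why == 0)
                run_thread();          /* direct return from setjmp: run some code */
            else
                handle_event(why);     /* came back via longjmp: handle the event  */
        }
    }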
I was thinking of something like the worker/wrapper split: GHC would recognize when a function takes only unboxed arguments and returns an unboxed result (these conditions can probably be relaxed; no evals is the key thing). So in the case of fac, it would create

    int fac(int n, int r) {
        if (n == 1) return r;
        return fac(n - 1, n * r);
    }

and (something like)

    void fac_wrapper(void) {
        continuation = pop();   /* I might be mixing up the order of these */
        n = pop();
        r = pop();
        x = fac(n, r);
        push(x);
        jump(continuation);
    }

fac_wrapper will be used when fac is passed as a higher-order function, but direct calls to fac (such as in a case scrutinee) will call fac directly. I am not sure how much sense this makes, though; I am no expert on the spineless tagless G-machine (which would make an excellent name for a band, BTW).

Perusing the assembly, I think I see another place for improvement when it comes to updates. The issue involves the way the stack is unwound: a bunch of recursive calls build up, then the functions return, unwinding the stack and hitting updates. Now, what normally happens on a memory write is that the cache line is read in full, the change is made, and the line is marked 'modified', so when it is eventually evicted it gets written back to RAM rather than being discarded. In the case of updating thunks this is bad for a number of reasons. Unwinding the stack is going to touch a lot of thunks, each one evicting a needed cache line. This is particularly bad because the thunk most likely will not need to be read again any time soon: if a function needed the value again, it would have kept a pointer to the WHNF around. Moreover, chances are said thunk will _never_ be read again, since most thunks are not re-entered; and even if it is entered, chances are even higher that it will never be written again (except for indirection short-circuiting). So having a modified cache line taking up space is really bad.

Fortunately, modern CPUs anticipate this conundrum and provide 'write-combining' forms of their memory access instructions, which write a value directly to RAM without touching the cache at all. This should always be a win when updating thunks, for the reasons mentioned above, and is potentially a big benefit; selective write-combining is listed among the top performance-enhancing techniques in the CPU optimization manuals. I think the easiest way to do this would be to have a macro defined to an appropriate bit of assembly, or a simple C assignment if the write-combining movs aren't available.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈
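Something like this for the macro, perhaps; a sketch assuming SSE2's non-temporal store intrinsics are available, with the memory-ordering details (sfence) glossed over, and UPD_WRITE being an invented name:

    #include <stdint.h>

    #if defined(__SSE2__) && defined(__x86_64__)
    #include <emmintrin.h>
    /* non-temporal store: the word goes straight to memory, bypassing the cache */
    #define UPD_WRITE(p, v)  _mm_stream_si64((long long *)(p), (long long)(v))
    #else
    /* plain assignment when a write-combining mov isn't available */
    #define UPD_WRITE(p, v)  (*(uintptr_t *)(p) = (uintptr_t)(v))
    #endif

The update code would then store the new header and payload words of the thunk through UPD_WRITE instead of ordinary assignments.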