RE: jhc vs ghc and the surprising result involving ghc generatedassembly.

John, this is great stuff. There's clearly a lot we can do to improve GHC's code, at least for fac :-) (I'd be really interested in any numbers you have for larger programs too, eg. the nofib suite).
and now the generated assembly.
Main_zdwfac_info: .text .align 8 .text movq (%rbp), %rdx cmpq $1, %rdx jne .L2 movq 8(%rbp), %r13 leaq 16(%rbp), %rbp movq (%rbp), %rax .L4: jmp *%rax
gcc started generating this rubbish around version 3.4, if I recall correctly. I've tried disabling various optimisations, but can't seem to convince gcc not to generate the extra jump. You don't get this from the native code generator, BTW. Actually, our NCG beats gcc in most cases on x86_64, I believe.
.L2: movq %rdx, %rax imulq 8(%rbp), %rax movq %rax, 8(%rbp) leaq -1(%rdx), %rax movq %rax, (%rbp) movl $Main_zdwfac_info, %eax jmp .L4
[snip]
A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself into a goto that gotos a label at the beginning of the function.
Sounds like a good idea. In fact, we should do all this as a C-- -> C-- optimisation, as Simon PJ pointed out (and I mentioned on IRC yesterday). one small caveat: our register allocator can't handle loops at the moment, so any optimisations that generated actual loops would have to be disabled for the NCG for now.
* do simple pattern matthing on the basic blocks to recognize where C control constructs can be placed.
for instance turn
if (x) { goto y; } blah.. baz.. JMP_(foo)
into
if (x) { goto y; } else { blah.. baz.. JMP_(foo) }
I don't actually understand why this helps, but it's easy enough to do.
* getting stack dereferences out of your loops.
A bit more tricky, but could still be done as C-- to C--. Note that GHC's back end is really aimed at producing good code when there are registers available for passing arguments - this isn't true on x86 or x86_64 at the moment, though.
1. fully convert it to use C control constructs, so gcc will do it for us. (code motion and loop invarient being inhibited again by gotos)
Ok. Jan's do-while idea seems like a cheap way to achieve this.
These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them.
Using the real C calling convention brings some problems - we'd have to use setjmp/longjmp to pass control back to the scheduler or garbage collector. Cheers, Simon

On Thu, Oct 27, 2005 at 10:38:20AM +0100, Simon Marlow wrote:
gcc started generating this rubbish around version 3.4, if I recall correctly. I've tried disabling various optimisations, but can't seem to convince gcc not to generate the extra jump. You don't get this from the native code generator, BTW.
Actually, our NCG beats gcc in most cases on x86_64, I believe.
This doesn't surprise me. gcc seems to completly drop the ball on ghc generated C.
Sounds like a good idea. In fact, we should do all this as a C-- -> C-- optimisation, as Simon PJ pointed out (and I mentioned on IRC yesterday).
sounds good to me :) that should help all the back ends.
I don't actually understand why this helps, but it's easy enough to do.
I think it is just that many of the optimizations gcc do expect idiomatic C and are just disabled when presented with 'odd' C. I believe the use of global registers is also inhibiting some optimizations. even with -O3 it doesn't recognize that indirect jump is uneeded, if they think indirect jumps are rare enough to not optimize that I wonder what else they don't bother to optimize in their presence.
* getting stack dereferences out of your loops.
A bit more tricky, but could still be done as C-- to C--.
Note that GHC's back end is really aimed at producing good code when there are registers available for passing arguments - this isn't true on x86 or x86_64 at the moment, though.
Hrm? why are registers not available on x86_64? I thought it had a plethora. (compared to the i386)
These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them.
Using the real C calling convention brings some problems - we'd have to use setjmp/longjmp to pass control back to the scheduler or garbage collector.
I was thinking something like the worker/wrapper split, ghc would recognize when a function takes only unboxed arguments and returns an unboxed result (these can probably be relaxed, no evals is the key thing) so in the case of fac, it would create int fac(int n, int r) { if (n == 1) return 1; return fac (n - 1,n*r); } and (something like) void fac_wrapper(void) { continuation = pop() // I might be mixing up the order of these n = pop() r = pop() x = fac(n,r) push(x) jump(continuation) } fac_wrapper will used when it is passed as a higher order function, but direct calls to fac (such as in a case scrutinee) will call fac directly. I am not sure how much sense this makes though. I am no expert on the spineless tagless G machine (which would make an excellent name for a band BTW) Perusing the assembly I think I see another place for improvement when it comes to updates. the issue involves the way the stack is unwound, a bunch of recursive calls build up, the functions return unwinding the stack hitting updates. now, what happens in a memory write is that the cache line is fully read in, the change made, and the line is marked as 'modified' so when it is eventually evicted, it gets written to ram rather than being discarded. however, in the case of updating thunks, this is bad for a number of reasons, unwinding the stack is going to touch a lot of thunks, each one evicting a needed cache line from memory. This is particularly bad because the thunk most likely will not need to be read again any time soon. if a function needed the value again, it would have kept a pointer to the WHNF around, moreso, chances are said thunk will _never_ be read again since most thunks are not reentered. even if it is entered, chances are even higher that it will never be written to again (except for indirection short circuiting), so having a modified cache line taking up space is really bad. fortunatly, modern CPUs anticipate this conondrum and provide 'write-combining' forms of their memory access functions, these will write a value directly to RAM without touching the cache at all. This will always be a win when updating thunks due to the reasons mentioned above and is potentially a big benefit. selective write-combining is in the top 3 performance enhancing things according to the cpu optimization manuals. I think the easiest way to do this would be to have a MACRO defined to an appropriate bit of assembly or a simple C assignment if the write-combining mov's arn't available. John -- John Meacham - ⑆repetae.net⑆john⑈

On Thursday 27 October 2005 13:11, John Meacham wrote:
I am not sure how much sense this makes though. I am no expert on the spineless tagless G machine (which would make an excellent name for a band BTW)
:-))))) Ben

* Simon Marlow:
gcc started generating this rubbish around version 3.4, if I recall correctly. I've tried disabling various optimisations, but can't seem to convince gcc not to generate the extra jump. You don't get this from the native code generator, BTW.
But the comparison is present in the C code. What do you want GCC to do?

On Tue, Nov 01, 2005 at 05:32:29PM +0100, Florian Weimer wrote:
* Simon Marlow:
gcc started generating this rubbish around version 3.4, if I recall correctly. I've tried disabling various optimisations, but can't seem to convince gcc not to generate the extra jump. You don't get this from the native code generator, BTW.
But the comparison is present in the C code. What do you want GCC to do?
in addition to the expected conditional jump, it is adding another unconditional jump and an indirect jump to a constant. John -- John Meacham - ⑆repetae.net⑆john⑈
participants (4)
-
Benjamin Franksen
-
Florian Weimer
-
John Meacham
-
Simon Marlow