
marlowsd:
I managed to improve this:
Main_mainzuzdszdwfold_info:
.Lc1lP:
    addq $32,%r12
    cmpq 144(%r13),%r12
    ja .Lc1lS
    movq %r14,%rax
    cmpq $1000000000,%rax
    jne .Lc1lV
    movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
    movsd %xmm6,-16(%r12)
    movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
    movsd %xmm5,(%r12)
    leaq -7(%r12),%rbx
    leaq -23(%r12),%r14
    jmp *(%rbp)
.Lc1lS:
    movq $32,184(%r13)
    movl $Main_mainzuzdszdwfold_closure,%ebx
    addq $-24,%rbp
    movsd %xmm5,(%rbp)
    movsd %xmm6,8(%rbp)
    movq %r14,16(%rbp)
    jmp *-8(%r13)
.Lc1lV:
    addsd .Ln1m2(%rip),%xmm5
    addsd .Ln1m3(%rip),%xmm6
    leaq 1(%rax),%r14
    addq $-32,%r12
    jmp Main_mainzuzdszdwfold_info
from 9 instructions in the last block down to 5 (one instruction fewer than gcc). I haven't commoned up the two constant 1s though; that would mean doing some CSE.
On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s, with the NCG at 2.3s. I put the difference down to a bit of instruction scheduling done by gcc, and that extra constant load.
But let's face it, all of this code is crappy. It should be a tiny little loop rather than a tail-call with argument passing, and that's what we'll get with the new backend (eventually). LLVM probably won't turn it into a loop on its own; that needs to be done before the code gets passed to LLVM.
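For reference, a rough sketch of the kind of source loop that compiles to code of this shape: an Int counter plus two Double accumulators, each bumped by a constant 1 per iteration. This is not the original program. The names `loop` and `go`, the zero start values, and the `limit` parameter (standing in for the hard-coded 1000000000) are all assumptions made for illustration.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Hypothetical reconstruction, not the original test program.
-- 'limit' replaces the hard-coded 1000000000 so the sketch runs quickly.
loop :: Int -> (Double, Double)
loop limit = go 0 0 0
  where
    -- Strict worker: one Int counter and two Double accumulators,
    -- mirroring %r14, %xmm5 and %xmm6 in the assembly above.
    go :: Int -> Double -> Double -> (Double, Double)
    go !n !a !b
      | n == limit = (a, b)                       -- exit: box both doubles and return
      | otherwise  = go (n + 1) (a + 1) (b + 1)   -- self tail-call (the .Lc1lV block)

main :: IO ()
main = print (loop 1000)   -- prints (1000.0,1000.0)
```

The self tail-call in `go` is what shows up as the jump back to Main_mainzuzdszdwfold_info, re-running the heap check each time around, instead of a tight loop over the three registers.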
Agreed. Ideally the new backend would be (starting to be?) usable about the time -fvia-C dies? Otherwise there's always going to be something that gcc spots that the current codegen won't. Then again, killing perl from the ghc toolchain, and having a funeral/dancing on its grave, would be satisfying in itself :-)
Have you looked at this example on x86? It's *far* worse and runs about 5 times slower.
x86 scares me.. :)