
marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year, -fvia-C -optc-O3 was still useful for some microbenchmarks -- what has changed since then, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I've attached an example where there's a 40% variation (and it's a floating-point benchmark). Roman would be seeing similar examples in the vector code. I'm all in favor of dropping the C backend, but I'm also wary that we don't have benchmarks to know what difference it is making.

Here's a simple program testing a tight, floating-point loop:

    import Data.Array.Vector
    import Data.Complex

    main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)

Compiled with ghc 6.12 and uvector-0.1.1.0 on a 64-bit Linux box, -fvia-C -optc-O3 is about 40% faster than -fasm. How does it fare with the new SSE patches? I've attached the assembly below for each case.

-- Don

------------------------------------------------------------------------
Fastest: 2.17s, about 40% faster than -fasm.

    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total

    Main_mainzuzdszdwfold_info:
            leaq 32(%r12), %rax
            movq %r12, %rdx
            cmpq 144(%r13), %rax
            movq %rax, %r12
            ja .L4
            cmpq $1000000000, %r14
            je .L9
    .L5:
            movsd .LC0(%rip), %xmm0
            leaq 1(%r14), %r14
            addsd %xmm0, %xmm5
            addsd %xmm0, %xmm6
            movq %rdx, %r12
            jmp Main_mainzuzdszdwfold_info
    .L4:
            leaq -24(%rbp), %rax
            movq $32, 184(%r13)
            movq %rax, %rbp
            movq %r14, (%rax)
            movsd %xmm5, 8(%rax)
            movsd %xmm6, 16(%rax)
            movl $Main_mainzuzdszdwfold_closure, %ebx
            jmp *-8(%r13)
    .L9:
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
            movsd %xmm5, -16(%rax)
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
            leaq 25(%rdx), %rbx
            movsd %xmm6, 32(%rdx)
            leaq 9(%rdx), %r14
            jmp *(%rbp)

------------------------------------------------------------------------
Second: 2.34s, compiled with -O2 -fvia-C -optc-O3.

    $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total

    Main_mainzuzdszdwfold_info:
            leaq 32(%r12), %rax
            cmpq 144(%r13), %rax
            movq %r12, %rdx
            movq %rax, %r12
            ja .L4
            cmpq $100000000, %r14
            je .L9
    .L5:
            movsd .LC0(%rip), %xmm0
            leaq 1(%r14), %r14
            movq %rdx, %r12
            addsd %xmm0, %xmm5
            addsd %xmm0, %xmm6
            jmp Main_mainzuzdszdwfold_info
    .L4:
            leaq -24(%rbp), %rax
            movq $32, 184(%r13)
            movl $Main_mainzuzdszdwfold_closure, %ebx
            movsd %xmm5, 8(%rax)
            movq %rax, %rbp
            movq %r14, (%rax)
            movsd %xmm6, 16(%rax)
            jmp *-8(%r13)
    .L9:
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
            movsd %xmm5, -16(%rax)
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
            leaq 25(%rdx), %rbx
            movsd %xmm6, 32(%rdx)
            leaq 9(%rdx), %r14
            jmp *(%rbp)

------------------------------------------------------------------------
Native code generator: 3.57s, ghc 6.12 with -O2 -fasm.

    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total

    Main_mainzuzdszdwfold_info:
    .Lc1i7:
            addq $32,%r12
            cmpq 144(%r13),%r12
            ja .Lc1ia
            movq %r14,%rax
            cmpq $100000000,%rax
            jne .Lc1id
            movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
            movsd %xmm5,-16(%r12)
            movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
            movsd %xmm6,(%r12)
            leaq -7(%r12),%rbx
            leaq -23(%r12),%r14
            jmp *(%rbp)
    .Lc1ia:
            movq $32,184(%r13)
            movl $Main_mainzuzdszdwfold_closure,%ebx
            addq $-24,%rbp
            movq %r14,(%rbp)
            movsd %xmm5,8(%rbp)
            movsd %xmm6,16(%rbp)
            jmp *-8(%r13)
    .Lc1id:
            movsd %xmm6,%xmm0
            addsd .Ln1if(%rip),%xmm0
            movsd %xmm5,%xmm7
            addsd .Ln1ig(%rip),%xmm7
            leaq 1(%rax),%r14
            movsd %xmm7,%xmm5
            movsd %xmm0,%xmm6
            addq $-32,%r12
            jmp Main_mainzuzdszdwfold_info
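For reference, the inner loop the uvector benchmark fuses down to is essentially a strict accumulating fold over the real and imaginary components. A dependency-free sketch of that kind of tight floating-point loop (my own analogue, not the benchmark's actual code; `sumOnes` is a hypothetical name, and the iteration count is reduced from 10^9 so it runs quickly) would be:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Strict accumulating loop over the real and imaginary parts,
-- analogous in spirit to the fused sumU/replicateU fold above.
-- Bang patterns keep the accumulators unboxed in the hot loop.
sumOnes :: Int -> (Double, Double)
sumOnes n = go 0 0 0
  where
    go :: Int -> Double -> Double -> (Double, Double)
    go !i !re !im
      | i == n    = (re, im)
      | otherwise = go (i + 1) (re + 1) (im + 1)

main :: IO ()
main = print (sumOnes 1000000)   -- prints (1000000.0,1000000.0)
```

Comparing the -fasm and -fvia-C code generated for a loop like this (e.g. via -ddump-asm) is the quickest way to see the register-shuffling difference visible in the listings above.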