
dons:
marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I've attached an example where there's a 40% variation (and it's a floating point benchmark). Roman would be seeing similar examples in the vector code.
Here's an example that doesn't use floating point: import Data.Array.Vector import Data.Bits main = print . sumU $ zipWith3U (\x y z -> x * y * z) (enumFromToU 1 (100000000 :: Int)) (enumFromToU 2 (100000001 :: Int)) (enumFromToU 7 (100000008 :: Int)) In core: main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int# main_$s$wfold = \ (sc_s1l1 :: Int#) (sc1_s1l2 :: Int#) (sc2_s1l3 :: Int#) (sc3_s1l4 :: Int#) -> case ># sc2_s1l3 100000000 of _ { False -> case ># sc1_s1l2 100000001 of _ { False -> case ># sc_s1l1 100000008 of _ { False -> main_$s$wfold (+# sc_s1l1 1) (+# sc1_s1l2 1) (+# sc2_s1l3 1) (+# sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1)); True -> sc3_s1l4 }; True -> sc3_s1l4 }; True -> sc3_s1l4 } Rather nice! -fvia-C -optc-O3 Main_mainzuzdszdwfold_info: cmpq $100000000, %rdi jg .L6 cmpq $100000001, %rsi jg .L6 cmpq $100000008, %r14 jle .L10 .L6: movq %r8, %rbx movq (%rbp), %rax jmp *%rax .L10: movq %rsi, %r10 leaq 1(%rsi), %rsi imulq %rdi, %r10 leaq 1(%rdi), %rdi imulq %r14, %r10 leaq 1(%r14), %r14 leaq (%r10,%r8), %r8 jmp Main_mainzuzdszdwfold_info Which looks ok. $ time ./zipwith3 3541230156834269568 ./zipwith3 0.33s user 0.00s system 99% cpu 0.337 total And -fasm we get very different code, and a bit of a slowdown: Main_mainzuzdszdwfold_info: .Lc1mo: cmpq $100000000,%rdi jg .Lc1mq cmpq $100000001,%rsi jg .Lc1ms cmpq $100000008,%r14 jg .Lc1mv movq %rsi,%rax imulq %r14,%rax movq %rdi,%rcx imulq %rax,%rcx movq %r8,%rax addq %rcx,%rax leaq 1(%rdi),%rcx leaq 1(%rsi),%rdx incq %r14 movq %rdx,%rsi movq %rcx,%rdi movq %rax,%r8 jmp Main_mainzuzdszdwfold_info .Lc1mq: movq %r8,%rbx jmp *(%rbp) .Lc1ms: movq %r8,%rbx jmp *(%rbp) .Lc1mv: movq %r8,%rbx jmp *(%rbp) Slower: $ time ./zipwith3 3541230156834269568 ./zipwith3 0.38s user 0.00s system 98% cpu 0.384 total Now maybe we need to wait on the new backend optimizations to get there? -- Don