Removing/deprecating -fvia-c

Hi all,

We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.

Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?

Thanks
Ian

igloo:
Hi all,
We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops? As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change? -- Don

I thought GHC's own codegen didn't do any instruction reordering for the
pipeline. I guess that ends up not being much of an issue in practice?
On Sun, Feb 14, 2010 at 12:58 PM, Don Stewart
igloo:
Hi all,
We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
-- Don

On 14/02/2010 17:58, Don Stewart wrote:
igloo:
Hi all,
We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.

What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend, so where previously we had to use -fvia-C -fexcess-precision -optc-O3 etc. to get reasonable floating point performance, now we can use -msse2 with the native code gen and get about the same results.

In the future we have a couple of ways that things could get better:

1. The new back-end, which eventually will incorporate more optimisations at the C-- level, and potentially could produce good loop code. It will also free up some registers.

2. Compiling via LLVM.

Dropping the C backend will give us more flexibility with calling conventions, letting us use more of the x86 registers for passing arguments. We can only make this change by removing -fvia-C, though. There's low hanging fruit here particularly for the x86 backend, as soon as we drop -fvia-C.

There are other reasons to want to get rid of -fvia-C:

- it doubles the testing surface
- it's associated with a bucketload of grotesque Perl 4 code and gcc-specific hacks in the RTS headers.

Cheers,
Simon
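(For concreteness, a minimal sketch of the kind of tight floating-point loop at stake here. The file name, program, and compile lines are illustrative only; the flags are the ones discussed in this thread, and -msse2 with the native code generator is assumed to be available as described above.)

FpLoop.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Illustrative compile lines (not timings or results from the thread):
--   old:  ghc -O2 -fvia-C -fexcess-precision -optc-O3 --make FpLoop.hs
--   new:  ghc -O2 -msse2 --make FpLoop.hs

-- A strict accumulating loop; at -O2 this should compile down to a
-- tight loop over an unboxed Double accumulator.
harmonic :: Int -> Double
harmonic n = go 1 0
  where
    go :: Int -> Double -> Double
    go !i !acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + 1 / fromIntegral i)

main :: IO ()
main = print (harmonic 10000000)
========================================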

marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I've attached an example where there's a 40% variation (and it's a floating point benchmark). Roman would be seeing similar examples in the vector code.

I'm all in favor of dropping the C backend, but I'm also wary that we don't have benchmarks to know what difference it is making.

Here's a simple program testing a tight, floating point loop:

    import Data.Array.Vector
    import Data.Complex

    main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)

Compiled with ghc 6.12, uvector-0.1.1.0 on a 64 bit linux box.

The -fvia-C -optc-O3 is about 40% faster than -fasm. How does it fare with the new sse patches?

I've attached the assembly below for each case..

-- Don

------------------------------------------------------------------------
Fastest. 2.17s. About 40% faster than -fasm

$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total

Main_mainzuzdszdwfold_info:
        leaq 32(%r12), %rax
        movq %r12, %rdx
        cmpq 144(%r13), %rax
        movq %rax, %r12
        ja .L4
        cmpq $1000000000, %r14
        je .L9
.L5:
        movsd .LC0(%rip), %xmm0
        leaq 1(%r14), %r14
        addsd %xmm0, %xmm5
        addsd %xmm0, %xmm6
        movq %rdx, %r12
        jmp Main_mainzuzdszdwfold_info
.L4:
        leaq -24(%rbp), %rax
        movq $32, 184(%r13)
        movq %rax, %rbp
        movq %r14, (%rax)
        movsd %xmm5, 8(%rax)
        movsd %xmm6, 16(%rax)
        movl $Main_mainzuzdszdwfold_closure, %ebx
        jmp *-8(%r13)
.L9:
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd %xmm5, -16(%rax)
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq 25(%rdx), %rbx
        movsd %xmm6, 32(%rdx)
        leaq 9(%rdx), %r14
        jmp *(%rbp)

------------------------------------------------------------------------
Second, 2.34s

$ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total

Main_mainzuzdszdwfold_info:
        leaq 32(%r12), %rax
        cmpq 144(%r13), %rax
        movq %r12, %rdx
        movq %rax, %r12
        ja .L4
        cmpq $100000000, %r14
        je .L9
.L5:
        movsd .LC0(%rip), %xmm0
        leaq 1(%r14), %r14
        movq %rdx, %r12
        addsd %xmm0, %xmm5
        addsd %xmm0, %xmm6
        jmp Main_mainzuzdszdwfold_info
.L4:
        leaq -24(%rbp), %rax
        movq $32, 184(%r13)
        movl $Main_mainzuzdszdwfold_closure, %ebx
        movsd %xmm5, 8(%rax)
        movq %rax, %rbp
        movq %r14, (%rax)
        movsd %xmm6, 16(%rax)
        jmp *-8(%r13)
.L9:
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd %xmm5, -16(%rax)
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq 25(%rdx), %rbx
        movsd %xmm6, 32(%rdx)
        leaq 9(%rdx), %r14
        jmp *(%rbp)

------------------------------------------------------------------------
Native codegen, 3.57s

ghc 6.12 -fasm -O2
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total

Main_mainzuzdszdwfold_info:
.Lc1i7:
        addq $32,%r12
        cmpq 144(%r13),%r12
        ja .Lc1ia
        movq %r14,%rax
        cmpq $100000000,%rax
        jne .Lc1id
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
        movsd %xmm5,-16(%r12)
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
        movsd %xmm6,(%r12)
        leaq -7(%r12),%rbx
        leaq -23(%r12),%r14
        jmp *(%rbp)
.Lc1ia:
        movq $32,184(%r13)
        movl $Main_mainzuzdszdwfold_closure,%ebx
        addq $-24,%rbp
        movq %r14,(%rbp)
        movsd %xmm5,8(%rbp)
        movsd %xmm6,16(%rbp)
        jmp *-8(%r13)
.Lc1id:
        movsd %xmm6,%xmm0
        addsd .Ln1if(%rip),%xmm0
        movsd %xmm5,%xmm7
        addsd .Ln1ig(%rip),%xmm7
        leaq 1(%rax),%r14
        movsd %xmm7,%xmm5
        movsd %xmm0,%xmm6
        addq $-32,%r12
        jmp Main_mainzuzdszdwfold_info

dons:
marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I've attached an example where there's a 40% variation (and it's a floating point benchmark). Roman would be seeing similar examples in the vector code.
Here's an example that doesn't use floating point:

    import Data.Array.Vector
    import Data.Bits

    main = print . sumU $ zipWith3U (\x y z -> x * y * z)
                (enumFromToU 1 (100000000 :: Int))
                (enumFromToU 2 (100000001 :: Int))
                (enumFromToU 7 (100000008 :: Int))

In core:

    main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
    main_$s$wfold =
      \ (sc_s1l1 :: Int#) (sc1_s1l2 :: Int#) (sc2_s1l3 :: Int#) (sc3_s1l4 :: Int#) ->
        case ># sc2_s1l3 100000000 of _ {
          False ->
            case ># sc1_s1l2 100000001 of _ {
              False ->
                case ># sc_s1l1 100000008 of _ {
                  False ->
                    main_$s$wfold (+# sc_s1l1 1) (+# sc1_s1l2 1) (+# sc2_s1l3 1)
                                  (+# sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
                  True -> sc3_s1l4 };
              True -> sc3_s1l4 };
          True -> sc3_s1l4 }

Rather nice!

-fvia-C -optc-O3:

    Main_mainzuzdszdwfold_info:
            cmpq $100000000, %rdi
            jg .L6
            cmpq $100000001, %rsi
            jg .L6
            cmpq $100000008, %r14
            jle .L10
    .L6:
            movq %r8, %rbx
            movq (%rbp), %rax
            jmp *%rax
    .L10:
            movq %rsi, %r10
            leaq 1(%rsi), %rsi
            imulq %rdi, %r10
            leaq 1(%rdi), %rdi
            imulq %r14, %r10
            leaq 1(%r14), %r14
            leaq (%r10,%r8), %r8
            jmp Main_mainzuzdszdwfold_info

Which looks ok.

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.33s user 0.00s system 99% cpu 0.337 total

And with -fasm we get very different code, and a bit of a slowdown:

    Main_mainzuzdszdwfold_info:
    .Lc1mo:
            cmpq $100000000,%rdi
            jg .Lc1mq
            cmpq $100000001,%rsi
            jg .Lc1ms
            cmpq $100000008,%r14
            jg .Lc1mv
            movq %rsi,%rax
            imulq %r14,%rax
            movq %rdi,%rcx
            imulq %rax,%rcx
            movq %r8,%rax
            addq %rcx,%rax
            leaq 1(%rdi),%rcx
            leaq 1(%rsi),%rdx
            incq %r14
            movq %rdx,%rsi
            movq %rcx,%rdi
            movq %rax,%r8
            jmp Main_mainzuzdszdwfold_info
    .Lc1mq:
            movq %r8,%rbx
            jmp *(%rbp)
    .Lc1ms:
            movq %r8,%rbx
            jmp *(%rbp)
    .Lc1mv:
            movq %r8,%rbx
            jmp *(%rbp)

Slower:

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.38s user 0.00s system 98% cpu 0.384 total

Now maybe we need to wait on the new backend optimizations to get there?

-- Don

On 15/02/2010 18:29, Don Stewart wrote:
marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I've attached an example where there's a 40% variation (and it's a floating point benchmark). Roman would be seeing similar examples in the vector code.
I'm all in favor of dropping the C backend, but I'm also wary that we don't have benchmarks to know what difference it is making.
Here's a simple program testing a tight, floating point loop:
import Data.Array.Vector
import Data.Complex
main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)
Compiled with ghc 6.12, uvector-0.1.1.0 on a 64 bit linux box.
The -fvia-C -optc-O3 is about 40% faster than -fasm. How does it fare with the new sse patches?
I've attached the assembly below for each case..
-- Don
------------------------------------------------------------------------
Fastest. 2.17s. About 40% faster than -fasm
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total
Main_mainzuzdszdwfold_info:
        leaq 32(%r12), %rax
        movq %r12, %rdx
        cmpq 144(%r13), %rax
        movq %rax, %r12
        ja .L4
        cmpq $1000000000, %r14
        je .L9
.L5:
        movsd .LC0(%rip), %xmm0
        leaq 1(%r14), %r14
        addsd %xmm0, %xmm5
        addsd %xmm0, %xmm6
        movq %rdx, %r12
        jmp Main_mainzuzdszdwfold_info
.L4:
        leaq -24(%rbp), %rax
        movq $32, 184(%r13)
        movq %rax, %rbp
        movq %r14, (%rax)
        movsd %xmm5, 8(%rax)
        movsd %xmm6, 16(%rax)
        movl $Main_mainzuzdszdwfold_closure, %ebx
        jmp *-8(%r13)
.L9:
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd %xmm5, -16(%rax)
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq 25(%rdx), %rbx
        movsd %xmm6, 32(%rdx)
        leaq 9(%rdx), %r14
        jmp *(%rbp)
------------------------------------------------------------------------
Second, 2.34s
$ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total
Main_mainzuzdszdwfold_info:
        leaq 32(%r12), %rax
        cmpq 144(%r13), %rax
        movq %r12, %rdx
        movq %rax, %r12
        ja .L4
        cmpq $100000000, %r14
        je .L9
.L5:
        movsd .LC0(%rip), %xmm0
        leaq 1(%r14), %r14
        movq %rdx, %r12
        addsd %xmm0, %xmm5
        addsd %xmm0, %xmm6
        jmp Main_mainzuzdszdwfold_info
.L4:
        leaq -24(%rbp), %rax
        movq $32, 184(%r13)
        movl $Main_mainzuzdszdwfold_closure, %ebx
        movsd %xmm5, 8(%rax)
        movq %rax, %rbp
        movq %r14, (%rax)
        movsd %xmm6, 16(%rax)
        jmp *-8(%r13)
.L9:
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd %xmm5, -16(%rax)
        movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq 25(%rdx), %rbx
        movsd %xmm6, 32(%rdx)
        leaq 9(%rdx), %r14
        jmp *(%rbp)
------------------------------------------------------------------------
Native codegen, 3.57s
ghc 6.12 -fasm -O2
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total
Main_mainzuzdszdwfold_info:
.Lc1i7:
        addq $32,%r12
        cmpq 144(%r13),%r12
        ja .Lc1ia
        movq %r14,%rax
        cmpq $100000000,%rax
        jne .Lc1id
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
        movsd %xmm5,-16(%r12)
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
        movsd %xmm6,(%r12)
        leaq -7(%r12),%rbx
        leaq -23(%r12),%r14
        jmp *(%rbp)
.Lc1ia:
        movq $32,184(%r13)
        movl $Main_mainzuzdszdwfold_closure,%ebx
        addq $-24,%rbp
        movq %r14,(%rbp)
        movsd %xmm5,8(%rbp)
        movsd %xmm6,16(%rbp)
        jmp *-8(%r13)
.Lc1id:
        movsd %xmm6,%xmm0
        addsd .Ln1if(%rip),%xmm0
        movsd %xmm5,%xmm7
        addsd .Ln1ig(%rip),%xmm7
        leaq 1(%rax),%r14
        movsd %xmm7,%xmm5
        movsd %xmm0,%xmm6
        addq $-32,%r12
        jmp Main_mainzuzdszdwfold_info
I managed to improve this:

Main_mainzuzdszdwfold_info:
.Lc1lP:
        addq $32,%r12
        cmpq 144(%r13),%r12
        ja .Lc1lS
        movq %r14,%rax
        cmpq $1000000000,%rax
        jne .Lc1lV
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
        movsd %xmm6,-16(%r12)
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
        movsd %xmm5,(%r12)
        leaq -7(%r12),%rbx
        leaq -23(%r12),%r14
        jmp *(%rbp)
.Lc1lS:
        movq $32,184(%r13)
        movl $Main_mainzuzdszdwfold_closure,%ebx
        addq $-24,%rbp
        movsd %xmm5,(%rbp)
        movsd %xmm6,8(%rbp)
        movq %r14,16(%rbp)
        jmp *-8(%r13)
.Lc1lV:
        addsd .Ln1m2(%rip),%xmm5
        addsd .Ln1m3(%rip),%xmm6
        leaq 1(%rax),%r14
        addq $-32,%r12
        jmp Main_mainzuzdszdwfold_info

from 9 instructions in the last block down to 5 (one instruction fewer than gcc). I haven't commoned up the two constant 1's though, that'd mean doing some CSE.

On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s, with the NCG at 2.3s. I put the difference down to a bit of instruction scheduling done by gcc, and that extra constant load.

But let's face it, all of this code is crappy. It should be a tiny little loop rather than a tail-call with argument passing, and that's what we'll get with the new backend (eventually). LLVM probably won't turn it into a loop on its own, that needs to be done before the code gets passed to LLVM.

Have you looked at this example on x86? It's *far* worse and runs about 5 times slower.

Cheers,
Simon

marlowsd:
I managed to improve this:
Main_mainzuzdszdwfold_info:
.Lc1lP:
        addq $32,%r12
        cmpq 144(%r13),%r12
        ja .Lc1lS
        movq %r14,%rax
        cmpq $1000000000,%rax
        jne .Lc1lV
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
        movsd %xmm6,-16(%r12)
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
        movsd %xmm5,(%r12)
        leaq -7(%r12),%rbx
        leaq -23(%r12),%r14
        jmp *(%rbp)
.Lc1lS:
        movq $32,184(%r13)
        movl $Main_mainzuzdszdwfold_closure,%ebx
        addq $-24,%rbp
        movsd %xmm5,(%rbp)
        movsd %xmm6,8(%rbp)
        movq %r14,16(%rbp)
        jmp *-8(%r13)
.Lc1lV:
        addsd .Ln1m2(%rip),%xmm5
        addsd .Ln1m3(%rip),%xmm6
        leaq 1(%rax),%r14
        addq $-32,%r12
        jmp Main_mainzuzdszdwfold_info
from 9 instructions in the last block down to 5 (one instruction fewer than gcc). I haven't commoned up the two constant 1's though, that'd mean doing some CSE.
On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s, with the NCG at 2.3s. I put the difference down to a bit of instruction scheduling done by gcc, and that extra constant load.
But let's face it, all of this code is crappy. It should be a tiny little loop rather than a tail-call with argument passing, and that's what we'll get with the new backend (eventually). LLVM probably won't turn it into a loop on its own, that needs to be done before the code gets passed to LLVM.
Agreed. Ideally the new backend would be (starting to be?) usable about the time -fvia-C dies? Otherwise there's always going to be something that gcc spots that the current codegen won't. Then again, killing perl from the ghc toolchain, and having a funeral/dancing on its grave, would be satisfying in itself :-)
Have you looked at this example on x86? It's *far* worse and runs about 5 times slower.
x86 scares me.. :)

On Feb 16, 2010, at 12:51 PM, Don Stewart wrote:
about the time -fvia-C dies?
I wrote more lines of raw C code in my youth than I'd care to remember, before coming to the realization that the most important benchmark is "brain time" not "cpu time". With this revelation, there is no language like Haskell.

Given the choice, I vote for dedicating our precious development resources to the future of Haskell, not the past of Perl and gcc.

Worrying about what we lose is important, but I'd propose that we continue to benchmark the CURRENT version of GHC using -fvia-c against FUTURE versions of GHC, however it evolves. This doesn't sound fair, but that's my entire point. GHC will get better faster, the sooner that the development team can forget about supporting -fvia-c.

Simon Marlow:
[..] But let's face it, all of this code is crappy. It should be a tiny little loop rather than a tail-call with argument passing, and that's what we'll get with the new backend (eventually). LLVM probably won't turn it into a loop on its own, that needs to be done before the code gets passed to LLVM.
I fully agree with Simon. There is no point in doctoring around with an inherently broken approach, or in wasting developer cycles tracking changes to gcc and so forth. Gladly, we have two technologies ready (the new backend and LLVM) that have the potential to significantly improve the current situation. Let's spend developer cycles on these instead.

Manuel

Am Montag 15 Februar 2010 17:37:55 schrieb Simon Marlow:
On 14/02/2010 17:58, Don Stewart wrote:
igloo:
Hi all,
We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I have a benchmark (or a couple) from the Beginners mailing list two weeks ago (thread starting in January at http://www.haskell.org/pipermail/beginners/2010-January/003356.html and continued in February at http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff) which show a significant difference.

Loop.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = (4/eps)
        !pi14 = pisum mx
    putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
  where
    go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n)
    go True n !s = go False (n+2) (s+recip n)
    go False n !s = go True (n+2) (s-recip n)
========================================

$ echo '1e-8' | time ./Loop

ghc -O2 --make:                                                          4.53s
ghc -O2 -fexcess-precision --make:                                       4.54s
ghc -O2 -fvia-C -optc-O3 --make:                                         7.52s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make:                        7.53s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make:  3.02s
ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make:                   3.02s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:                      3.02s

The loop coded in C and compiled with gcc -O3 [-ffast-math, -fno-float-store, -msse2 make no difference there] also takes 3.02s (gcc-4.3.2), 2.70s with icc -O3 (icc 11.0).

It is probably worth pointing out, however, that on Markus Böhm's box running Windows XP, the native code generator produced better code than the via-C route (NCG code was faster there than on my box [openSUSE 11.1], while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than NCG on mine).

Similar results for Fusion.hs (uses the stream-fusion package):
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1)
  where
    phi (sig,d) | sig       = Just (1/d, (False,d+2))
                | otherwise = Just (negate (1/d), (True,d+2))
========================================

ghc -O2 [-fexcess-precision] --make:                 4.22s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:  3.02s

Using lists instead of loops, List.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1)
  where
    phi (sig,d) | sig       = Just (1/d, (False,d+2))
                | otherwise = Just (negate (1/d), (True,d+2))
========================================

things are much slower, 23.60s vs. 18.15s, but the via-C route is again significantly faster.
What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend, so where previously we had to use -fvia-C -fexcess-precision -optc-O3 etc. to get reasonable floating point performance, now we can use -msse2 with the native code gen and get about the same results.
Can I test whether I get about the same results as with -fvia-C ... for the above? I.e., is it in the HEAD, and would I have to pass -msse2 on the command line or is it implied by -O2 already?
In the future we have a couple of ways that things could get better:
1. The new back-end, which eventually will incorporate more optimisations at the C-- level, and potentially could produce good loop code. It will also free up some registers.
2. Compiling via LLVM.
Dropping the C backend will give us more flexibility with calling conventions, letting us use more of the x86 registers for passing arguments. We can only make this change by removing -fvia-C, though. There's low hanging fruit here particularly for the x86 backend, as soon as we drop -fvia-C.
There are other reasons to want to get rid of -fvia-C:
- it doubles the testing surface
- it's associated with a bucketload of grotesque Perl 4 code and gcc-specific hacks in the RTS headers.
Cheers, Simon

On 15/02/2010, at 04:58, Don Stewart wrote:
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
FWIW, I don't think we even use -fvia-C when benchmarking. In general, -fvia-C is a dead end wrt numeric performance because gcc just doesn't optimise well enough. So even if we generated code that gcc could optimise properly (which we don't atm), we still would be way behind highly optimising compilers like Intel's or Sun's. IMO, the LLVM backend is the way to go here. Roman

On 02/16/10 20:13, Roman Leshchinskiy wrote:
On 15/02/2010, at 04:58, Don Stewart wrote:
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
FWIW, I don't think we even use -fvia-C when benchmarking. In general, -fvia-C is a dead end wrt numeric performance because gcc just doesn't optimise well enough. So even if we generated code that gcc could optimise properly (which we don't atm), we still would be way behind highly optimising compilers like Intel's or Sun's. IMO, the LLVM backend is the way to go here.
LLVM and GCC are open-source projects that are improving over time... is there any particular reason we expect GCC to have poor numeric performance forever?

[now I rehash why to remove -fvia-C anyway. Feel free to ignore me.]

...However, we think the native-code backends (and perhaps LLVM) will be good enough within the next few years to catch up with registerized via-C; and as soon as they do, it'll be an advantage to remove -fvia-C so that we're not restricted by C's calling conventions and other such restrictions. And it's much easier to predict GCC's path for just the next two years (for example).

Actually, even if GCC somehow becomes the premier optimizer that we'll never be able to replicate with our own feeble efforts, it still may not be worth keeping/using/maintaining the current perl hacks for that purpose (combining the effort needed to maintain them over the time until then, and the effort needed to adapt these hacks to C with loops that let GCC optimize better... er, it probably won't be *that much* better except in things simple enough that it's easier to add optimizations inside GHC than to maintain this backend anyway).

scooter.phd wrote:
Before the "Pile on Scooter" fest starts, bear in mind that LLVM effectively restricts you to its current backends. As the guy who started CellSPU in LLVM and who needs a good couple of research months off to finish it, think this through. Carefully.
Currently we have:

- "unregisterized", using GCC. This is the most portable, fairly reliable, and will stay.
- "registerized" (-fvia-C), using GCC. This tends to break (e.g. registerized PowerPC is broken).
- -fasm (native code generator). This can break from refactoring in GHC but not from new versions of GCC, so it is less annoying than the -fvia-C path.

And we'll hopefully have:

- an LLVM backend, with (currently) fewer supported architectures than GCC but more than GHC's native codegens.

If you finish your CellSPU work, great! GHC 6.16 or so might be able to perform well on CellSPU! If not, then the present situation of using the "unregisterized" (slow) C backend will still be available; we don't lose much by removing the few current "registerized" backends. (Actually it's likely to require some build-system fixes in porting to any new platform, even with the unregisterized backend.)

(Also, if it turns out to be easier to make a GHC native-code generator backend than an LLVM backend, then maybe that will be yet another theoretical possibility!)

-Isaac

On 17/02/2010, at 18:37, Isaac Dupree wrote:
LLVM and GCC are open-source projects that are improving over time... is there any particular reason we expect GCC to have poor numeric performance forever?
Past experience :-) GCC has been around for a while and if it doesn't optimise numeric code well by now, there is no reason to believe that it ever will. Roman

On 17/02/10 07:37, Isaac Dupree wrote:
On 02/16/10 20:13, Roman Leshchinskiy wrote:
On 15/02/2010, at 04:58, Don Stewart wrote:
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
FWIW, I don't think we even use -fvia-C when benchmarking. In general, -fvia-C is a dead end wrt numeric performance because gcc just doesn't optimise well enough. So even if we generated code that gcc could optimise properly (which we don't atm), we still would be way behind highly optimising compilers like Intel's or Sun's. IMO, the LLVM backend is the way to go here.
LLVM and GCC are open-source projects that are improving over time... is there any particular reason we expect GCC to have poor numeric performance forever?
The problem is not the quality of the code generator in gcc vs. LLVM; indeed gcc is generally regarded as generating better code than LLVM right now, although LLVM is improving. The reason that using gcc is worse than LLVM for us is that when GHC uses gcc as a backend it generates C, whereas the LLVM backend generates code directly from GHC's internal C-- representation. Compiling via C is a tricky business that ultimately leads to not being able to generate as good code as you want(*). It would be entirely possible to hook into gcc's backend directly from GHC as an alternative to the LLVM backend, though LLVM is really intended to be used this way and has a more polished API.

Even so, LLVM doesn't let us generate exactly the code we'd like: we can't use GHC's tables-next-to-code optimisation. Measurements done by David Terei, who built the LLVM backend, apparently show that this doesn't matter much (~3% slower IIRC), though I'm still surprised that all those extra indirections don't have more of an effect; I think we need to investigate this more closely. It's important because if the LLVM backend is to be a compile-time option, we have to either drop tables-next-to-code, or wait until LLVM supports generating code in that style.

(*) Though the main reason for this is the need to keep accurate GC information; if you're prepared to forego that (as in JHC) then you can generate much more optimisable C code.
[now I rehash why to remove -fvia-C anyway. Feel free to ignore me.]
...However, we think the native-code backends (and perhaps LLVM) will be good enough within the next few years to catch up with registerized via-C;
I should point out that for most Haskell programs, the NCG is already as fast (in some cases faster) than via C. The benchmarks showing a difference are all of the small tight loop kind - which are important to some people, I don't dispute that, but I expect that most people wouldn't notice the difference. Cheers, Simon

Am Mittwoch 17 Februar 2010 15:19:33 schrieb Simon Marlow:
I should point out that for most Haskell programs, the NCG is already as fast (in some cases faster) than via C. The benchmarks showing a difference are all of the small tight loop kind - which are important to some people, I don't dispute that, but I expect that most people wouldn't notice the difference.
Probably. And where the tight loop takes a significant amount of the running time, one can usually write that in C and use the FFI if the NCG doesn't produce comparable code. Granted, it's not as nice, but removing the via-C route won't be a show-stopper for those loops either, I think.
Cheers, Simon
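(A minimal sketch of the FFI route just described, under the assumption that the inner loop lives in a separately compiled C file; the function name pisum_c and its C-side implementation are hypothetical:)

========================================
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Foreign.C.Types (CDouble)

-- Assumed to be provided by a separately compiled pisum.c:
--   double pisum_c(double cut);
foreign import ccall unsafe "pisum_c"
    c_pisum :: CDouble -> CDouble

main :: IO ()
main = do
    eps <- readLn :: IO Double
    -- The tight loop runs in C; Haskell only marshals the argument and result.
    print (4 * realToFrac (c_pisum (realToFrac (4 / eps))) :: Double)
========================================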

On Wed, Feb 17, 2010 at 6:19 AM, Simon Marlow
On 17/02/10 07:37, Isaac Dupree wrote:
On 02/16/10 20:13, Roman Leshchinskiy wrote:
On 15/02/2010, at 04:58, Don Stewart wrote:
Do we have the blessing of the DPH team, wrt. tight, numeric inner
loops?
FWIW, I don't think we even use -fvia-C when benchmarking. In general, -fvia-C is a dead end wrt numeric performance because gcc just doesn't optimise well enough. So even if we generated code that gcc could optimise properly (which we don't atm), we still would be way behind highly optimising compilers like Intel's or Sun's. IMO, the LLVM backend is the way to go here.
LLVM and GCC are open-source projects that are improving over time... is there any particular reason we expect GCC to have poor numeric performance forever?
The problem is not the quality of the code generator in gcc vs. LLVM; indeed gcc is generally regarded as generating better code than LLVM right now, although LLVM is improving.
Depends a lot on the benchmark. The FreeBSD kernel dev crowd (one of whom works for me) have seen performance improvements between 10-20% using LLVM and clang over gcc. It also depends heavily on which optimization passes you have LLVM invoke -- bear in mind that LLVM is a compiler optimization infrastructure first and foremost.

The reason that using gcc is worse than LLVM for us is that when GHC uses gcc as a backend it generates C, whereas the LLVM backend generates code directly from GHC's internal C-- representation. Compiling via C is a tricky business that ultimately leads to not being able to generate as good code as you want(*). It would be entirely possible to hook into gcc's backend directly from GHC as an alternative to the LLVM backend, though LLVM is really intended to be used this way and has a more polished API.
Let's be a bit more specific: you can directly generate the LLVM intermediate representation (IR) and pass that off to the LLVM optimization passes. With gcc, you generate C code that is then processed by the gcc command line interface. I wouldn't suggest hooking into gcc's anything. LLVM is much cleaner.
Even so, LLVM doesn't let us generate exactly the code we'd like: we can't use GHC's tables-next-to-code optimisation. Measurements done by David Terei who built the LLVM backend apparently show that this doesn't matter much (~3% slower IIRC), though I'm still surprised that all those extra indirections don't have more of an effect, I think we need to investigate this more closely. It's important because if the LLVM backend is to be a compile-time option, we have to either drop tables-next-to-code, or wait until LLVM supports generating code in that style.
This sounds like an impedance mismatch between GHC's concept of IR and LLVM's. There's also the danger of trying to prematurely optimize for LLVM as if it were a native backend, rather than separating GHC- and Haskell-specific optimizations from LLVM's optimizations. Thinking you can do your own register allocation better than LLVM, for example, is a common one that I've seen. Just don't -- value tracing through SSA is what LLVM is spectacularly good at. Use as many temporary variables as you like; LLVM will eventually eliminate them. All that said, if LLVM were to replace GHC's native backends, then one could shift focus to engineering the IR impedance mismatch.

[disclaimer: grain of salt speculation, haven't read the code] Tables-next-to-code has an obvious cache-friendliness property, BTW. Generally, there's going to be some instruction prefetch into the cache. This is likely why it's faster. Otherwise, you have to warm up the data cache, since LLVM spills the tables into the target's constant pool.

(*) Though the main reason for this is the need to keep accurate GC information; if you're prepared to forego that (as in JHC) then you can generate much more optimisable C code.
[now I rehash why to remove -fvia-C anyway. Feel free to ignore me.]
...However, we think the native-code backends (and perhaps LLVM) will be good enough within the next few years to catch up with registerized via-C;
I should point out that for most Haskell programs, the NCG is already as fast (in some cases faster) than via C. The benchmarks showing a difference are all of the small tight loop kind - which are important to some people, I don't dispute that, but I expect that most people wouldn't notice the difference.
NCGs should be faster than plain old C. Trying to produce optimized C is a fool's errand, and I'm starting to agree with dropping that. My worry was that the C backend would be dropped in its entirety, also a fool's errand.

-scooter

On 17/02/2010 21:15, Scott Michel wrote:
Depends a lot on the benchmark. The FreeBSD kernel dev crowd (one of whom works for me) have seen performance improvements between 10-20% using LLVM and clang over gcc. It also depends heavily on which optimization passes you have LLVM invoke -- bear in mind that LLVM is a compiler optimization infrastructure first and foremost.
Right, such benchmarks tend to be quickly out of date especially when both projects are being actively developed. I have no vested interest in either - we'll use whatever suits us better, and that seems to be LLVM.
Even so, LLVM doesn't let us generate exactly the code we'd like: we can't use GHC's tables-next-to-code optimisation. Measurements done by David Terei who built the LLVM backend apparently show that this doesn't matter much (~3% slower IIRC), though I'm still surprised that all those extra indirections don't have more of an effect, I think we need to investigate this more closely. It's important because if the LLVM backend is to be a compile-time option, we have to either drop tables-next-to-code, or wait until LLVM supports generating code in that style.
This sounds like an impedance mismatch between GHC's concept of IR and LLVM's.
It certainly is an impedance mismatch - there's no good reason why LLVM couldn't generate the code we want, but its IR doesn't allow us to represent it. So there's every reason to believe that this could be fixed in LLVM without too much difficulty. We can work around the impedance mismatch the other way, by not using tables-next-to-code in GHC, but that costs us a bit in performance.
[disclaimer: grain of salt speculation, haven't read the code] Tables-next-to-code has an obvious cache-friendliness property, BTW.
Oh absolutely, that's why it's not a clear win: some people argue that the code cache pollution should outweigh the negative effects of those extra indirections. Having seen the effect of branch mispredictions, though, I'm inclined to believe that those indirections are more expensive. The cost is this: every return to a stack frame takes two indirections rather than one.

Of course GHC's two representations are not the only two you could choose - people have been designing clever ways to map code addresses to data structures for a long time. If returning to a stack frame is the dominant operation, then you would put the return address on the stack and use a hash table to map those to info tables. That trades off mutator time against GC time, and we don't know whether it would be a win, but we do know it would take a lot of effort to find out. The tables-next-to-code representation means you don't have to fiddle around with hash tables, so it's simpler and probably faster.
Generally, there's going to be some instruction prefetch into the cache. This is likely why it's faster. Otherwise, you have to warm up the data cache, since LLVM spills the tables into the target's constant pool.
Not sure what "spills the tables" means, but maybe that's not important.
NCGs should be faster than plain old C. Trying to produce optimized C is the fool's errand, and I'm starting to agree with dropping that. My worry was that the C backend would be dropped in its entirety, also a fool's errand.
Yes, exactly. Cheers, Simon

On Tue, Feb 16, 2010 at 11:37 PM, Isaac Dupree <ml@isaac.cedarswampstudios.org> wrote:
If you finish your CellSPU work, great! GHC 6.16 or so might be able to perform well on CellSPU! If not, then the present situation of using the "unregisterized" (slow) C backend will still be available; we don't lose much by removing the few current "registerized" backends. (Actually it's likely to require some build-system fixes in porting to any new platform, even with the unregisterized backend.) (Also, if it turns out to be easier to make a GHC native-code generator backend than an LLVM backend, then maybe that will be yet another theoretical possibility!)
I might be able to get to it, but the odds are low, for two reasons. First, IBM has effectively stopped all development on future Cell hardware (if the lack of Cell demos at Supercomputing is a good indication, plus no new IBM Cell hardware for two years and Sony looking for a new processor for the PlayStation 4). Second, the combination of functional programming and hybrid multicore is not well understood. Sure, progress is being made on functional programming on GPUs for specific kinds of problems -- but GPUs are fundamentally a data stream model of programming. Cell is a very different hybrid multicore model that isn't entirely data stream, but has some of its characteristics, along with a lot of physical limitations.

A better bet would be to target a tiled array processor, such as the Tilera Tile64 (aka MIT RAW).

-scooter

On Sun, Feb 14, 2010 at 5:51 PM, Ian Lynagh
Hi all,
We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
What about loop unrolling? I remember seeing some discussion about that quite a while ago. Do we have any feature requests (i.e. performance bugs) related to that? Cheers, Johan
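(For reference, a hand-unrolled sketch of the transformation being asked about -- purely illustrative, showing a loop body unrolled by a factor of four by hand rather than by the compiler:)

========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Sum 1..n with the loop body unrolled four times, plus a clean-up
-- loop for the remaining elements.
sumUnrolled :: Int -> Int
sumUnrolled n = go 1 0
  where
    go !i !acc
      | i + 3 <= n = go (i + 4) (acc + i + (i + 1) + (i + 2) + (i + 3))
      | i <= n     = go (i + 1) (acc + i)
      | otherwise  = acc

main :: IO ()
main = print (sumUnrolled 10000000)
========================================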
participants (11)
- Daniel Fischer
- Daniel Peebles
- Dave Bayer
- Don Stewart
- Ian Lynagh
- Isaac Dupree
- Johan Tibell
- Manuel M T Chakravarty
- Roman Leshchinskiy
- Scott Michel
- Simon Marlow