
marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year, -fvia-C -optc-O3 was still useful for some microbenchmarks -- what has changed since then, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I've attached an example where there's a 40% variation (and it's a floating-point benchmark). Roman would be seeing similar examples in the vector code. I'm all in favor of dropping the C backend, but I'm also wary that we don't have benchmarks to know what difference it is making.

Here's a simple program testing a tight, floating-point loop:

    import Data.Array.Vector
    import Data.Complex

    main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)

Compiled with ghc 6.12 and uvector-0.1.1.0 on a 64-bit Linux box, -fvia-C -optc-O3 is about 40% faster than -fasm. How does it fare with the new SSE patches? I've attached the assembly below for each case.

-- Don

------------------------------------------------------------------------
Fastest: 2.17s, about 40% faster than -fasm.

    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total

    Main_mainzuzdszdwfold_info:
            leaq 32(%r12), %rax
            movq %r12, %rdx
            cmpq 144(%r13), %rax
            movq %rax, %r12
            ja .L4
            cmpq $1000000000, %r14
            je .L9
    .L5:
            movsd .LC0(%rip), %xmm0
            leaq 1(%r14), %r14
            addsd %xmm0, %xmm5
            addsd %xmm0, %xmm6
            movq %rdx, %r12
            jmp Main_mainzuzdszdwfold_info
    .L4:
            leaq -24(%rbp), %rax
            movq $32, 184(%r13)
            movq %rax, %rbp
            movq %r14, (%rax)
            movsd %xmm5, 8(%rax)
            movsd %xmm6, 16(%rax)
            movl $Main_mainzuzdszdwfold_closure, %ebx
            jmp *-8(%r13)
    .L9:
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
            movsd %xmm5, -16(%rax)
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
            leaq 25(%rdx), %rbx
            movsd %xmm6, 32(%rdx)
            leaq 9(%rdx), %r14
            jmp *(%rbp)

------------------------------------------------------------------------
Second: 2.34s, compiled with -O2 -fvia-C -optc-O3.

    $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total

    Main_mainzuzdszdwfold_info:
            leaq 32(%r12), %rax
            cmpq 144(%r13), %rax
            movq %r12, %rdx
            movq %rax, %r12
            ja .L4
            cmpq $100000000, %r14
            je .L9
    .L5:
            movsd .LC0(%rip), %xmm0
            leaq 1(%r14), %r14
            movq %rdx, %r12
            addsd %xmm0, %xmm5
            addsd %xmm0, %xmm6
            jmp Main_mainzuzdszdwfold_info
    .L4:
            leaq -24(%rbp), %rax
            movq $32, 184(%r13)
            movl $Main_mainzuzdszdwfold_closure, %ebx
            movsd %xmm5, 8(%rax)
            movq %rax, %rbp
            movq %r14, (%rax)
            movsd %xmm6, 16(%rax)
            jmp *-8(%r13)
    .L9:
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
            movsd %xmm5, -16(%rax)
            movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
            leaq 25(%rdx), %rbx
            movsd %xmm6, 32(%rdx)
            leaq 9(%rdx), %r14
            jmp *(%rbp)

------------------------------------------------------------------------
Native code generator: 3.57s, ghc 6.12 with -O2 -fasm.

    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total

    Main_mainzuzdszdwfold_info:
    .Lc1i7:
            addq $32,%r12
            cmpq 144(%r13),%r12
            ja .Lc1ia
            movq %r14,%rax
            cmpq $100000000,%rax
            jne .Lc1id
            movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
            movsd %xmm5,-16(%r12)
            movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
            movsd %xmm6,(%r12)
            leaq -7(%r12),%rbx
            leaq -23(%r12),%r14
            jmp *(%rbp)
    .Lc1ia:
            movq $32,184(%r13)
            movl $Main_mainzuzdszdwfold_closure,%ebx
            addq $-24,%rbp
            movq %r14,(%rbp)
            movsd %xmm5,8(%rbp)
            movsd %xmm6,16(%rbp)
            jmp *-8(%r13)
    .Lc1id:
            movsd %xmm6,%xmm0
            addsd .Ln1if(%rip),%xmm0
            movsd %xmm5,%xmm7
            addsd .Ln1ig(%rip),%xmm7
            leaq 1(%rax),%r14
            movsd %xmm7,%xmm5
            movsd %xmm0,%xmm6
            addq $-32,%r12
            jmp Main_mainzuzdszdwfold_info
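For reference, the inner loop the uvector benchmark fuses down to is essentially a strict accumulating fold over the real and imaginary components. A dependency-free sketch of that kind of tight floating-point loop (my own analogue, not the benchmark's actual code; `sumOnes` is a hypothetical name, and the iteration count is reduced from 10^9 so it runs quickly) would be:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Strict accumulating loop over the real and imaginary parts,
-- analogous in spirit to the fused sumU/replicateU fold above.
-- Bang patterns keep the accumulators unboxed in the hot loop.
sumOnes :: Int -> (Double, Double)
sumOnes n = go 0 0 0
  where
    go :: Int -> Double -> Double -> (Double, Double)
    go !i !re !im
      | i == n    = (re, im)
      | otherwise = go (i + 1) (re + 1) (im + 1)

main :: IO ()
main = print (sumOnes 1000000)   -- prints (1000000.0,1000000.0)
```

Comparing the -fasm and -fvia-C code generated for a loop like this (e.g. via -ddump-asm) is the quickest way to see the register-shuffling difference visible in the listings above.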