Re: Removing/deprecating -fvia-c

15 Feb 2010

dons:
...
marlowsd:
...
...
...
Simon Marlow has recently fixed FP performance for modern x86 chips in
the native code generator in the HEAD. That was the last reason we know
of to prefer via-C to the native code generators. But before we start
the removal process, does anyone know of any other problems with the
native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some
microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be
interested to see them.
I've attached an example where there's a 40% variation (and it's a
floating point benchmark). Roman would be seeing similar examples in the
vector code.
Here's an example that doesn't use floating point:

    import Data.Array.Vector
    import Data.Bits

    main = print . sumU $ zipWith3U (\x y z -> x * y * z)
                            (enumFromToU 1 (100000000 :: Int))
                            (enumFromToU 2 (100000001 :: Int))
                            (enumFromToU 7 (100000008 :: Int))

In Core:

    main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
    main_$s$wfold =
      \ (sc_s1l1 :: Int#)
        (sc1_s1l2 :: Int#)
        (sc2_s1l3 :: Int#)
        (sc3_s1l4 :: Int#) ->
        case ># sc2_s1l3 100000000 of _ {
          False ->
            case ># sc1_s1l2 100000001 of _ {
              False ->
                case ># sc_s1l1 100000008 of _ {
                  False ->
                    main_$s$wfold
                      (+# sc_s1l1 1)
                      (+# sc1_s1l2 1)
                      (+# sc2_s1l3 1)
                      (+#
                         sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
                  True -> sc3_s1l4
                };
              True -> sc3_s1l4
            };
          True -> sc3_s1l4
        }

Rather nice!
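For anyone who wants to try this without the uvector package, here is a rough hand-written counterpart of that worker. It is only a sketch: the names sumProducts, sumProductsList and go are made up for illustration, and the initial counters 7, 2, 1 and the accumulator 0 are my reading of the three enumFromToU ranges, not something copied from the Core. The list version is only there as a slow cross-check for small inputs.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical hand-written version of the fused loop: three counters and
-- an accumulator, with the bound checks in the same order as the Core.
sumProducts :: Int -> Int
sumProducts n = go 7 2 1 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go !c !b !a !acc
      | a > n     = acc   -- first range  [1 .. n]
      | b > n + 1 = acc   -- second range [2 .. n+1]
      | c > n + 8 = acc   -- third range  [7 .. n+8]
      | otherwise = go (c + 1) (b + 1) (a + 1) (acc + a * b * c)

-- Reference version on ordinary lists, for checking small inputs only.
sumProductsList :: Int -> Int
sumProductsList n =
  sum (zipWith3 (\x y z -> x * y * z) [1 .. n] [2 .. n + 1] [7 .. n + 8])

main :: IO ()
main = print (sumProducts 100000000)
```

With a 64-bit Int this should agree with the output of the uvector program, since both rely on the same wrapping Int arithmetic, but I have only reasoned about that, not measured it. Whether GHC compiles this loop to the same Core as the fused pipeline is a separate question that would need checking with -ddump-simpl.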

With -fvia-C -optc-O3:

    Main_mainzuzdszdwfold_info:
            cmpq    $100000000, %rdi
            jg      .L6
            cmpq    $100000001, %rsi
            jg      .L6
            cmpq    $100000008, %r14
            jle     .L10
    .L6:
            movq    %r8, %rbx
            movq    (%rbp), %rax
            jmp     *%rax
    .L10:
            movq    %rsi, %r10
            leaq    1(%rsi), %rsi
            imulq   %rdi, %r10
            leaq    1(%rdi), %rdi
            imulq   %r14, %r10
            leaq    1(%r14), %r14
            leaq    (%r10,%r8), %r8
            jmp     Main_mainzuzdszdwfold_info

Which looks ok.

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.33s user 0.00s system 99% cpu 0.337 total

And with -fasm we get very different code, and a bit of a slowdown:

    Main_mainzuzdszdwfold_info:
    .Lc1mo:
            cmpq $100000000,%rdi
            jg .Lc1mq
            cmpq $100000001,%rsi
            jg .Lc1ms
            cmpq $100000008,%r14
            jg .Lc1mv

            movq %rsi,%rax
            imulq %r14,%rax
            movq %rdi,%rcx
            imulq %rax,%rcx
            movq %r8,%rax
            addq %rcx,%rax
            leaq 1(%rdi),%rcx
            leaq 1(%rsi),%rdx
            incq %r14
            movq %rdx,%rsi
            movq %rcx,%rdi
            movq %rax,%r8
            jmp Main_mainzuzdszdwfold_info

    .Lc1mq:
            movq %r8,%rbx
            jmp *(%rbp)
    .Lc1ms:
            movq %r8,%rbx
            jmp *(%rbp)
    .Lc1mv:
            movq %r8,%rbx
            jmp *(%rbp)

Slower:

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.38s user 0.00s system 98% cpu 0.384 total

Now maybe we need to wait on the new backend optimizations to get there?

-- Don

Don Stewart