Re: [Haskell-cafe] Re: speed: ghc vs gcc

20 Feb 2009

      Don Stewart wrote:
...
If we take what I usually see as the best loops GHC can do for this kind
of thing:
import Data.Array.Vector
main = print (sumU (enumFromToU 1 (10^9 :: Int)))
And compile it:
$ ghc-core A.hs -O2 -fvia-C -optc-O3
We get ideal core, all data structures fused away, and no heap allocation:
$wfold_s15t :: Int# -> Int# -> Int#
    $wfold_s15t =
      \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) ->
        case ># ww2_s154 ww_s14U of wild_aWm {
          False ->
            $wfold_s15t
              (+# ww1_s150 ww2_s154) (+# ww2_s154 1);
          True -> ww1_s150
        }; } in
    case $wfold_s15t 0 1
Which produces nice assembly:
s16e_info:
      cmpq        6(%rbx), %rdi
      jg  .L2
      addq        %rdi, %rsi
      leaq        1(%rdi), %rdi
      jmp s16e_info
Note that this does the addition to the accumulator first, and then
increments the counter.

[snip]
...
I wondered if we just got worse code on backwards counting loops. So
translating into the "obvious" translation, counting up:
main = print (sum0 0 1)
sum0 :: Int -> Int -> Int
    sum0 acc n | n > 10^9  = acc
               | otherwise = sum0 (acc + n) (n + 1)
We start to notice something interesting:
$wsum0 :: Int# -> Int# -> Int#
    $wsum0 =
      \ (ww_sOH :: Int#) (ww1_sOL :: Int#) ->
        case lvl2 of wild1_aHn { I# y_aHp ->
        case ># ww1_sOL y_aHp of wild_B1 {
          False ->
            letrec {
$wsum01_XPd :: Int# -> Int# -> Int#
              $wsum01_XPd =
                \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) ->
                  case ># ww3_XP9 y_aHp of wild11_Xs {
                    False ->
                      $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1);
                    True -> ww2_XP4
                  }; } in
            $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1);
True -> ww_sOH
    }
This is odd, but it doesn't hurt the inner loop, which only involves
$wsum01_XPd, and is identical to $wfold_s15t above.
...
Checking the asm:
    $ ghc -O2 -fasm
sQ3_info:
    .LcRt:
      cmpq 8(%rbp),%rsi
      jg .LcRw
      leaq 1(%rsi),%rax
      addq %rsi,%rbx
      movq %rax,%rsi
      jmp sQ3_info
So for some reason ghc ends up doing the (n + 1) addition before the
(acc + n) addition in this case - this accounts for the extra
instruction, because both n+1 and n need to be kept around for the
duration of the addq (which does the acc + n addition).
...
Checking via C:
$ ghc -O2 -optc-O3 -fvia-C
Better code, but still a bit slower:
sQ3_info:
          cmpq        8(%rbp), %rsi
          jg  .L8
          addq        %rsi, %rbx
          leaq        1(%rsi), %rsi
          jmp sQ3_info
This code is identical (up to renaming registers and one offset that
I can't fully explain, but is probably related to a slight difference
in handling pointer tags between the two versions of the code) to the
"nice assembly" above.
...
Running:
$ time   ./B
        500000000500000000
        ./B  1.01s user 0.01s system 97% cpu 1.035 total
Hmm, about 5% slower, are you sure this isn't just noise?

If not noise, it may be some alignment effect. Hard to say.

Bertram

Re: [Haskell-cafe] Re: speed: ghc vs gcc

Bertram Felgenhauer