
Don Stewart wrote:
If we take what I usually see as the best loops GHC can do for this kind of thing:
import Data.Array.Vector
main = print (sumU (enumFromToU 1 (10^9 :: Int)))
And compile it:
$ ghc-core A.hs -O2 -fvia-C -optc-O3
We get ideal core, all data structures fused away, and no heap allocation:
$wfold_s15t :: Int# -> Int# -> Int# $wfold_s15t = \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) -> case ># ww2_s154 ww_s14U of wild_aWm { False -> $wfold_s15t (+# ww1_s150 ww2_s154) (+# ww2_s154 1); True -> ww1_s150 }; } in case $wfold_s15t 0 1
Which produces nice assembly:
s16e_info: cmpq 6(%rbx), %rdi jg .L2 addq %rdi, %rsi leaq 1(%rdi), %rdi jmp s16e_info
Note that this does the addition to the accumulator first, and then increments the counter. [snip]
I wondered if we just got worse code on backwards counting loops. So translating into the "obvious" translation, counting up:
main = print (sum0 0 1)
sum0 :: Int -> Int -> Int sum0 acc n | n > 10^9 = acc | otherwise = sum0 (acc + n) (n + 1)
We start to notice something interesting:
$wsum0 :: Int# -> Int# -> Int# $wsum0 = \ (ww_sOH :: Int#) (ww1_sOL :: Int#) -> case lvl2 of wild1_aHn { I# y_aHp -> case ># ww1_sOL y_aHp of wild_B1 { False -> letrec {
$wsum01_XPd :: Int# -> Int# -> Int# $wsum01_XPd = \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) -> case ># ww3_XP9 y_aHp of wild11_Xs { False -> $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1); True -> ww2_XP4 }; } in $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1);
True -> ww_sOH }
This is odd, but it doesn't hurt the inner loop, which only involves $wsum01_XPd, and is identical to $wfold_s15t above.
Checking the asm: $ ghc -O2 -fasm
sQ3_info: .LcRt: cmpq 8(%rbp),%rsi jg .LcRw leaq 1(%rsi),%rax addq %rsi,%rbx movq %rax,%rsi jmp sQ3_info
So for some reason ghc ends up doing the (n + 1) addition before the (acc + n) addition in this case - this accounts for the extra instruction, because both n+1 and n need to be kept around for the duration of the addq (which does the acc + n addition).
Checking via C:
$ ghc -O2 -optc-O3 -fvia-C
Better code, but still a bit slower:
sQ3_info: cmpq 8(%rbp), %rsi jg .L8 addq %rsi, %rbx leaq 1(%rsi), %rsi jmp sQ3_info
This code is identical (up to renaming registers and one offset that I can't fully explain, but is probably related to a slight difference in handling pointer tags between the two versions of the code) to the "nice assembly" above.
Running:
$ time ./B 500000000500000000 ./B 1.01s user 0.01s system 97% cpu 1.035 total
Hmm, about 5% slower, are you sure this isn't just noise? If not noise, it may be some alignment effect. Hard to say. Bertram