
bertram.felgenhauer:
This is odd, but it doesn't hurt the inner loop, which only involves $wsum01_XPd, and is identical to $wfold_s15t above.
Checking the asm:
$ ghc -O2 -fasm
sQ3_info:
.LcRt:
    cmpq 8(%rbp),%rsi
    jg .LcRw
    leaq 1(%rsi),%rax
    addq %rsi,%rbx
    movq %rax,%rsi
    jmp sQ3_info
So for some reason ghc ends up doing the (n + 1) addition before the (acc + n) addition in this case - this accounts for the extra instruction, because both n+1 and n need to be kept around for the duration of the addq (which does the acc + n addition).
Yep, well spotted.
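For readers following along, here is a rough Haskell reading of the inner loop under discussion. It is a sketch reconstructed from the assembly and from the printed result, not the source from this thread: the name sumFromTo and the argument order are assumptions, and the mapping of %rsi to n, %rbx to acc and 8(%rbp) to the upper bound is just my reading of the listing.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical reconstruction of the worker loop; the function name
-- and argument order are illustrative, not taken from the thread.
sumFromTo :: Int -> Int -> Int
sumFromTo lo hi = go 0 lo
  where
    -- acc is the running total, n the current counter.
    go :: Int -> Int -> Int
    go !acc !n
      | n > hi    = acc                   -- the cmpq/jg exit test
      | otherwise = go (acc + n) (n + 1)  -- one add for acc, one increment for n
```

In GHCi, sumFromTo 1 1000 evaluates to 500500. The value printed by ./B, 500000000500000000, is the sum of 1 through 10^9, which is why I assume the benchmark runs a loop of this shape up to 10^9.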
Checking via C:
$ ghc -O2 -optc-O3 -fvia-C
Better code, but still a bit slower:
sQ3_info:
    cmpq 8(%rbp), %rsi
    jg .L8
    addq %rsi, %rbx
    leaq 1(%rsi), %rsi
    jmp sQ3_info
This code is identical (up to renaming registers and one offset that I can't fully explain, but is probably related to a slight difference in handling pointer tags between the two versions of the code) to the "nice assembly" above.
Indeed, which is gratifying.
Running:
$ time ./B
500000000500000000
./B  1.01s user 0.01s system 97% cpu 1.035 total
Hmm, about 5% slower, are you sure this isn't just noise?
If not noise, it may be some alignment effect. Hard to say.
I couldn't get it under 1s in a dozen runs, so I'm assuming some small alignment effect. Why we get the extra test in the outer loop, though, I'm not sure. That's new too, I think -- at least I've not seen that pattern before.

-- Don