
I discovered something today I didn't know: gcc -O2 can optimise out the
computed jumps GHC produces in tight loops.

Consider this program,

    import Data.Array.Vector
    import Data.Bits

    main = print . sumU
                 . mapU (*2)
                 . mapU (`shiftL` 2)
                 $ replicateU (100000000 :: Int) (5::Int)

which yields this Core:

    $wfold :: Int# -> Int# -> Int#
    $wfold =
      \ (ww_sMp :: Int#) (ww1_sMt :: Int#) ->
        case ww1_sMt of wild_X10 {
          __DEFAULT -> $wfold (+# ww_sMp 40) (+# wild_X10 1);
          100000000 -> ww_sMp
        }

And with -O2 -fasm:

    Main_zdwfold_info:
        movq %rdi,%rax
        cmpq $100000000,%rax
        jne .LcOk
        movq %rsi,%rbx
        jmp *(%rbp)
    .LcOk:
        incq %rax
        addq $40,%rsi
        movq %rax,%rdi
        jmp Main_zdwfold_info

    $ time ./sum
    4000000000
    ./sum  0.19s user 0.00s system 101% cpu 0.188 total

With -O2 -fvia-C -optc-O:

    Main_zdwfold_info:
        cmpq $100000000, %rdi
        jne .L3
        movq %rsi, %rbx
        movq (%rbp), %rax
    .L4:
        jmp *%rax
    .L3:
        addq $40, %rsi
        leaq 1(%rdi), %rdi
        movl $Main_zdwfold_info, %eax
        jmp .L4

    $ time ./sum
    4000000000
    ./sum  0.34s user 0.00s system 94% cpu 0.361 total

Hmm. That movl; jmp .L4; jmp *%rax sequence looks sucky, and performance
got worse: the loop's back edge is now an indirect jump through %rax.

And now with -O2 -fvia-C -optc-O2:

    Main_zdwfold_info:
        cmpq $100000000, %rdi
        je .L5
    .L3:
        addq $40, %rsi
        leaq 1(%rdi), %rdi
        jmp Main_zdwfold_info

    $ time ./sum
    4000000000
    ./sum  0.11s user 0.02s system 106% cpu 0.122 total

Woot, back in business.

-- Don
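
P.S. For anyone wondering where the 40 in the Core comes from: each element is 5, shifted left by 2 gives 20, times 2 gives 40, and the fused loop just adds that constant 100000000 times. A rough hand-written equivalent of $wfold might look like the sketch below. The names fusedSum and go are made up for illustration, it assumes a 64-bit Int, and it is not what the library or GHC actually produces, only the same loop written out by hand.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits (shiftL)

-- Hypothetical hand-written version of the fused loop: one strict
-- accumulator and one strict counter, no intermediate arrays.
fusedSum :: Int -> Int -> Int
fusedSum n x = go 0 0
  where
    -- Per-element contribution: (x `shiftL` 2) * 2, i.e. 40 when x = 5.
    step = (x `shiftL` 2) * 2

    go !acc !i
      | i == n    = acc                     -- counter reached n: return the sum
      | otherwise = go (acc + step) (i + 1) -- add the constant, bump the counter

main :: IO ()
main = print (fusedSum 100000000 5)
```

The two arguments of go play the roles of ww_sMp (the accumulator) and ww1_sMt (the counter) in the Core.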