
Hey guys, We have nice fusion frameworks now. E.g. stream fusion on uvector, http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector Takes something like this: import Data.Array.Vector import Data.Bits main = print . productU . mapU (*2) . mapU (`shiftL` 2) $ replicateU (100000000 :: Int) (5::Int) and turns it into a loop like this: $wfold :: Int# -> Int# -> Int# $wfold = \ (ww_sWX :: Int#) (ww1_sX1 :: Int#) -> case ww1_sX1 of wild_B1 { __DEFAULT -> $wfold (*# ww_sWX 40) (+# wild_B1 1); 100000000 -> ww_sWX } Now, that's fine in my book. Going via -fasm, we get: Main_zdwfold_info: .LcYt: movq %rdi,%rax cmpq $100000000,%rax jne .LcYx movq %rsi,%rbx jmp *(%rbp) .LcYx: incq %rax imulq $40,%rsi movq %rax,%rdi jmp Main_zdwfold_info Ok: $ time ./product 0 ./product 0.31s user 0.00s system 96% cpu 0.316 total Going via C, however, we get: Main_zdwfold_info: cmpq $100000000, %rdi je .L6 .L2: leaq (%rsi,%rsi,4), %rax leaq 1(%rdi), %rdi leaq 0(,%rax,8), %rsi jmp Main_zdwfold_info Nice! $ time ./product 0 ./product 0.19s user 0.00s system 97% cpu 0.197 total So now, since we've gone to such effort to produce a tiny loop like, this, can't we unroll it just a little? Sadly, my attempts to get GCC to trigger its loop unroller on this guy haven't succeeded. -funroll-loops and -funroll-all-loops doesn't touch it, Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations? -- Don