
So now, since we've gone to such effort to produce a tiny loop like, this, can't we unroll it just a little? it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?
It might be useful to point out that the interaction goes both ways. Not only are fused loops candidates for unrolling, but unrolling can also enable fusion, giving one example of why Core-level unrolling (in addition to backend-level loop restructuring) would be useful. Consider this silly example (with Apply as before, in the rewrite rules thread, just syntactically unrolling the loop, and loop as before, but generalised to arbitrary accumulators, see below): -------------------------------------------------------- {-# LANGUAGE MagicHash #-} {-# LANGUAGE CPP #-} {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE BangPatterns #-} import Data.Array.Vector import Data.Bits import Apply import GHC.Prim import GHC.Base main = print $ loop 1 10000000 body (toU [1,2,3,4,5::Int]) body i arr = mapU (42+) arr -------------------------------------------------------- Here, the refusal to partially unfold recursive definitions means there are no opportunities for fusion, whereas unrolling enables fusion (which wouldn't work if unrolling was done only in the backend, after fusion). -------------------------------------------------------- {-# INLINE loop #-} loop :: Int -> Int -> (Int -> acc -> acc) -> acc -> acc loop i max body acc = loopW i acc where #ifdef N loopW !i !acc | i+N<=max = loopW (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc) #endif loopW !i !acc | i<=max = loopW (i+1) (body i acc) | otherwise = acc -------------------------------------------------------- Compare the versions without and with unrolling, not just for time, but for allocation (+RTS -s). As usual, we'd like to reassociate the sums to enable constant folding, but this rule {-# RULES -- "reassoc" forall a# b# c. ((I# a#) +# ((I# b#) +# c)) = ((I# a#) +# (I# b#)) +# c #-} is rejected. Claus