
2009/2/28 Don Stewart
So now, since we've gone to such effort to produce a tiny loop like, this, can't we unroll it just a little? Sadly, my attempts to get GCC to trigger its loop unroller on this guy haven't succeeded. -funroll-loops and -funroll-all-loops doesn't touch it,
Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?
For a couple of weeks, I have had a working solution for the concatMap problem using a sort of loop unrolling. I have tweaked the approach slightly to also unroll the worker loop to get the results you desire. You can check out the (very rough) code with: git clone http://www.cl.cam.ac.uk/~mb566/git/concatmap/.git/ $EDITOR concatmap/CallUnrollConcatMap.hs Apologies if the code is somewhat cryptic, but you should be able to get the general idea. A sneak preview is in order. The following Core: """ Rec { $wf1_s1bU [ALWAYS LoopBreaker Nothing] :: GHC.Prim.Int# -> GHC.Prim.Int# [Arity 1 Str: DmdType L] $wf1_s1bU = \ (ww_s1bO :: GHC.Prim.Int#) -> case GHC.Prim.<=# ww_s1bO 100000000 of wild_B1 [ALWAYS Dead Just A] { GHC.Bool.False -> 0; GHC.Bool.True -> let { x_XMS [ALWAYS Just L] :: GHC.Prim.Int# [Str: DmdType] x_XMS = GHC.Prim.+# ww_s1bO 1 } in case GHC.Prim.<=# x_XMS 100000000 of wild_Xx [ALWAYS Dead Just A] { GHC.Bool.False -> GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2; GHC.Bool.True -> let { x_XMX [ALWAYS Just L] :: GHC.Prim.Int# [Str: DmdType] x_XMX = GHC.Prim.+# x_XMS 1 } in case GHC.Prim.<=# x_XMX 100000000 of wild_XE [ALWAYS Dead Just A] { GHC.Bool.False -> GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2) (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2); GHC.Bool.True -> let { x_XOf [ALWAYS Just L] :: GHC.Prim.Int# [Str: DmdType] x_XOf = GHC.Prim.+# x_XMX 1 } in case GHC.Prim.<=# x_XOf 100000000 of wild_XM [ALWAYS Dead Just A] { GHC.Bool.False -> GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2) (GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2) (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMX 2) 2)); GHC.Bool.True -> case $wf1_s1bU (GHC.Prim.+# x_XOf 1) of ww_s1bS { __DEFAULT -> GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2) (GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2) (GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMX 2) 2) (GHC.Prim.+# (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XOf 2) 2) ww_s1bS))) } } } } } end Rec } """ Is generated by this program: """ result = sumS . mapS (*2) . mapS (`shiftL` 2) $ enumFromToS 0 100000000 """ Of course, my approach is far from perfect: * Unrolling ALWAYS happens, and to a fixed depth * RULEs aren't very good at exploiting properties of arithmetic, as Claus has pointed out * concatMap fuses with my library but has lingering issues with allocation if join points don't get inlined and has some strictness problems too (to see this in action, try compliing the program "sumS $ mapS (+10) $ concatMapS (\x -> enumFromToS x 20) $ enumFromToS 1 10" from the same file). It also is only permitted up to a fixed depth as defined by the level of unrolling specified in the "spec" combinator. But it does get your unrolling with TODAYs GHC, transparently to the user of the uvector library. I am currently looking at other, smarter, ways that GHC can optimize loops as part of my research - so with luck this sort of manual unrolling hackery will become less relevant in the future. All the best, Max