
claus.reinke:
Concrete examples always help, thanks.
In simple enough situations, one can roll one's own loop unrolling;), somewhat like shown below (worker/wrapper split to bring the function parameter representing the loop body into scope, then template haskell to unroll its applications syntactically, then optimization by transformation to get rid of the extra code). It is all rather more complicated than one would like it to be, what with TH scoping restrictions and all, but perhaps a library of self-unrolling loop combinators along these lines might help, as a workaround until ghc does its own unrolling.
Claus
{-# LANGUAGE TemplateHaskell #-} module Apply where import Language.Haskell.TH.Syntax apply i bound | i
$(apply (i+1) bound) f (f i x) |] | otherwise = [| \f x -> x |] {-# LANGUAGE CPP #-} {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -DN=8 -ddump-splices #-} module Main(main) where import Apply main = print $ loopW 1 (10^9) body 0
{-# INLINE loopW #-} loopW :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int loopW i max body acc = loop i acc where loop :: Int -> Int -> Int loop !i !acc | i+N<=max = loop (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc) {- loop !i !acc | i+8<=max = loop (i+8) ( body (i+7) $ body (i+6) $ body (i+5) $ body (i+4) $ body (i+3) $ body (i+2) $ body (i+1) $ body i acc) -} loop !i !acc | i<=max = loop (i+1) (body i acc) | otherwise = acc
body :: Int -> Int -> Int body !i !acc = i+acc
Great thinking! This is EXTREMELY COOL! Main.hs:15:42-57: Splicing expression let apply = apply $dOrd = GHC.Base.$f1 $dNum = GHC.Num.$f6 $dLift = Language.Haskell.TH.Syntax.$f18 in apply (0 :: Int) 8 ======> \ f[a1KU] x[a1KV] -> \ f[a1KW] x[a1KX] -> \ f[a1KY] x[a1KZ] -> \ f[a1L0] x[a1L1] -> \ f[a1L2] x[a1L3] -> \ f[a1L4] x[a1L5] -> \ f[a1L6] x[a1L7] -> \ f[a1L8] x[a1L9] -> \ f[a1La] x[a1Lb] -> x[a1Lb] f[a1L8] (f[a1L8] 7 x[a1L9]) f[a1L6] (f[a1L6] 6 x[a1L7]) f[a1L4] (f[a1L4] 5 x[a1L5]) f[a1L2] (f[a1L2] 4 x[a1L3]) f[a1L0] (f[a1L0] 3 x[a1L1]) f[a1KY] (f[a1KY] 2 x[a1KZ]) f[a1KW] (f[a1KW] 1 x[a1KX]) f[a1KU] (f[a1KU] 0 x[a1KV]) In the second argument of `loop', namely `($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc)' In the expression: loop (i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc) In the definition of `loop': loop !i !acc | i + 8 <= max = loop (i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc) So, that's the fastest yet: $ time ./Main 500000000500000000 ./Main 0.61s user 0.00s system 98% cpu 0.623 total And within 2x the best GCC was doing, gcc -O3 -funroll-loops 0.318 If we unroll even further... $ ghc -O2 -fvia-C -optc-O3 -D64 Main.hs $ time ./Main 500000000500000000 ./Main 0.08s user 0.00s system 94% cpu 0.088 total Very very nice, Claus! Now I'm wondering if we can do this via rewrite rules.... -- Don