
On Friday 30 May 2008, Duncan Coutts wrote:
> This is for two reasons. One is because your second foldl' is directly
> recursive so does not get inlined. The static argument transformation is
> what you're doing manually to turn the latter into the former. The SAT is
> implemented in GHC 6.9 (though it seems to be having integration problems).
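
To make the quoted description concrete, here is a minimal sketch of the two shapes of foldl' being referred to. The names foldl'Rec and foldl'SAT are made up for illustration; the actual code from the earlier message is not reproduced here.

-- Directly recursive ("the latter" in the quote): f is passed along on
-- every call, the loop does not get inlined at call sites, and so it is
-- never specialised to any particular f.
foldl'Rec :: (b -> a -> b) -> b -> [a] -> b
foldl'Rec f z []     = z
foldl'Rec f z (x:xs) = let z' = f z x in z' `seq` foldl'Rec f z' xs

-- After the static argument transformation ("the former"): the static
-- argument f is captured once by a local worker, and the small
-- non-recursive wrapper can be inlined, letting GHC specialise the loop
-- to the particular f at each call site.
foldl'SAT :: (b -> a -> b) -> b -> [a] -> b
foldl'SAT f = go
  where
    go z []     = z
    go z (x:xs) = let z' = f z x in z' `seq` go z' xs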
Apologies for replying to this thread after around a month, but I've been
looking at the performance of code with and without this transformation a bit
lately, and I'm curious what is planned for GHC 6.10 in this regard.
Specifically, will GHC simply always perform the static argument transform,
or will it have some kind of heuristic to decide when it's useful?

It seems, according to some tests I've done, that it's not always a win, and
is sometimes a loss (and not just with regard to code duplication). I've been
hard pressed to come up with a definite rule for when it's a benefit, other
than simply testing and seeing what works.

For instance, consider some code from a recent benchmark I posted (I'm
running all of this under 6.8.2):

bench :: Int -> Int -> ST s ()
bench (I# k) (I# n) = ST go
  where
    -- allocate room for n Ints, run fill (not shown here),
    -- then do k passes of the in-place reverse below
    go s = case sizeOf (0 :: Int)        of { I# w ->
           case newByteArray# (n *# w) s of { (# s, arr #) ->
           case fill arr n s             of { s ->
           go' arr k s } } }

    go' arr 0# s = (# s, () #)
    go' arr k  s = case reverse arr 0# (n -# 1#) s of { s ->
                   go' arr (k -# 1#) s }

reverse :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
reverse arr i j s
    | i <# j    = case readIntArray# arr i s     of { (# s, ei #) ->
                  case readIntArray# arr j s     of { (# s, ej #) ->
                  case writeIntArray# arr j ei s of { s ->
                  case writeIntArray# arr i ej s of { s ->
                  reverse arr (i +# 1#) (j -# 1#) s } } } }
    | otherwise = s

The above is without the SAT on 'arr', obviously. This code runs fast. Doing
the SAT on reverse, but not on go', causes lots of heap allocation (for high
iteration counts), slowing things way down. Doing it on go' but not reverse
yields results about the same as not doing it at all. Doing it on both go'
*and* reverse eliminates the heap allocation of the first case, *but* it
results in overall slower code (which may, in fact, invalidate my bug report
about MutableByteArray# performance; I'll have to check. Just goes to show
that even the most straightforward benchmark may not be measuring what you
intend to measure).

Another example is this function from the MutableByteArray# version of the
fannkuch benchmark:

shift :: MutableByteArray# s -> Int# -> State# s -> State# s
shift arr r s = case readIntArray# arr 0# s of { (# s, p0 #) ->
                case go arr 0# r s          of { s ->
                writeIntArray# arr r p0 s } }
  where
    go arr i r s
        | i <# r    = case readIntArray# arr (i +# 1#) s of { (# s, e #) ->
                      case writeIntArray# arr i e s      of { s ->
                      go arr (i +# 1#) r s } }
        | otherwise = s
{-# INLINE shift #-}

Now, obviously, arr and r are static arguments for go (already in scope, no
less), so one would likely write go without them, as in the sketch appended
below (it's certainly what I did initially). However, eliminating either one
of them causes heap allocation and, thus, slowdowns (eliminating both is even
worse).

So, I suppose my question is: will GHC be better at this than I am? :) Will
it know that performing the transform on shift above would cause extra heap
allocation and slowdowns? Will it know that transforming reverse (and go') is
slower than not doing so? (I suppose it may be suspicious that this is all
low-level code, but I've tried fiddling with higher-level code that I know
gets compiled to this sort of code, and the above results seemed to hold.)
I'm also not yet skilled enough at reading assembly to figure out what
exactly is causing all this. Are values getting kicked out of registers or
some such?

Cheers,
-- Dan
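
For completeness, the SAT'd form of shift's inner loop referred to above (arr
and r taken from the enclosing scope rather than passed on each call, i.e. the
variant that reportedly causes heap allocation) would look roughly like this.
This is a sketch reconstructed from the code in the message, not code taken
from the benchmark itself.

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts  -- MutableByteArray#, State#, readIntArray#, writeIntArray#, (+#), (<#)

shift :: MutableByteArray# s -> Int# -> State# s -> State# s
shift arr r s = case readIntArray# arr 0# s of { (# s, p0 #) ->
                case go 0# s                of { s ->
                writeIntArray# arr r p0 s } }
  where
    -- arr and r are free variables of go instead of being passed on every
    -- iteration; (<#) returned Bool in the GHC 6.8 era, so it can sit
    -- directly in a guard.
    go i s
        | i <# r    = case readIntArray# arr (i +# 1#) s of { (# s, e #) ->
                      case writeIntArray# arr i e s      of { s ->
                      go (i +# 1#) s } }
        | otherwise = s
{-# INLINE shift #-}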