Recursive functions and constant parameter closures (inlining/strictness analyzer question)

    main = print $ foldl' (+) 0 [1..]

with

    foldl' f y xs = foldl' y xs
      where foldl' y []     = y
            foldl' y (x:xs) = foldl' (f y x) xs

runs indefinitely with very little memory consumption, while

    foldl' f y []     = y
    foldl' f y (x:xs) = foldl' f (f y x) xs

rapidly consumes all the machine's memory and dies.

Running ghc with -ddump-stranal shows the outer foldl' of the first gets inlined into main as a call to the following specialized version of the inner foldl':

    foldl'_sSY [ALWAYS LoopBreaker Nothing]
      :: GHC.Num.Integer -> [GHC.Num.Integer] -> GHC.Num.Integer
    [Arity 2 Str: DmdType SS]
    foldl'_sSY =
      \ (y_aj7 [ALWAYS Just S] :: GHC.Num.Integer)
        (ds_dQl [ALWAYS Just S] :: [GHC.Num.Integer]) ->
        case ds_dQl of wild_XH [ALWAYS Just A] {
          [] -> y_aj7;
          : x_aja [ALWAYS Just S] xs_ajb [ALWAYS Just S] ->
            foldl'_sSY (GHC.Num.plusInteger y_aj7 x_aja) xs_ajb
        }

Doing the same with the second foldl' shows it remains non-inlined and fully polymorphic:

    foldl'_sQN [ALWAYS LoopBreaker Nothing]
      :: forall t_auW t_av2.
         (t_av2 -> t_auW -> t_av2) -> t_av2 -> [t_auW] -> t_av2
    [Arity 3 Str: DmdType LLS]
    foldl'_sQN =
      \ (@ t_auW) (@ t_av2)
        (f_aj0 [ALWAYS Just L] :: t_av2 -> t_auW -> t_av2)
        (y_aj1 [ALWAYS Just L] :: t_av2)
        (ds_dQg [ALWAYS Just S] :: [t_auW]) ->
        case ds_dQg of wild_XK [ALWAYS Just A] {
          [] -> y_aj1;
          : x_aj5 [ALWAYS Just L] xs_aj6 [ALWAYS Just S] ->
            foldl'_sQN @ t_auW @ t_av2 f_aj0 (f_aj0 y_aj1 x_aj5) xs_aj6
        }

Forcing it inline with {-# INLINE foldl' #-} just specialized it:

    foldl'_sSS [ALWAYS LoopBreaker Nothing]
      :: (GHC.Num.Integer -> GHC.Num.Integer -> GHC.Num.Integer)
         -> GHC.Num.Integer -> [GHC.Num.Integer] -> GHC.Num.Integer
    [Arity 3 Str: DmdType LLS]
    foldl'_sSS =
      \ (f_aj0 [ALWAYS Just L] :: GHC.Num.Integer -> GHC.Num.Integer -> GHC.Num.Integer)
        (y_aj1 [ALWAYS Just L] :: GHC.Num.Integer)
        (ds_dQg [ALWAYS Just S] :: [GHC.Num.Integer]) ->
        case ds_dQg of wild_XI [ALWAYS Just A] {
          [] -> y_aj1;
          : x_aj5 [ALWAYS Just L] xs_aj6 [ALWAYS Just S] ->
            foldl'_sSS f_aj0 (f_aj0 y_aj1 x_aj5) xs_aj6
        }

I thought this was interesting. Is it to be expected? Am I right in interpreting this to mean it was just too much for the strictness analyzer? I believe the first ultimately produces significantly superior code, so should one always write one's recursive functions such that the constant (functional?) parameters are first captured in a closure? In that vein, would it be useful if the compiler automatically transformed the second into the first?

Thanks! -Tyson

On Thursday 29 May 2008, Tyson Whitehead wrote:
I thought this was interesting. Is it to be expected? Am I right in interpreting this to mean it was just too much for the strictness analyzer? I believe the first ultimately produces significantly superior code, so should one always write one's recursive functions such that the constant (functional?) parameters are first captured in a closure?
In that vein, would it be useful if the compiler automatically transformed the second into the first?
I've had similar experiences. I've been working on sorting the mutable arrays in the new uvector library, and on at least one occasion, changing from code like:

    foo f x y = ... foo f x' y'

to code like:

    foo f = loop
      where loop x y = ... loop x' y'

resulted in 50% faster code (that is, 10 seconds -> 5 seconds). It doesn't always make such a dramatic difference, but it seems significant enough to have gotten me in the habit of writing such code by default (for the sorting, at least, where I'm trying to squeeze out as much performance as possible). Unfortunately, I think the resulting code is somewhat ugly, as I occasionally end up with code like:

    foo a b = loop1 c
      where loop1 c d = loop2 e
              where loop2 e = ...

which may or may not actually be faster with more or less nested loops, but trying each possibility is a bit of a pain. If the compiler could automatically turn the more straightforward definition into the fast one, it'd be quite a boon, but I haven't thought very hard about whether there are any pitfalls to doing so.

-- Dan
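A concrete, compilable instance of the pattern Dan sketches above; the function iterN, its primed variant, and the counting loop are invented for illustration:

    -- Direct recursion: the static argument 'f' is passed through
    -- every iteration.
    iterN :: (Int -> Int) -> Int -> Int -> Int
    iterN _ 0 x = x
    iterN f n x = iterN f (n - 1) (f x)

    -- After the manual transformation: 'f' is captured once in the
    -- closure of 'loop', which recurses only on the arguments that
    -- actually change.
    iterN' :: (Int -> Int) -> Int -> Int -> Int
    iterN' f = loop
      where
        loop 0 x = x
        loop n x = loop (n - 1) (f x)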

On Thu, 2008-05-29 at 23:48 -0400, Tyson Whitehead wrote:
main = print $ foldl' (+) 0 [1..]
with
foldl' f y xs = foldl' y xs
  where foldl' y []     = y
        foldl' y (x:xs) = foldl' (f y x) xs
runs indefinitely with very little memory consumption, while
foldl' f y []     = y
foldl' f y (x:xs) = foldl' f (f y x) xs
rapidly consumes all the machine's memory and dies.
This is for two reasons. One is because your second foldl' is directly recursive, so it does not get inlined. The static argument transformation is what you're doing manually to turn the latter into the former. The SAT is implemented in ghc 6.9 (though it seems to be having integration problems).

The reason the second version consumes all the memory is because it is not strict in the accumulator. You're misleading yourself by calling it foldl'; you've actually written the standard foldl.

The reason the first version does not consume all the memory is because once foldl' got inlined there is enough information for the strictness analyser to see that it is indeed all strict (for the particular parameters, not for all possible parameters, as is the case with the real foldl').

Duncan
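For reference, the real foldl' that Duncan contrasts with Tyson's code differs from the second version above only in forcing the accumulator before recursing; a minimal sketch of the standard definition:

    foldl' :: (b -> a -> b) -> b -> [a] -> b
    foldl' _ y []     = y
    foldl' f y (x:xs) = let y' = f y x
                        in y' `seq` foldl' f y' xs
                        -- seq forces the accumulator on every step,
                        -- so no chain of thunks builds up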

On Friday 30 May 2008, Duncan Coutts wrote:
This is for two reasons. One is because your second foldl' is directly recursive, so it does not get inlined. The static argument transformation is what you're doing manually to turn the latter into the former. The SAT is implemented in ghc 6.9 (though it seems to be having integration problems).
Apologies for replying to this thread after around a month, but I've been looking at performance of code with and without this transformation a bit lately, and I'm rather curious what is planned to go into GHC 6.10 with regard to it.

Specifically, will GHC simply always perform the static argument transform, or will it have some kind of heuristic to decide when it's useful? It seems, according to some tests I've done, that it's not always a win, and is sometimes a loss (and not just with regard to code duplication). I've been hard pressed to come up with a definite rule for when it's a benefit, other than simply testing and seeing what works.

For instance, consider some code from a recent benchmark I posted (I'm running this all under 6.8.2):

    bench :: Int -> Int -> ST s ()
    bench (I# k) (I# n) = ST go
      where
        go s = case sizeOf (0 :: Int) of { I# w ->
               case newByteArray# (n *# w) s of { (# s, arr #) ->
               case fill arr n s of { s ->
               go' arr k s } } }
        go' arr 0# s = (# s, () #)
        go' arr k  s = case reverse arr 0# (n -# 1#) s of { s ->
                       go' arr (k -# 1#) s }

    reverse :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
    reverse arr i j s
      | i <# j    = case readIntArray# arr i s of { (# s, ei #) ->
                    case readIntArray# arr j s of { (# s, ej #) ->
                    case writeIntArray# arr j ei s of { s ->
                    case writeIntArray# arr i ej s of { s ->
                    reverse arr (i +# 1#) (j -# 1#) s } } } }
      | otherwise = s

The above is without the SAT on 'arr', obviously. This code runs fast. Doing SAT on reverse, but not on go', causes lots of heap allocation (for high iteration counts), slowing things way down. Doing it on go' but not reverse yields results about the same as not doing it at all. Doing it on both go' *and* reverse eliminates the heap allocation of the first case, *but* it results in overall slower code (which may, in fact, invalidate my bug about MBA# performance; I'll have to check. Just goes to show, even the most straightforward benchmark may not be measuring what you intend to measure).

Another example is this function from the MBA# version of the fannkuch benchmark:

    shift :: MutableByteArray# s -> Int# -> State# s -> State# s
    shift arr r s = case readIntArray# arr 0# s of { (# s, p0 #) ->
                    case go arr 0# r s of { s ->
                    writeIntArray# arr r p0 s } }
      where
        go arr i r s
          | i <# r    = case readIntArray# arr (i +# 1#) s of { (# s, e #) ->
                        case writeIntArray# arr i e s of { s ->
                        go arr (i +# 1#) r s } }
          | otherwise = s
    {-# INLINE shift #-}

Now, obviously, arr and r are static arguments for go (already in scope, no less), so one would likely write go without them (it's certainly what I did initially). However, eliminating either one of them causes heap allocation and, thus, slowdowns (eliminating both is even worse).

So, I suppose my question is: will GHC be better at this than I am? :) Will it know that performing the transform on shift above would cause extra heap allocation/slowdowns? Will it know that transforming reverse (and go') is slower than not doing so (I suppose it may be suspicious that this is all low-level code, but I've tried fiddling with higher-level code that I know gets compiled to this sort of code, and the above results seemed to hold)? I'm also not yet skilled enough at reading assembly to figure out what exactly is causing all this. Are values getting kicked out of registers or some such?

Cheers, -- Dan

Hi Dan, I've only got time for a quick reply now; I'll see if I can take a more substantive look at your examples next week.
Specifically, will GHC simply always perform the static argument transform, or will it have some kind of heuristic to decide when it's useful? It seems, according to some tests I've done, that it's not always a win, and is sometimes a loss (and not just with regard to code duplication). I've been hard pressed to come up with a definite rule for when it's a benefit, other than simply testing and seeing what works.
Yes. Basically the rule is that if the recursive function under consideration makes few dynamic iterations then the cost of closure allocation in the SATed version outweighs the benefits of reduced copying etc. Clearly it is quite difficult for the compiler to tell if this is likely to be the case! This has always been a well documented problem with SAT from Santos' thesis where it was introduced and is why it went unimplemented in GHC for so long.
(elided examples, no time to really get into them ATM but they seem to demonstrate this principle).
So, I suppose my question is: will GHC be better at this than I am? :) Will it know that performing the transform on shift above would cause extra heap allocation/slowdowns? Will it know that transforming reverse (and go') is slower than not doing so (I suppose it may be suspicious that this is all low-level code, but I've tried fiddling with higher-level code that I know gets compiled to this sort of code, and the above results seemed to hold)?
In short: no, it will not be better at it than you! GHC performs the SAT iff the recursive call is direct (i.e. not on mutually recursive groups of functions), to reduce code expansion, and if the number of static arguments is at least 2. The reason for the last criterion is that moving a parameter to a closure implicitly adds an argument to all the functions that make reference to that variable, the implicit argument being the pointer to the closure. Eliminating an actual function argument just to add a layer of indirection via an allocated closure would be fairly pointless!

There is no cleverer criterion involved, and this one was chosen just because it led to good results in the nofib benchmark suite that we consider to comprise "typical" Haskell programs for the purposes of evaluating optimisations. There will certainly be cases where it slows down the program under consideration, though the same can be said for almost any compiler optimisation ("no free lunch").

Of course, if you have any suggestions for good heuristics based on your benchmarking experience then we would like to hear them! There was some discussion of this in the original ticket, http://hackage.haskell.org/trac/ghc/ticket/888, but when implementing SAT I tried out the suggestions made there without good results (though to be perfectly honest I didn't really understand the motivations behind some of the suggestions made).

Cheers, Max
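To make Max's last criterion concrete with an invented example: with a single static argument, the SAT merely trades an explicit parameter for an implicit one, because the local loop still has to reach the variable through its closure.

    -- Direct recursion: 'f' is passed explicitly on each call.
    mapDirect :: (a -> b) -> [a] -> [b]
    mapDirect _ []     = []
    mapDirect f (x:xs) = f x : mapDirect f xs

    -- SATed: 'f' no longer appears in the argument list, but 'go' now
    -- carries a pointer to the closure capturing it, so nothing is
    -- saved unless mapSAT itself gets inlined and 'f' becomes known.
    mapSAT :: (a -> b) -> [a] -> [b]
    mapSAT f = go
      where
        go []     = []
        go (x:xs) = f x : go xs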

On Friday 20 June 2008, Max Bolingbroke wrote:
Of course, if you have any suggestions for good heuristics based on your benchmarking experience then we would like to hear them! There was some discussion of this in the original ticket, http://hackage.haskell.org/trac/ghc/ticket/888, but when implementing SAT I tried out the suggestions made there without good results (though to be perfectly honest I didn't really understand the motivations behind some of the suggestions made).
Well, as I said, I was apparently being over-aggressive in my manual application of the transform, so I'm not yet the best person to ask, I suppose. :)

However, if I had to pick something out of the air, I'd say this: always do SAT when the argument in question is a function. This is actually the reason I started doing it in the first place. I was working on sorting uvectors, and had what I thought was pretty good performance, and then I did SAT on the comparison function, and suddenly my code was running in half the time. Going back to my sorting (testing on introsort), SAT for the array doesn't seem to do much, but undoing the SAT for the comparison function causes heap allocation to balloon, and everything gets at least a factor of 2 slower.

This also seems to match the other manual uses of SAT I've seen in the past. For instance, in the GHC libraries you'll see:

    foldr f = go
      where go z (x:xs) = f x (go z xs)
            go z []     = z

because it's significantly faster, while earlier in this thread, SPJ said that:

    xs ++ ys = let go []     = ys
                   go (z:zs) = z : go zs
               in go xs

isn't much of a win. I don't have much data currently, but the places I *know* it's been a win have been SAT on functions, while the places I *know* it's been a loss have been SAT on unboxed data. Whether this is because SAT allows more opportunity for inlining/specialization on functions, or because copying functions is inherently more expensive, though, I don't know. I'll keep an eye open from now on, though.

(On an additional tentative note, undoing SAT in the first example of my last mail seemed to eliminate the difference between the native code generator and -fvia-c -optc-O3. So whatever SAT was doing to make the code slower (no extra heap allocation, just slower code), GCC may have been fixing. My off-the-top-of-the-head guess was that passing the arguments along kept them in registers, and maybe GCC was putting the SATed arguments back into registers, but, as I said, I don't currently have the expertise to verify that.)

Cheers, -- Dan
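A hedged sketch of the kind of win Dan reports, with insertion sort standing in for his introsort (the names are invented): once the outer function is inlined at a call site, the local loop can be specialised to the now-known comparison function.

    -- Comparison-parameterised insertion sort with 'cmp' SATed into
    -- the local 'ins' loop; the INLINE pragma exposes the loop for
    -- per-call-site specialisation.
    insertSortBy :: (a -> a -> Ordering) -> [a] -> [a]
    insertSortBy cmp = foldr ins []
      where
        ins x []     = [x]
        ins x (y:ys) = case cmp x y of
                         GT -> y : ins x ys
                         _  -> x : y : ys
    {-# INLINE insertSortBy #-}

For instance, insertSortBy compare [3,1,2] evaluates to [1,2,3], with compare inlined into the specialised loop.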

| However, if I had to pick something out of the air, I'd say this: always do
| SAT when the argument in question is a function.

Yes, that might well be a good heuristic to try, if you are interested to pursue this, Max. Making the function static means that it may be inlined, and that can make a tremendous difference by specializing the loop for that particular function. But that in turn only tends to happen if the enclosing function is inlined. Consider foldr: the real payoff comes when foldr is inlined, so that the function at the call site becomes visible.

Simon

2008/6/22 Simon Peyton-Jones
| However, if I had to pick something out of the air, I'd say this: always do
| SAT when the argument in question is a function.
Yes, that might well be a good heuristic to try, if you are interested to pursue this, Max. Making the function static means that it may be inlined, and that can make a tremendous difference by specializing the loop for that particular function. But that in turn only tends to happen if the enclosing function is inlined. Consider foldr: the real payoff comes when foldr is inlined, so that the function at the call site becomes visible.
I spent quite a bit of time today playing with various heuristics for the SAT, including the one suggested above (which I agree sounded promising), but didn't make much progress. The best results were obtained by applying the SAT when:

1) At least two arguments are static, whatever their types
2) Or at least one of the static arguments is a function type and the SATed function's right-hand side is small enough to inline
3) As long as the function is not likely to be compiled as a tail call (because we effectively get staticness there by carrying arguments on the stack)

If you just use criterion 2 then you get some bad worst cases in runtime (e.g. Para.dropTail and Constraints.initTree) because the additional inlining opportunity was not utilised enough to overcome the heap allocation cost. If you don't SAT functions that have only non-function static arguments then some other worst cases crop up (e.g. Rsa.power) where we actually quite want to do it. However, this represents only a 0.3% decrease in allocations and runtime from the previous heuristic for nofib!

Furthermore, it was necessary to run the SAT twice in the compiler pipeline to catch both those functions that only become tail calls after e.g. the let-to-case transform AND to identify the extra common subexpressions produced by SAT early enough that they could have good use made of them by later compiler passes.

I'm not particularly happy with the tail call criterion because it's quite fragile to make inferences about how the function will be treated by codegen as early in the pipeline as SAT is running. So in summary I don't think this change is worth integrating.

Cheers, Max

there's no chance for the lower level, near code generation, to reverse-SAT to eliminate the heap usage? (which would obviously be a different optimization that might be useful in other ways too, if it could be made to work) (did someone say that -fvia-C with gcc -O3 managed to do that sometimes?)

just an underinformed thought, -Isaac

On Monday 23 June 2008, Isaac Dupree wrote:
there's no chance for the lower-level near code generation to reverse-SAT to eliminate the heap usage? (which would obviously be a different optimization that might be useful in other ways too, if it could be made to work) (did someone say that -fvia-C with gcc -O3 managed to do that sometimes?)
I said something similar, but I don't actually know the details. What I have is a SAT and non-SAT version of a benchmark:

---- SAT ----

    bench :: Int -> Int -> ST s ()
    bench (I# k) (I# n) = ST go
      where
        go s = case sizeOf (0 :: Int) of { I# w ->
               case newByteArray# (n *# w) s of { (# s, arr #) ->
               case fill arr n s of { s ->
               go' arr k s } } }
        go' arr = go''
          where
            go'' 0# s = (# s, () #)
            go'' k  s = case reverse arr 0# (n -# 1#) s of { s ->
                        go'' (k -# 1#) s }

    reverse :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
    reverse arr = go
      where
        go i j s
          | i <# j    = case readIntArray# arr i s of { (# s, ei #) ->
                        case readIntArray# arr j s of { (# s, ej #) ->
                        case writeIntArray# arr j ei s of { s ->
                        case writeIntArray# arr i ej s of { s ->
                        go (i +# 1#) (j -# 1#) s } } } }
          | otherwise = s

---- non-SAT ----

    bench :: Int -> Int -> ST s ()
    bench (I# k) (I# n) = ST go
      where
        go s = case sizeOf (0 :: Int) of { I# w ->
               case newByteArray# (n *# w) s of { (# s, arr #) ->
               case fill arr n s of { s ->
               go' arr k s } } }
        go' arr 0# s = (# s, () #)
        go' arr k  s = case reverse arr 0# (n -# 1#) s of { s ->
                       go' arr (k -# 1#) s }

    reverse :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
    reverse arr i j s
      | i <# j    = case readIntArray# arr i s of { (# s, ei #) ->
                    case readIntArray# arr j s of { (# s, ej #) ->
                    case writeIntArray# arr j ei s of { s ->
                    case writeIntArray# arr i ej s of { s ->
                    reverse arr (i +# 1#) (j -# 1#) s } } } }
      | otherwise = s

In the SAT version, the native code generator is noticeably slower than -fvia-c -optc-O3, so the latter is doing something right. In the non-SAT version, -fvia-c is somewhat faster than the SAT version, but the NCG is just as fast. I don't know what's going on on the C end in the first case that makes it better, though (heap allocation is the same (low) in all cases; via-c doesn't fix large heap allocation when it happens in my experience).

As an aside, I finally got around to compiling 6.9 from a day or so ago, and noticed that one of the functions I'd mentioned earlier does get SATed to the bad version (which causes lots of heap allocation in the program) because it has two static arguments. However, marking it NOINLINE seemed to keep it from having the negative effects. Does that pragma keep SAT from doing its thing, and if so, will that continue to be viable in cases where we decide we know better than GHC?

Cheers, -- Dan

2008/6/23 Dan Doel
On Monday 23 June 2008, Isaac Dupree wrote:
there's no chance for the lower-level near code generation to reverse-SAT to eliminate the heap usage?
It might be possible to do this by utilising the rules system. It's something I had briefly considered but not really looked into. As there seems to be a lot of interest in this topic I'll probably devote some more time to the issue tomorrow and see if we can't get SAT working really well.
(did someone say that -fvia-C with gcc -O3 managed to do that sometimes?)
I said something similar, but I don't actually know the details. What I have is a SAT and non-SAT version of a benchmark
(snip)
In the SAT version, the native code generator is noticeably slower than -fvia-c -optc-O3, so the latter is doing something right. In the non-SAT version, -fvia-c is somewhat faster than the SAT version, but the NCG is just as fast. I don't know what's going on on the C end in the first case that makes it better, though (heap allocation is the same (low) in all cases; via-c doesn't fix large heap allocation when it happens in my experience).
Having a compact benchmark is great; I'll take a look at what's going on tomorrow, though it will involve my plumbing the as-yet unknown depths of the code generation system :-)
As an aside, I finally got around to compiling 6.9 from a day or so ago, and noticed that one of the functions I'd mentioned earlier does get SATed to the bad version (which causes lots of heap allocation in the program) because it has two static arguments. However, marking it NOINLINE seemed to keep it from having the negative effects. Does that pragma keep SAT from doing its thing, and if so, will that continue to be viable in cases where we decide we know better than GHC?
I assume you mean "shift", as all the others have only one static argument. Personally I find the behaviour of not SATing if something is NOINLINE surprising, and I suspect something else is at work. Is it possible for you to send me a self-contained example of the problematic function that demonstrates this off-list that I can play with myself? It's worth noting that if you think SAT is a bad idea you can always resort to -fno-static-argument-transformation (IIRC) in an OPTIONS_GHC pragma.

Incidentally, I think the introduction of some sort of tail call detection in the SAT would resolve the problems with all the examples you've presented so far, since they are all straight loops. Annoyingly this is not a clear win, because I think doing this in general will impede loop invariant lifting. E.g. consider this tail recursive function:

    foo x y z = let invariant = x * y
                in foo x y (z + 1)

This is static in x and y so we might like to SAT it:

    foo x y z = let foo' z = (let invariant = x * y in foo' (z + 1))
                in foo' z

This lets us take out the invariant in a later pass (CSE):

    foo x y z = let invariant = x * y
                in let foo' z = foo' (z + 1)
                   in foo' z

I don't >think< GHC is currently smart enough to detect this without SAT. A smart optimization might try and evaluate the relative cost of the SAT (heap allocation) and not doing the invariant lifting (redundant computation) and make some kind of decision about which is the lesser evil. There may even be a third option which may avoid closure creation:

    foo x y z = let foo' x y z invariant = (foo' x y (z + 1) invariant)
                in foo' x y z (x * y)

(I think this is right). It corresponds operationally to computing the invariant once at the start of a loop and then carrying it on the stack rather than via the closure as with the option above, avoiding touching the heap. I'm uncertain about how different this really is, though.

Anyway, let's see what further investigations bring tomorrow,

Cheers, Max

2008/6/23 Dan Doel
On Monday 23 June 2008, Isaac Dupree wrote:
there's no chance for the lower-level near code generation to reverse-SAT to eliminate the heap usage? (which would obviously be a different optimization that might be useful in other ways too, if it could be made to work) (did someone say that -fvia-C with gcc -O3 managed to do that sometimes?)
I said something similar, but I don't actually know the details. What I have is a SAT and non-SAT version of a benchmark:
In the SAT version, the native code generator is noticeably slower than -fvia-c -optc-O3, so the latter is doing something right. In the non-SAT version, -fvia-c is somewhat faster than the SAT version, but the NCG is just as fast. I don't know what's going on on the C end in the first case that makes it better, though (heap allocation is the same (low) in all cases; via-c doesn't fix large heap allocation when it happens in my experience).
I've tested this on my own machine (i386-apple-darwin) and I can't replicate it. For me (with 6.8.2), the SATed version is faster by almost 40%, and there is essentially no difference between the NCG and GCC (GCC is maybe a fraction faster). In any event GHC will not automatically SAT this function because it has only one static argument. Indeed, I'm not quite sure why it should be faster with SAT at all, and my assembly-fu is still too weak to be able to work it out from the code generator output.
As an aside, I finally got around to compiling 6.9 from a day or so ago, and noticed that one of the functions I'd mentioned earlier does get SATed to the bad version (which causes lots of heap allocation in the program) because it has two static arguments.
I devoted yesterday entirely to SAT and came up with some refinements to my earlier heuristics which shave a couple of percent of allocations and runtime off of nofib compared with the existing ones, and also resolve this problem.

I found that my tail call detection from the day before was a bit broken. Now that it is fixed, I use it to avoid SATing functions which make tail calls, but with the important wrinkle that if the function has static function parameters we should SAT it anyway, because this lets us do much more inlining and is a bigger win than avoiding allocation. This criterion avoids the heap allocation you are experiencing in your particular example.

I've also added rule generation to rewrite instances of functions that are SATed purely because they have static function parameters back into the non-SATed versions if they do not experience inlining. This prevents the worst cases that I got before when always SATing functions with static function parameters. What's more, I've found that most if not all SAT opportunities are found early in the pipeline by these new criteria, so we can get away with just one SAT pass instead of the two I reported earlier.

All in all, these developments are pretty promising and I suspect I'll be able to integrate them into HEAD in time for 6.10, which will hopefully alleviate some of your worries about the efficacy of SAT.

Cheers, Max

On Friday 20 June 2008 21:34:14 Dan Doel wrote:
On Friday 20 June 2008, Max Bolingbroke wrote:
Of course, if you have any suggestions for good heuristics based on your benchmarking experience then we would like to hear them! There was some discussion of this in the original ticket, http://hackage.haskell.org/trac/ghc/ticket/888, but when implementing SAT I tried out the suggestions made there without good results (though to be perfectly honest I didn't really understand the motivations behind some of the suggestions made).
However, if I had to pick something out of the air, I'd say this: always do SAT when the argument in question is a function. This is actually the reason I started doing it in the first place. I was working on sorting uvectors, and had what I thought was pretty good performance, and then I did SAT on the comparison function, and suddenly my code was running in half the time. Going back to my sorting (testing on introsort), SAT for the array doesn't seem to do much, but undoing the SAT for the comparison function causes heap allocation to balloon, and everything gets at least a factor of 2 slower.
I've been wondering if a nice option would be to be able to feed profiler information in at compile time and have it override the heuristics. That way, inlining, specialization, SAT, etc., decisions could be made based on how the code actually gets used during a typical run of the program. Cheers! -Tyson

| main = print $ foldl' (+) 0 [1..]
|
| with
|
| foldl' f y xs = foldl' y xs
|   where foldl' y []     = y
|         foldl' y (x:xs) = foldl' (f y x) xs
|
| runs indefinitely with very little memory consumption, while
|
| foldl' f y []     = y
| foldl' f y (x:xs) = foldl' f (f y x) xs
|
| rapidly consumes all the machine's memory and dies.

Others have explained this nicely. But there's a real tension here. The fast version comes from a combination of (a) the static argument transformation, so you get the first version above, and (b) bodily inlining the entire function, so that at *each call site* you get a locally-recursive function where 'f' is known.

That's ok for small functions, but not so good for big ones. Furthermore, the code duplication is only worthwhile if the specialisation is truly useful. For example, would it be better to write append like this

    (++) xs ys = letrec
                   app []     = ys
                   app (x:xs) = x : app xs
                 in app xs

and inline that at every call of (++)? Probably not.

So that is why GHC does not automate this transformation. If you know that's what you want, write a local recursion, and use an INLINE pragma.

If someone felt like summarising this thread on the Haskell performance-advice wiki that would be great. http://haskell.org/haskellwiki/Performance

Meanwhile, I'll clarify in the user manual that recursive functions are not inlined.

Simon
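Putting Simon's advice into practice on the function that started the thread, a hedged sketch: a local recursion over the static 'f', a seq for the strictness Duncan pointed out was missing, and an INLINE pragma so each call site gets a copy of the loop specialised to its particular f.

    foldl' :: (b -> a -> b) -> b -> [a] -> b
    foldl' f = go
      where
        go y []     = y
        go y (x:xs) = let y' = f y x in y' `seq` go y' xs
    {-# INLINE foldl' #-}

    -- With this definition, foldl' (+) 0 [1..] behaves like Tyson's
    -- first version: the accumulator is forced on every step, so no
    -- thunk chain builds up, and after inlining each call site gets a
    -- loop specialised to the known function.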

Hello Simon,

Friday, May 30, 2008, 5:30:25 PM, you wrote:

Maybe I don't understand something, but isn't it better to do automatic SAT and inline the result for every recursive function marked as INLINE? That's how I want to work - just mark the speed-critical funcs with INLINE; manually checking that they are recursive and doing the appropriate transformation is too hard for me :)

And, by the way, how about adding warnings about functions marked as INLINE which were not actually inlined for some reason - that may be very helpful for optimizing programs without going into studying Core output.
Others have explained this nicely. But there's a real tension here. The fast version comes from a combination of (a) the static argument transformation, so you get the first version above, and (b) bodily inlining the entire function, so that at *each call site* you get a locally-recursive function where 'f' is known. That's ok for small functions, but not so good for big ones. Furthermore, the code duplication is only worthwhile if the specialisation is truly useful. For example, would it be better to write append like this

    (++) xs ys = letrec
                   app []     = ys
                   app (x:xs) = x : app xs
                 in app xs

and inline that at every call of (++)? Probably not.
So that is why GHC does not automate this transformation. If you know that's what you want, write a local recursion, and use an INLINE pragma.
If someone felt like summarising this thread on the Haskell performance-advice wiki that would be great. http://haskell.org/haskellwiki/Performance
Meanwhile, I'll clarify in the user manual that recursive functions are not inlined.
Simon
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (7): Bulat Ziganshin, Dan Doel, Duncan Coutts, Isaac Dupree, Max Bolingbroke, Simon Peyton-Jones, Tyson Whitehead