
Hi Dan, I've only got time for a quick reply now, I'll see if I can take a more substantitative look at your examples next week.
Specifically, will GHC simply always perform the static argument transform, or will it have some kind of heuristic to decide when it's useful? It seems, according to some tests I've done, that it's not always a win, and is sometimes a loss (and not just with regard to code duplication). I've been hard pressed to come up with a definite rule for when it's a benefit, other than simply testing and seeing what works.
Yes. Basically the rule is that if the recursive function under consideration makes few dynamic iterations then the cost of closure allocation in the SATed version outweighs the benefits of reduced copying etc. Clearly it is quite difficult for the compiler to tell if this is likely to be the case! This has always been a well documented problem with SAT from Santos' thesis where it was introduced and is why it went unimplemented in GHC for so long.
(elided examples, no time to really get into them ATM but they seem to demonstrate this principle).
So, I suppose my question is: will GHC be better at this than I am? :) Will it know that performing the transform on shift above would cause extra heap allocation/slowdowns? Will it know that transforming reverse (and go') is slower than not doing so (I suppose it may be suspicious that this is all low-level code, but I've tried fiddling with higher-level code that I know gets compiled to this sort of code, and the above results seemed to hold)?
In short: no it will not be better at it than you! GHC performs the SAT iff the recursive call is direct (i.e. not on mutually recursive groups of functions) to reduce code expansion, and if the number of static arguments is at least 2. The reason for the last criterion is that moving a parameter to a closure implicitly adds an argument to all the functions that make reference to that variable, the implicit argument being the pointer to the closure. Eliminating an actual function argument just to add a layer of indirection via an allocated closure would be fairly pointless! There is no cleverer criterion involved, and this one was chosen just because it led to good results in the nofib benchmark suite that we consider to comprise "typical" Haskell programs for the purposes of evaluating optimisations. There will certainly be cases where it slows down the program under consideration, though the same can be said for almost any compiler optimisation ("no free lunch"). Of course, if you have any suggestions for good heuristics based on your benchmarking experience then we would like to hear them! There was some discussion of this in the original ticket, http://hackage.haskell.org/trac/ghc/ticket/888, but when implementing SAT I tried out the suggestions made there without good results (though to be perfectly honest I didn't really understand the motivations behind some of the suggestions made). Cheers, Max