
Yes - this is why my use of a kind of unrolling fixes concatMap for streams: GHC becomes able to specialise the "unrolled" function body on a particular lambda abstraction. However, this is really a somewhat separate issue from plain unrolling, as we just want to be able to /specialise/ recursive functions on particular arguments, rather than to reduce loop overhead or reassociate arithmetic over several iterations.
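(A list-level sketch of the shape of the problem - the thread is about stream fusion, but the issue is the same. 'concatMapL' and its body are illustrative names, not code from the thread: written as a plain recursive definition, the function argument 'f' is opaque inside the loop; written as a non-recursive wrapper around a local loop, the wrapper can be inlined and the loop body specialised on the concrete lambda at each call site.)

```haskell
-- Non-recursive wrapper around a recursive local loop: inlining
-- 'concatMapL' exposes the particular lambda bound to 'f' to the
-- simplifier inside 'go'.
concatMapL :: (a -> [b]) -> [a] -> [b]
concatMapL f = go
  where
    go []       = []
    go (x : xs) = f x ++ go xs
{-# INLINE concatMapL #-}

main :: IO ()
main = print (concatMapL (\x -> [x, x]) [1, 2])
```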
What is the issue with concatMap? It sounds like your specialization is based on the recursion equivalent of loop peeling (unrolling the "non-recursive" calls, the entry points into the recursion), a close variant of loop unrolling (unrolling the recursive calls, inside the loop body). If followed by the static argument transformation, that might cover the majority of hand-written worker-wrapper pairs (replacing manual by compiler optimization is always nice). So, instead of splitting a recursive 'f' into 'fWorker' and 'fWrapper', with an INLINE pragma for 'fWrapper', one might in future be able simply to say '{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling is also desirable, '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And GHC would do the work, by unfolding the entry points once (the inlining of the wrapper), unfolding the recursive calls 8 times (the loop unrolling), and taking the INLINE PEEL pragma also as a hint to try the static argument transformation.
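(For concreteness, here is the hand-written worker/wrapper pattern that such a PEEL pragma would subsume. The names 'fWorker'/'fWrapper' follow the message; the body, a simple accumulating sum, is illustrative only.)

```haskell
-- The worker carries the recursion; it is never inlined.
fWorker :: Int -> [Int] -> Int
fWorker acc []       = acc
fWorker acc (x : xs) = fWorker (acc + x) xs

-- The wrapper is non-recursive and INLINEd - this is the "peel":
-- one unfolding of the entry point at every call site.
fWrapper :: [Int] -> Int
fWrapper = fWorker 0
{-# INLINE fWrapper #-}

main :: IO ()
main = print (fWrapper [1, 2, 3])
```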
This is why the static argument transformation is such a big win (as I've mentioned before, a 12% decrease in nofib runtime if you use it) - because it finds instances of recursive definitions where it's a REALLY GOOD idea to specialise on a particular argument (since that argument is actually /invariant/) and gives GHC the opportunity to specialise on it by creating a non-recursive wrapper around the recursive worker loop.
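(A sketch of the transformation itself, on an illustrative 'filterRec' not taken from the thread: the predicate 'p' is passed unchanged through every recursive call, so SAT lifts it out into a non-recursive wrapper around a worker that closes over it.)

```haskell
-- Before SAT: 'p' is a static argument, threaded invariantly
-- through every recursive call.
filterRec :: (a -> Bool) -> [a] -> [a]
filterRec p []       = []
filterRec p (x : xs)
  | p x       = x : filterRec p xs
  | otherwise = filterRec p xs

-- After SAT: 'p' is bound once, 'go' closes over it, and inlining
-- the wrapper lets GHC specialise the loop on a known predicate.
filterSAT :: (a -> Bool) -> [a] -> [a]
filterSAT p = go
  where
    go []       = []
    go (x : xs)
      | p x       = x : go xs
      | otherwise = go xs
{-# INLINE filterSAT #-}

main :: IO ()
main = print (filterSAT even [1 .. 6 :: Int])
</imports-placeholder>
```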
In general, the compiler wants to be able to determine the structure of the arguments of a loop body in a more fine-grained way than just "invariant vs non-invariant", which is all SAT does. A particularly tempting example of an optimisation you could do if we dealt with recursive functions better is strength reduction. This is part of what I'm looking at implementing for GHC currently.
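(A hedged sketch of what strength reduction means for such a loop - the names and the address-computation example are mine, not from the thread: the naive version multiplies the loop counter by a stride on every iteration, while the reduced version maintains the product as a running value, replacing the multiplication with an addition. Seeing that this is valid requires knowing how 'i' evolves across calls, not merely that some argument is invariant.)

```haskell
-- Naive: one multiplication per iteration.
addrsNaive :: Int -> Int -> [Int]
addrsNaive stride n = [ i * stride | i <- [0 .. n - 1] ]

-- Strength-reduced: the product 'i * stride' is carried as 'addr'
-- and updated by addition on each recursive call.
addrsReduced :: Int -> Int -> [Int]
addrsReduced stride n = go 0 0
  where
    go i addr
      | i >= n    = []
      | otherwise = addr : go (i + 1) (addr + stride)

main :: IO ()
main = print (addrsReduced 4 5)
```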
It seems that strength reduction could be seen as loop restructuring in the small: a multiplication, if seen as a repeated addition, can be unrolled, or the implicit adding loop can be fused with the explicit loop in which the multiplication is called (eg figure 7 in the ACM survey paper I mentioned). That way, no separate framework would be needed for strength reduction. Btw, is that also what happened in the -fvia-C path of Don's initial example in this thread (I don't know how to read those leaqs, but the imulq is gone)?

But all these follow-on optimizations enabled by unfolding recursive definitions seem to require further thought and design, whereas user-controlled recursion unfolding (both peel and unroll) seems to offer immediate benefits. Is that part of your current work? Do you foresee any problems with the implementation, or with the API I suggested above (adding PEEL and UNROLL options to INLINE pragmas, to make them effective on recursive definitions as well)?

Claus