
Hey guys,

We have nice fusion frameworks now. E.g. stream fusion on uvector:

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector

Takes something like this:

    import Data.Array.Vector
    import Data.Bits

    main = print . productU . mapU (*2) . mapU (`shiftL` 2)
         $ replicateU (100000000 :: Int) (5 :: Int)

and turns it into a loop like this:

    $wfold :: Int# -> Int# -> Int#
    $wfold =
      \ (ww_sWX :: Int#) (ww1_sX1 :: Int#) ->
        case ww1_sX1 of wild_B1 {
          __DEFAULT -> $wfold (*# ww_sWX 40) (+# wild_B1 1);
          100000000 -> ww_sWX
        }

Now, that's fine in my book. Going via -fasm, we get:

    Main_zdwfold_info:
    .LcYt:
            movq %rdi,%rax
            cmpq $100000000,%rax
            jne .LcYx
            movq %rsi,%rbx
            jmp *(%rbp)
    .LcYx:
            incq %rax
            imulq $40,%rsi
            movq %rax,%rdi
            jmp Main_zdwfold_info

Ok:

    $ time ./product
    0
    ./product  0.31s user 0.00s system 96% cpu 0.316 total

Going via C, however, we get:

    Main_zdwfold_info:
            cmpq $100000000, %rdi
            je .L6
    .L2:
            leaq (%rsi,%rsi,4), %rax
            leaq 1(%rdi), %rdi
            leaq 0(,%rax,8), %rsi
            jmp Main_zdwfold_info

Nice!

    $ time ./product
    0
    ./product  0.19s user 0.00s system 97% cpu 0.197 total

So now, since we've gone to such effort to produce a tiny loop like this, can't we unroll it just a little? Sadly, my attempts to get GCC to trigger its loop unroller on this guy haven't succeeded: -funroll-loops and -funroll-all-loops don't touch it. Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?

-- Don
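[For a concrete idea of the goal, here is a minimal hand-written sketch of what a 4-way unrolling of that worker could look like as source. The name, the start index, and the unroll factor are assumptions; this is neither GHC nor GCC output.]

    {-# LANGUAGE BangPatterns #-}

    -- Hand-written sketch of a 4-way unrolling of the fused worker;
    -- 40 * 40 * 40 * 40 does four of the original steps at once, and
    -- the last guard is the remainder loop for leftover iterations.
    wfold4 :: Int -> Int -> Int
    wfold4 !acc !i
      | i == 100000000     = acc
      | i + 4 <= 100000000 = wfold4 (acc * (40 * 40 * 40 * 40)) (i + 4)
      | otherwise          = wfold4 (acc * 40) (i + 1)

    main :: IO ()
    main = print (wfold4 1 0)  -- same (overflowing) result, fewer branches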

import Data.Array.Vector
import Data.Bits

main = print . productU . mapU (*2) . mapU (`shiftL` 2)
     $ replicateU (100000000 :: Int) (5 :: Int)
and turns it into a loop like this:
$wfold :: Int# -> Int# -> Int#
$wfold =
  \ (ww_sWX :: Int#) (ww1_sX1 :: Int#) ->
    case ww1_sX1 of wild_B1 {
      __DEFAULT -> $wfold (*# ww_sWX 40) (+# wild_B1 1);
      100000000 -> ww_sWX
    }
..
So now, since we've gone to such effort to produce a tiny loop like this, can't we unroll it just a little? Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?
I'm not sure this is what you're after (been too long since I read assembler;-), but it sounds as if you wanted to unroll the source of that fold, which seems to be a local definition in foldS? Since unrolling is not always a good idea, it would also be nice to have a way to control/initiate it from outside of the uvector package (perhaps a RULE to redirect the call from foldS to a foldSN; foldS is hidden, and gets inlined away, but something like that). If that works, you'd then run into the issue of wanting to rearrange the nested *# applications by variable and constant, so that the constant factors can be folded together.

Claus
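[For illustration, a minimal sketch of that RULE idea over a simplified stream type. uvector's real foldS is hidden and more elaborate; foldS3 and the rule name stand in for the foldSN mentioned above.]

    {-# LANGUAGE ExistentialQuantification, BangPatterns #-}

    data Step s a = Done | Yield a s
    data Stream a = forall s. Stream (s -> Step s a) s

    -- the plain fold; NOINLINE so the RULE below has a chance to fire
    foldS :: (b -> a -> b) -> b -> Stream a -> b
    foldS f z (Stream next s0) = go z s0
      where
        go !acc s = case next s of
                      Done       -> acc
                      Yield x s' -> go (f acc x) s'
    {-# NOINLINE foldS #-}

    -- "unrolled" variant: consume up to three elements per recursive
    -- call, giving later passes a bigger loop body to work with
    foldS3 :: (b -> a -> b) -> b -> Stream a -> b
    foldS3 f z (Stream next s0) = go z s0
      where
        go !acc s1 = case next s1 of
          Done        -> acc
          Yield x1 s2 -> case next s2 of
            Done        -> f acc x1
            Yield x2 s3 -> case next s3 of
              Done        -> f (f acc x1) x2
              Yield x3 s4 -> go (f (f (f acc x1) x2) x3) s4

    {-# RULES "foldS/unrolled" forall f z s. foldS f z s = foldS3 f z s #-}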

2009/2/28 Don Stewart
So now, since we've gone to such effort to produce a tiny loop like this, can't we unroll it just a little? Sadly, my attempts to get GCC to trigger its loop unroller on this guy haven't succeeded: -funroll-loops and -funroll-all-loops don't touch it.
Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?
For a couple of weeks, I have had a working solution for the concatMap problem using a sort of loop unrolling. I have tweaked the approach slightly to also unroll the worker loop to get the results you desire. You can check out the (very rough) code with:

  git clone http://www.cl.cam.ac.uk/~mb566/git/concatmap/.git/
  $EDITOR concatmap/CallUnrollConcatMap.hs

Apologies if the code is somewhat cryptic, but you should be able to get the general idea. A sneak preview is in order. The following Core:

"""
Rec {
$wf1_s1bU [ALWAYS LoopBreaker Nothing] :: GHC.Prim.Int# -> GHC.Prim.Int#
[Arity 1 Str: DmdType L]
$wf1_s1bU =
  \ (ww_s1bO :: GHC.Prim.Int#) ->
    case GHC.Prim.<=# ww_s1bO 100000000 of wild_B1 [ALWAYS Dead Just A] {
      GHC.Bool.False -> 0;
      GHC.Bool.True ->
        let {
          x_XMS [ALWAYS Just L] :: GHC.Prim.Int#
          [Str: DmdType]
          x_XMS = GHC.Prim.+# ww_s1bO 1 } in
        case GHC.Prim.<=# x_XMS 100000000 of wild_Xx [ALWAYS Dead Just A] {
          GHC.Bool.False ->
            GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2;
          GHC.Bool.True ->
            let {
              x_XMX [ALWAYS Just L] :: GHC.Prim.Int#
              [Str: DmdType]
              x_XMX = GHC.Prim.+# x_XMS 1 } in
            case GHC.Prim.<=# x_XMX 100000000 of wild_XE [ALWAYS Dead Just A] {
              GHC.Bool.False ->
                GHC.Prim.+#
                  (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2)
                  (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2);
              GHC.Bool.True ->
                let {
                  x_XOf [ALWAYS Just L] :: GHC.Prim.Int#
                  [Str: DmdType]
                  x_XOf = GHC.Prim.+# x_XMX 1 } in
                case GHC.Prim.<=# x_XOf 100000000 of wild_XM [ALWAYS Dead Just A] {
                  GHC.Bool.False ->
                    GHC.Prim.+#
                      (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2)
                      (GHC.Prim.+#
                         (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2)
                         (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMX 2) 2));
                  GHC.Bool.True ->
                    case $wf1_s1bU (GHC.Prim.+# x_XOf 1) of ww_s1bS { __DEFAULT ->
                      GHC.Prim.+#
                        (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2)
                        (GHC.Prim.+#
                           (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2)
                           (GHC.Prim.+#
                              (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMX 2) 2)
                              (GHC.Prim.+#
                                 (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XOf 2) 2)
                                 ww_s1bS)))
                    }
                }
            }
        }
    }
end Rec }
"""

is generated by this program:

"""
result = sumS . mapS (*2) . mapS (`shiftL` 2) $ enumFromToS 0 100000000
"""

Of course, my approach is far from perfect:

* Unrolling ALWAYS happens, and to a fixed depth
* RULEs aren't very good at exploiting properties of arithmetic, as Claus has pointed out
* concatMap fuses with my library, but has lingering issues with allocation if join points don't get inlined, and has some strictness problems too (to see this in action, try compiling the program "sumS $ mapS (+10) $ concatMapS (\x -> enumFromToS x 20) $ enumFromToS 1 10" from the same file). It is also only permitted up to a fixed depth, as defined by the level of unrolling specified in the "spec" combinator.

But it does get your unrolling with today's GHC, transparently to the user of the uvector library. I am currently looking at other, smarter ways that GHC can optimize loops as part of my research - so with luck this sort of manual unrolling hackery will become less relevant in the future.

All the best, Max

On 01/03/2009, at 04:49, Don Stewart wrote:
So now, since we've gone to such effort to produce a tiny loop like this, can't we unroll it just a little? Sadly, my attempts to get GCC to trigger its loop unroller on this guy haven't succeeded: -funroll-loops and -funroll-all-loops don't touch it.
That's because the C produced by GHC doesn't look like a loop to GCC. This can be fixed but given that we are moving away from -fvia-C anyway, it probably isn't worth doing.
Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC it is worth unrolling this guy, so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?
The problem with low-level loop optimisations is that, in general, they should be done at a low level. Core is much too early for this. To find out whether and how much to unroll a particular loop, you must take things like register pressure and instruction scheduling into account. IMO, the backend is the only reasonable place to do these optimisations. Using an existing backend like LLVM would really help here.

Roman

its loop unroller on this guy haven't succeeded: -funroll-loops and -funroll-all-loops don't touch it.
That's because the C produced by GHC doesn't look like a loop to GCC. This can be fixed but given that we are moving away from -fvia-C anyway, it probably isn't worth doing.
That was one of my questions in the optimization and rewrite rules thread: shouldn't -fvia-C be supported (as a non-default option) for at least as long as the alternative isn't a clear win in all cases? If there are small changes that could make GHC-generated code more palatable to GCC's optimizer, wouldn't that be worth doing? Once -fvia-C is allowed to bitrot to the level of unoptimized bootstraps only, we might never get the good performance route back, so why not keep it in good shape as long as it offers real benefits?
The problem with low-level loop optimisations is that in general, they should be done at a low level. Core is much too early for this. To find out whether and how much to unroll a particular loop, you must take things like register pressure and instruction scheduling into account. IMO, the backend is the only reasonable place to do these optimisations.
[1] is one example reference for this argument (late unrolling). And since most compiler textbooks are oddly quiet about optimizations and their interactions, the survey [2] might also be helpful (and [3] has some pointers to more recent work).

However, I'd like to note that Core is rather different from conventional source-level code, so I would expect benefits from source-level "unrolling", too: Core retains much more of the high-level semantics, so both identifying loops and applying library-specific optimizations after unrolling are much easier here than at the basic-block level in the backend. After all, it is just the normal unfolding/inlining that forms the starting point for so many of GHC's optimizations, which just happens to be blind to recursive definitions at the moment. Recursive definitions are quite widely used in Haskell code, so this blind spot can't be good for the overall effectiveness of GHC's optimizer. If one could mark recursive bindings with a counter, to limit unfoldings according to a compiler option, generalised loop unrolling would just be a consequence of what GHC does anyway, right?

That doesn't change the point that, at the lower level, loop unrolling interacts strongly with the target architecture, and that some relevant information is not available at Core level. But it might be better to do both Core-level unfolding (to enable further Core2Core optimizations, independent of platform, that might no longer be visible at backend level) and backend-level unfolding and re-folding (to massage the low-level flow graph into a shape suited for the target architecture, about which the Core level has no information). One might also expect that Core-level transformations are affected by compiler flags which users select according to their target architecture (common practice at the moment), so Core2Core isn't entirely out of that loop, either;-)

It is worth noting that there is a third level of optimizations, even after the backend, in modern hardware, as these notes [4] for an Intel compiler's unrolling option document. And since I'm collecting references, there's also Ian's TH [5] for doing the unrolling even before Core.

Claus

[1] An Aggressive Approach to Loop Unrolling, 1995
    http://citeseer.ist.psu.edu/old/620489.html
[2] Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 1994
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.4885
[3] http://en.wikipedia.org/wiki/Loop_transformation
[4] http://www.intel.com/software/products/compilers/flin/docs/main_for/mergedpr...
[5] Unrolling and simplifying expressions with Template Haskell, Ian Lynagh, 2003
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9813
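[As an aside on the counter idea above, here is what a single counted unfolding of a recursive binding amounts to, done by hand. Illustration only, not compiler output; the names are made up.]

    pow :: Int -> Int -> Int
    pow _ 0 = 1
    pow b n = b * pow b (n - 1)

    -- the recursive call unfolded once (one count spent); a compiler
    -- option could bound how often this step is repeated
    powU :: Int -> Int -> Int
    powU _ 0 = 1
    powU b n = b * (let n' = n - 1
                    in if n' == 0 then 1 else b * powU b (n' - 1))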

Claus Reinke wrote:
its loop unroller on this guy haven't succeeded: -funroll-loops and -funroll-all-loops don't touch it.
That's because the C produced by GHC doesn't look like a loop to GCC. This can be fixed but given that we are moving away from -fvia-C anyway, it probably isn't worth doing.
That was one of my questions in the optimization and rewrite rules thread: shouldn't -fvia-C be supported (as a non-default option) for at least as long as the alternative isn't a clear win in all cases?
If there are small changes that could make GHC-generated code more palatable to GCC's optimizer, wouldn't that be worth doing? Once -fvia-C is allowed to bitrot to the level of unoptimized bootstraps only, we might never get the good performance route back, so why not keep it in good shape as long as it offers real benefits?
The trouble with supporting multiple backends is that the cost in terms of testing and maintenance is high. And the registerised -fvia-C backend is particularly nasty, coming as it does with thousands of lines of Perl 4 that regularly get broken by new versions of gcc. The registerised via-C backend should have been retired long ago. It's time to take it round back and shoot it. We should spend our efforts on finding a good long-term solution rather than patching this dead-end, IMHO. Cheers, Simon

That was one of my questions in the optimization and rewrite rules thread: shouldn't -fvia-C be supported (as a non-default option) for at least as long as the alternative isn't a clear win in all cases?
The trouble with supporting multiple backends is that the cost in terms of testing and maintenance is high. And the registerised -fvia-C backend is particularly nasty, coming as it does with thousands of lines of Perl 4 that regularly get broken by new versions of gcc.
Yes, I can understand that you'd like to leave that part behind sometime before yesterday:-) I assume that this very complexity means that the -fvia-C route doesn't really get all the way to its most interesting promises (easy portability, and full backend optimizations inherited from gcc). And with that in mind, I can also understand that you don't want to put any further work into trying to improve it, if that distracts from a better long-term solution.

What I don't understand yet is the roadmap for replacing -fvia-C. We've seen -fvia-C demoted from default to backup (fine by me), and we've seen a feature supported only by -fvia-C removed completely, instead of seeing support for it added to the -fasm route (macro-based APIs used to work with the ffi, but would now require a wrapper generator, which doesn't exist yet). Indications are that -fvia-C still tends to produce better code (even though it is not the best that ghc+gcc could produce) than -fasm (is that any better for the new backend?). And last, but not least, ghc has more limited resources than gcc, so how is ghc going to beat gcc at the portability and backend-optimization game while still making progress in its core competencies (ie, higher-level improvements; there's also the interesting side issue of how the two stages of optimizations are going to interact in ghc, if there is a barrier that can only be crossed in one direction)?
The registerised via-C backend should have been retired long ago. It's time to take it round back and shoot it. We should spend our efforts on finding a good long-term solution rather than patching this dead-end, IMHO.
No disagreement there (apart from the violent metaphor). I'm just worried about pragmatics, ie scuttling the ship before we've counted our life boats!-) And I suspect that, for ghc, trying to do everything itself on all platforms (rather than trying for very good -fasm on some platforms of interest, and good -fvia-C as a fallback everywhere else) is going to be anything but less work than patching that dead-end (though no doubt more interesting).

In other words, what is the plan wrt backends, especially wrt recovering the optimizations and portability previously left to gcc? When will the fast via-C route be retired, what quality of replacement will be in place at that time, how long to catch up to where we are now, how to keep up, etc.?

Claus

Claus Reinke wrote:
That was one of my questions in the optimization and rewrite rules thread: shouldn't -fvia-C be supported (as a non-default option) for at least as long as the alternative isn't a clear win in all cases?
The trouble with supporting multiple backends is that the cost in terms of testing and maintenance is high. And the registerised -fvia-C backend is particularly nasty, coming as it does with thousands of lines of Perl 4 that regularly get broken by new versions of gcc.
Yes, I can understand that you'd like to leave that part behind sometime before yesterday:-) I assume that this very complexity means that the -fvia-C route doesn't really get all the way to its most interesting promises (easy portability, and full backend optimizations inherited from gcc). And with that in mind, I can also understand that you don't want to put any further work into trying to improve it, if that distracts from a better long-term solution.

What I don't understand yet is the roadmap for replacing -fvia-C. We've seen -fvia-C demoted from default to backup (fine by me), and we've seen a feature supported only by -fvia-C removed completely, instead of seeing support for it added to the -fasm route (macro-based APIs used to work with the ffi, but would now require a wrapper generator, which doesn't exist yet). Indications are that -fvia-C still tends to produce better code (even though it is not the best that ghc+gcc could produce) than -fasm (is that any better for the new backend?). And last, but not least, ghc has more limited resources than gcc, so how is ghc going to beat gcc at the portability and backend-optimization game while still making progress in its core competencies (ie, higher-level improvements; there's also the interesting side issue of how the two stages of optimizations are going to interact in ghc, if there is a barrier that can only be crossed in one direction)?
Ok, thanks for bringing these points up. Hopefully I'll be able to lay your fears to rest:

1. Performance.

-fvia-C currently produces code that is on average about 1% faster than -fasm:

  http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html

There's one notable exception: floating-point code on x86 (not x86_64) is terrible with -fasm, because our native code generator has a particularly simple/stupid implementation of the x87 instruction set. So we need to make the SSE2 code generator in the x86_64 backend work for x86, too.

Having said that, the native backend has much more potential for generating faster code than we can get with gcc. Firstly, it can re-use fixed registers (e.g. argument registers) within a basic block, whereas gcc can't. We don't do this currently because the C-- lacks liveness information on jumps, but the new backend will be able to do it. I bet this alone will be worth more than that 1%. Secondly, we have a much better handle on aliasing inside GHC than gcc does, and there's no good way to tell gcc what we know about aliasing. On x86, gcc has a grand total of 2 spare registers, which means it has virtually no scope for generating good code.

There's also not much room for generating C that is more amenable to gcc's optimisations. The obvious thing to do is to make recursive functions look like loops. We've tried it (there's some experimental code in GHC to do it); IIRC it didn't buy very much. The lack of registers, and the lack of knowledge about aliasing (the heap doesn't alias with the stack), meant that gcc didn't do some obvious-looking optimisations. Trying to do better here is a dead end.

2. Portability.

We haven't had a single new registerised port of GHC in many years now. While the via-C backend seems at first glance to offer some portability benefits, in practice porting the mangler is still a pain unless your platform is very similar to an existing one (e.g. vanilla ELF). The only C-only registerised port we had was Sparc, and thanks to Ben Lippmeier we now have a native backend for that too. Dropping the C backend won't harm any of our existing ports, and it doesn't seem like people are making new ports of GHC this way either. We'll still have the unregisterised porting route, whose only drawback is performance. Still, lots of platforms are successfully using unregisterised GHC ports (via Debian).

One day maybe we'll have an LLVM backend, or similar. My impression is that right now we can't make an LLVM backend with as good performance as our native backend, without changes in LLVM. Maybe that will change. Nothing that we're doing now precludes adding an LLVM backend later, I believe.

3. Features.

This is a non-issue: -fvia-C vs. -fasm should not affect what programs work. Up until 6.10.1 we had a bug whereby you could use -fvia-C to bind to CPP-based C APIs, but that bug was removed in 6.10.1. Ok, I realise that some people considered this to be a feature and its removal to be a regression. However, I believe it's more important that we conform to the FFI spec, and for -fasm to be consistent with -fvia-C. It's a slight inconvenience to have to write the wrappers, but in return you get more robust code. Of course we should have tool support to make generating the wrappers easier.

Cheers, Simon

Dear Simon*, thanks for answering my concerns about the -fvia-C replacement. Are these answers somewhere in the ghc wiki, or perhaps they'd make a good basis for a useful ghc blog post?

So: -fasm will soon be up to speed with -fvia-C in all cases, new native backends are not more difficult than more mangler branches, and tools for generating wrappers for CPP-based APIs should be provided. Just one additional point re:
We haven't had a single new registerised port of GHC in many years now.
Interest in new platforms is increasing again, though. People have been talking about the PS3, internet tablets, multicore machines, .. Personally, I'd like to be able to use GHC on, or at least for, coming smartphone generations, etc. (I don't see myself looking at native backends there, but probably I wouldn't have braved the mangler, either; still, someone else might prefer one over the other). And I don't understand how people can be happy with unregisterised GHC ports for long, given how many optimizations GHC is not doing even in its best form!-)

Thanks again, Claus

| What I don't understand yet is the roadmap for replacing -fvia-C

Good points, Claus. I think the story is as follows:

* -fvia-C does not produce much better code, except in exceptionally tight loops, because GHC gives gcc very little scope for optimisation. Simon mentioned something like a 1% improvement.

* -fvia-C does not give substantially improved portability, because the Evil Mangler must have lots of new (Perl) code for each new platform. (And each new version of gcc changes the details.)

* -fvia-C does impose maintenance costs, as this thread has rehearsed.

* -fasm has the potential for producing *better* code than gcc, because we can temporarily re-use registers that we must nail down as far as gcc is concerned.

| In other words, what is the plan wrt backends, especially wrt
| recovering the optimizations and portability previously left to gcc?

I think you may be over-optimistic about the portability and optimisation benefits. As to other back-end plans, it's a fairly active place. Ben L is doing great stuff on refactoring the native code back end as part of his Sparc NCG. And John and Norman and I are actively (albeit diverted recently by ICFP submissions) working on getting the refactored STG...flat C-- story into the mainstream.

Simon

So now, since we've gone to such effort to produce a tiny loop like this, can't we unroll it just a little? [..] so we get the win of both aggressive high level fusion, and aggressive low level loop optimisations?
It might be useful to point out that the interaction goes both ways. Not only are fused loops candidates for unrolling, but unrolling can also enable fusion, giving one example of why Core-level unrolling (in addition to backend-level loop restructuring) would be useful. Consider this silly example (with Apply as before, in the rewrite rules thread, just syntactically unrolling the loop, and loop as before, but generalised to arbitrary accumulators, see below):

--------------------------------------------------------
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE BangPatterns #-}

import Data.Array.Vector
import Data.Bits
import Apply
import GHC.Prim
import GHC.Base

main = print $ loop 1 10000000 body (toU [1,2,3,4,5::Int])

body i arr = mapU (42+) arr
--------------------------------------------------------

Here, the refusal to partially unfold recursive definitions means there are no opportunities for fusion, whereas unrolling enables fusion (which wouldn't work if unrolling was done only in the backend, after fusion).

--------------------------------------------------------
{-# INLINE loop #-}
loop :: Int -> Int -> (Int -> acc -> acc) -> acc -> acc
loop i max body acc = loopW i acc
  where
#ifdef N
  loopW !i !acc | i+N<=max = loopW (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc)
#endif
  loopW !i !acc | i<=max    = loopW (i+1) (body i acc)
                | otherwise = acc
--------------------------------------------------------

Compare the versions without and with unrolling, not just for time, but for allocation (+RTS -s). As usual, we'd like to reassociate the sums to enable constant folding, but this rule

{-# RULES
-- "reassoc" forall a# b# c. ((I# a#) +# ((I# b#) +# c)) = ((I# a#) +# (I# b#)) +# c
  #-}

is rejected.

Claus
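[For readers without the rewrite-rules thread at hand: the Apply module is not shown above, so here is a guessed minimal reconstruction - an assumption about its behaviour, not Claus's original code - in which $(apply i n) builds \f acc -> f (n-1) (... (f (i+1) (f i acc)) ...):]

    {-# LANGUAGE TemplateHaskell #-}
    module Apply (apply) where

    import Language.Haskell.TH

    -- build n-i syntactic copies of the loop body, each applied to a
    -- successive index literal, innermost first
    apply :: Int -> Int -> Q Exp
    apply i n = do
      f   <- newName "f"
      acc <- newName "acc"
      let go j body
            | j >= n    = body
            | otherwise = go (j + 1)
                             (varE f `appE` litE (integerL (toInteger j))
                                     `appE` body)
      lamE [varP f, varP acc] (go i (varE acc))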

2009/3/1 Claus Reinke
It might be useful to point out that the interaction goes both ways. Not only are fused loops candidates for unrolling, but unrolling can also enable fusion, giving one example of why Core-level unrolling (in addition to backend-level loop restructuring) would be useful.
Yes - this is why my use of a kind of unrolling fixes concatMap for streams, because GHC is able to specialise the "unrolled" function body on a particular lambda abstraction. However, this is really a somewhat separate issue from plain unrolling, as we just want to be able to /specialise/ recursive functions on particular arguments rather than reduce loop overhead / reassociate arithmetic over several iterations.

This is why the static argument transformation is such a big win (as I've mentioned before, a 12% decrease in nofib runtime if you use it) - because it finds instances of recursive definitions where it's a REALLY GOOD idea to specialise on a particular argument (since that argument is actually /invariant/) and gives GHC the opportunity to specialise on it by creating a nonrecursive wrapper around the recursive worker loop.

In general, the compiler wants to be able to determine the structure of the argument of a loop body in a more fine-grained way than just "invariant vs non-invariant" as SAT does. A particularly tempting example of an optimisation you could do if we dealt with recursive functions better is strength reduction. This is part of what I'm looking at implementing for GHC currently.

Cheers, Max

Yes - this is why my use of a kind of unrolling fixes concatMap for streams, because GHC is able to specialise the "unrolled" function body on a particular lambda abstraction. However, this is really a somewhat separate issue from plain unrolling, as we just want to be able to /specialise/ recursive functions on particular arguments rather than reduce loop overhead / reassociate arithmetic over several iterations.
What is the issue with concatMap? It sounds like your specialization is based on the recursion equivalent to loop peeling (unrolling the "non-recursive" calls, the entry points into the recursion), a close variant of loop unrolling (unrolling the recursive calls, inside the loop body). If followed by the static argument transformation, that might cover the majority of hand-written worker-wrapper pairs (replacing manual by compiler optimization is always nice).

So, instead of splitting recursive 'f' into 'fWorker' and 'fWrapper', with an INLINE pragma for 'fWrapper', one might in future be able just to say '{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling is also desirable, '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And GHC would do the work, by unfolding the entry points once (the inlining of the wrapper), unfolding the recursive calls 8 times (the loop unrolling), and taking the INLINE PEEL pragma also as a hint to try the static argument transformation.
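[For comparison, the manual idiom that '{-# INLINE f PEEL 1 #-}' is meant to subsume - an illustrative example, not code from the thread:]

    {-# LANGUAGE BangPatterns #-}

    -- non-recursive, inlinable wrapper around a recursive worker; the
    -- static argument hi is captured by the worker's enclosing scope
    sumTo :: Int -> Int -> Int
    sumTo lo hi = go lo 0
      where
        go !i !acc
          | i > hi    = acc
          | otherwise = go (i + 1) (acc + i)
    {-# INLINE sumTo #-}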
This is why the static argument transformation is such a big win (as I've mentioned before, 12% decrease in nofib runtime if you use it) - because it finds instances of recursive definitions where it's a REALLY GOOD idea to specialise on a particular argument (since that argument is actually /invariant/) and gives GHC the opportunity to specialise on it by creating a nonrecursive wrapper around the recursive worker loop.
In general, the compiler wants to be able to determine the structure of the argument of a loop body in a more fine grained way than just "invariant vs non-invariant" as SAT does. A particularly tempting example of an optimisation you could do if we dealt with recursive functions better is strength reduction. This is part of what I'm looking at implementing for GHC currently.
It seems that strength reduction could be seen as loop restructuring in the small: a multiplication, if seen as a repeated addition, can be unrolled, or the implicit adding loop can be fused with the explicit loop in which the multiplication is called (eg figure 7 in the ACM survey paper I mentioned). That way, no separate framework would be needed for strength reduction. Btw, is that also what happened in the -fvia-C path of Don's initial example in this thread (I don't know how to read those leaqs, but the imulq is gone)?

But all these follow-on optimizations enabled by unfolding recursive definitions seem to require further thought and design, whereas user-controlled recursion unfolding (both peel and unroll) seems to offer immediate benefits. Is that part of your current work? Do you foresee any problems with the implementation, or with the API I suggested above (adding PEEL and UNROLL options to INLINE pragmas, to make them effective on recursive definitions as well)?

Claus
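[A concrete, hand-written instance of the strength reduction being discussed - illustrative of what the transformation would produce, not GHC output:]

    {-# LANGUAGE BangPatterns #-}

    -- before: the body recomputes i * 40 on every iteration
    sumMul :: Int -> Int
    sumMul n = go 0 0
      where
        go !i !s | i == n    = s
                 | otherwise = go (i + 1) (s + i * 40)

    -- after strength reduction: the product becomes an accumulator
    -- stepped by 40, trading the multiplication for an addition
    sumMulSR :: Int -> Int
    sumMulSR n = go 0 0 0
      where
        go !i !i40 !s | i == n    = s
                      | otherwise = go (i + 1) (i40 + 40) (s + i40)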

2009/3/1 Claus Reinke
What is the issue with concatMap?
ConcatMap doesn't usually fuse under stream fusion - see http://www.cse.unsw.edu.au/~dons/papers/stream-fusion.pdf for the gory details.
It sounds like your specialization is based on the recursion equivalent to loop peeling (unrolling the "non-recursive" calls, the entry points into the recursion), a close variant of loop unrolling (unrolling the recursive calls, inside the loop body).
This sounds right - to get concatMap specialised I "unpeel" the unstream loop (which has been slightly modified) 4 iterations, after which unstream recurses back into itself in a tight loop. This lets GHC specialise the first 3 iterations however it likes. This is achieved by the spec4 combinator in the code I posted.
If followed by the static argument transformation, that might cover the majority of hand-written worker-wrapper pairs (replacing manual by compiler optimization is always nice).
Right. Since GHC is so blind to recursion at the moment this could be a substantial win (though my gut tells me that SAT alone is a large part of the win here).
So, instead of splitting recursive 'f' into 'fWorker' and 'fWrapper', with an INLINE pragma for 'fWrapper', one might in future be able just to say '{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling is also desirable '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And GHC would do the work, by unfolding the entry points once (the inlining of the wrapper), unfolding the recursive calls 8 times (the loop unrolling), and taking the INLINE PEEL pragma also as a hint to try the static argument transformation.
Right, and INLINE PEEL might be a nice interface for the user. Of course, we'd probably want an automated system for working out when this is a good idea as well - in the same way that we have INLINE pragmas and a load of inlining heuristics.
It seems that strength reduction could be seen as loop restructuring in the small: a multiplication, if seen as a repeated addition, can be unrolled, or the implicit adding loop can be fused with the explicit loop in which the multiplication is called (eg figure 7 in the ACM survey paper I mentioned).
I hadn't thought about it in quite those terms before - cute :-)
That way, no separate framework would be needed for strength reduction. Btw, is that also what happened in the -fvia-C path of Don's initial example in this thread (I don't know how to read those leaqs, but the imulq is gone)?
I am no assembly guru and haven't seen that last form of leaq either, but I'm going to guess that:

  leaq (%rsi,%rsi,4), %rax

says that rax is rsi * ((1 + 1) * 2 ^ 4) = rsi * 32, and that:

  leaq 0(,%rax,8), %rsi

finishes it off by adding the final 8* to the mix. So it makes the multiplication easier by breaking it into two multiplications by powers of two. Smart, but you don't need any loop unrolling tech to do it.
But all these follow-on optimizations enabled by unfolding recursive definitions seem to require further thought and design, whereas user-controlled recursion unfolding (both peel and unroll) seems to offer immediate benefits. Is that part of your current work?
I hadn't actually considered a mechanism user-controlled peel/unroll at all! I was totally focused on automatic transformations :-)
Do you foresee any problems with the implementation, or with the API I suggested above (adding PEEL and UNROLL options to INLINE pragmas, to make them effective on recursive definitions as well)?
The implementation I'm thinking of is basically trivial. You just add the information gathered from the pragmas onto the Ids, then have a dedicated core pass that looks at the pragmas and does its worker/wrapper thing. The technology to do peeling/unrolling is trivial, and there are already examples in the codebase (in case liberation and SAT). If someone can spec out what they actually want and GHC HQ give it the thumbs up, I would be happy to do the grunt work of implementing this feature.

I'm not so sure about the user interface - for the purposes of compatibility with other compilers' notion of INLINE, perhaps a dedicated PEEL / UNROLL pragma is a good idea. This would be less painful if we had a positional notation for pragmas - which has been mooted in the past wrt. the annotation system for compiler plugins (which IS in HEAD). AFAIK the only reason we don't have this is that we haven't had a discussion about how it should look. See "Future work" on the page http://hackage.haskell.org/trac/ghc/wiki/Annotations

Incidentally, Simon PJ has just made GHC warn about INLINE pragmas on recursive things (not something I'm totally sure is a good idea, since the compiler can make things non-recursive behind your back), but which you can justify by saying that /normally/ GHC won't INLINE recursive things, so it's misleading to have INLINE pragmas on them accepted. This can be taken as an argument against adding PEEL / UNROLL to INLINE.

Cheers, Max

On March 1, 2009 17:31:13 Max Bolingbroke wrote:
I am no assembly guru and haven't seen that last form of leaq either, but I'm going to guess that:
leaq (%rsi,%rsi,4), %rax
Says that rax is rsi * ((1 + 1) * 2 ^ 4) = rsi * 32
leaq 0(,%rax,8), %rsi
If I recall correctly, leaq is "load effective address" (i.e., it writes the address itself into the destination register instead of the data at that address). The address form is i(b,o,s) = i + b + o*s. You have (%rsi,%rsi,4) = %rsi + %rsi*4 into %rax, followed by 0(,%rax,8) = %rax*8 into %rsi, ultimately giving %rsi*40 into %rsi (which is the multiplication you have in the ghc-generated loop). (The restriction on the address form is that s must be one of 1, 2, 4, or 8.)

Interesting discussion by the way. : )

Cheers! -Tyson

The implementation I'm thinking of is basically trivial. You just add the information gathered from the pragmas onto the Ids, then have a dedicated core pass that looks at the pragmas and does its worker/wrapper thing. The technology to do peeling/unrolling is trivial, and there are already examples in the codebase (in case liberation and SAT). If someone can spec out what they actually want and GHC HQ give it the thumbs up, I would be happy to do the grunt work of implementing this feature.
Yes, please!-) My preferred spec would be roughly:

{-# NOINLINE f #-}
  as now

{-# INLINE f #-}
  works as now, which is for non-recursive f only (might in future be taken as go-ahead for analysis-based recursion unfolding)

{-# INLINE f PEEL n #-}
  inline calls *into* recursive f (called loop peeling for loops)

{-# INLINE f UNROLL m #-}
  inline recursive calls to f *inside* f (called loop unrolling for loops)

{-# INLINE f PEEL n UNROLL m #-}
  combine the previous two

The numeric parameters are to be interpreted as if each call to f was annotated with both PEEL and UNROLL limits, to be decreased as appropriate for every PEEL or UNROLL action. Peeling and unrolling stop when the respective count annotation has reached 0. Note that mutual recursion is the domain of PEEL, while UNROLL only applies to direct recursion.

{-# INLINE f PEEL n #-}, for n>0, corresponds to worker/wrapper transforms (previously done manually) + inline wrapper, and should therefore also be taken as a hint for the compiler to try the static argument transformation for f (the "worker"). Non-supporting implementations should treat these as INLINE pragmas (same warning/ignore or automatic unfold behaviour).

About the pragma name: as far as I can tell, Hugs simply ignores INLINE pragmas, no matter what they say; other implementations could just ignore the PEEL/UNROLL part (possibly with a warning) - do any of them support INLINE on recursive definitions? The only problem is that GHC itself fails with a parse error, which would lead to version issues (perhaps GHC should have allowed for additional information in otherwise syntactically complete pragmas, or warnings instead of errors, but that hitch is out in the wild now). Having separate PEEL/UNROLL pragmas would make ignoring them the default action, but would clutter the pragma name space as well as the source code; it also wouldn't make explicit that we are indeed refining the INLINE pragma for the case of recursive functions (which GHC currently ignores or complains about), by detailing how we want the recursive definition to be inlined.

Since we are talking about a refinement of the INLINE pragma, we also need to look at that pragma's existing subtleties:-(

- no functions inlined into f: should be subject to override by INLINE pragmas (even for the non-recursive case?)
- no float-in/float-out/cse: ??
- no worker/wrapper transform in strictness analyser: we do get the same effect from INLINE PEEL, so this should be okay, right?
- loop breakers: PEEL/UNROLL have their own limits, creating an intrinsic loop breaker when the counters run out

Is that sufficient?

Claus
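[To make those one-liners concrete, here is a hand-expansion - illustrative only, since no current GHC parses these pragmas - of what '{-# INLINE go PEEL 1 UNROLL 2 #-}' is intended to produce for a simple countdown loop:]

    {-# LANGUAGE BangPatterns #-}

    -- original definition
    go :: Int -> Int -> Int
    go !i !acc | i == 0    = acc
               | otherwise = go (i - 1) (acc * 40)

    -- PEEL 1: a non-recursive entry copy, free to inline at call sites
    goPeeled :: Int -> Int -> Int
    goPeeled !i !acc | i == 0    = acc
                     | otherwise = goWorker (i - 1) (acc * 40)

    -- UNROLL 2: the worker does two iterations per recursive call
    goWorker :: Int -> Int -> Int
    goWorker !i !acc
      | i == 0    = acc
      | otherwise = let !i1 = i - 1
                        !a1 = acc * 40
                    in if i1 == 0 then a1 else goWorker (i1 - 1) (a1 * 40)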

My preferred spec would be roughly
{-# NOINLINE f #-} as now
{-# INLINE f #-} works as now, which is for non-recursive f only (might in future be taken as go-ahead for analysis-based recursion unfolding)
{-# INLINE f PEEL n #-} inline calls *into* recursive f (called loop peeling for loops)
{-# INLINE f UNROLL m #-} inline recursive calls to f *inside* f (called loop unrolling for loops)
{-# INLINE f PEEL n UNROLL m #-} combine the previous two
The numeric parameters are to be interpreted as if each call to f was annotated with both PEEL and UNROLL limits, to be decreased as appropriate for every PEEL or UNROLL action.
hmm, "appropriate" is one of those words that shouldn't occur in specs, not even rough ones, so let's flesh this out a bit, by abstract example. let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}.. let f = ..f{n,m}.. in .. -UNROLL-> let f = ..|..f{n,m-1}..|.. in .. In words: the call being peeled/unrolled disappears, being replaced by a copy of the definition, in which the decremented counts are applied to the calls of the same function created by unfolding. Is that specific enough?
Peeling and unrolling stop when the respective count annotation has reached 0. Note that mutual recursion is the domain of PEEL, while UNROLL only applies to direct recursion.
{-# INLINE f PEEL n #-}, for n>0, corresponds to worker/wrapper transforms (previously done manually) + inline wrapper, and should therefore also be taken as a hint for the compiler to try the static argument transformation for f (the "worker").
Non-supporting implementations should treat these as INLINE pragmas (same warning/ignore or automatic unfold behaviour).
Since we are talking about a refinement of the INLINE pragma, we also need to look at that pragma's existing subtleties:-(
- no functions inlined into f: should be subject to override by INLINE pragmas (even for the non-recursive case?)
- no float-in/float-out/cse: ??
- no worker/wrapper transform in strictness analyser: we do get the same effect from INLINE PEEL, so this should be okay, right?
- loop breakers: PEEL/UNROLL have their own limits, creating an intrinsic loop breaker when the counters run out
Loop breakers are still needed, in spite of the explicit limits. Consider

  let { odd x = ..even{1,0}..; even x = ..odd{1,0}.. } in odd{1,0} n

Peeling odd gives a call to even, peeling of which gives a fresh, not decremented, call to odd! Unless one makes a copy of the whole mutual recursion, with the odd calls adjusted. This might be easier to handle in your "unfolding as a separate core2core pass" scenario, where the pass might keep track of unfoldings already done (instead of trying to encode that information locally, in annotations).

Other issues?

Claus
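[For concreteness, the example as plain Haskell - primed names avoid clashing with the Prelude, and the {n,m} peel/unroll limits are hypothetical annotations, shown as comments:]

    odd', even' :: Int -> Bool
    odd'  0 = False
    odd'  n = even' (n - 1)   -- a call annotated even'{1,0}
    even' 0 = True
    even' n = odd' (n - 1)    -- a call annotated odd'{1,0}
    -- peeling odd' copies in even''s body, whose odd' call carries a
    -- fresh {1,0} annotation - hence the need for a loop breaker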

On 2009 Mar 6, at 19:07, Claus Reinke wrote:
Loop breakers are still needed, in spite of the explicit limits. Consider
let {odd x = ..even{1,0}..; even x = ..odd{1,0}..} in odd{1,0} n
{-# INLINE odd even PEEL n #-} ? -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

2009/3/7 Claus Reinke
hmm, "appropriate" is one of those words that shouldn't occur in specs, not even rough ones, so let's flesh this out a bit, by abstract example.
let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..
Probably what you intend here is that you create one copy of the definition every round rather than one per call site, is that right? In the case of mutual recursion, I suppose something like this should happen:

  f = ... g ...
  g = ... f ...
==>
  f = ... g ...
  g = ... f ...
  f1 = ... g ...
  g1 = ... f ...

i.e. after peeling, f1 and g1 are free to be inlined into the use site if GHC decides that is a good idea. Similarly, for two rounds of peeling you would get:

  f = ... g ...
  g = ... f ...
  f1 = ... g ...
  g1 = ... f ...
  f2 = ... g1 ...
  g2 = ... f1 ...
let f = ..f{n,m}.. in .. -UNROLL-> let f = ..|..f{n,m-1}..|.. in ..
Similarly I suppose you intended that you get one copy of the body per UNROLL, rather than per call site? i.e.:

  f = ... f ...
==>
  f = ... f1 ...
  f1 = ... f2 ...
  f2 = ... f ...

I'm not completely convinced that this doesn't make sense for mutual recursion:

  f = ... g ...
  g = ... f ...
==>
  f = ... g ...
  g = ... f1 ...
  f1 = ... g1 ...
  g1 = ... f ...

I'm not quite sure how to generalize that though :-)
Non-supporting implementations should treat these as INLINE pragmas (same warning/ignore or automatic unfold behaviour).
Maybe they SHOULD do, but there are a lot of compilers out there in the real world that won't :-). Making these entirely new pragmas feels better to me.

I spoke to Simon PJ about these pragmas and he didn't sound terribly enthusiastic - but he suggested they would be a nice use case for compiler plugins :-). Plugins would only be capable of dealing with UNROLL / PEEL as new pragmas. Of course, this kind of relies on us getting plugins into the HEAD sometime...
- no functions inlined into f: should be subject to override by INLINE pragmas (even for the non-recursive case?)
If UNROLL / PEEL are separate annotations, we won't prevent inlining into the UNROLLed/PEELed thing. But that might be bad! What if we have:

  x = BIG

  {-# UNROLL f 3 #-}
  f = ... x ... f ...

Now if we unconditionally inline x into f as its only use site, we will end up bloating up the code if we later run the unroller. However, if we unroll first then the simplifier won't inline x and things will be good. So perhaps you are right to say that this should be an extension of INLINE.

As for the more general question about whether you should inline stuff inside INLINEs at all - well, AFAIK the latest work on this by Simon means that stuff /will/ be inlined inside them, but if that body is subsequently inlined, what gets inlined is the /original/ body as the user wrote it in his source code. This improves performance when for some reason a value doesn't get inlined.
- no float-in/float-out/cse: ??
The restriction on CSE is principally for NOINLINE things, to prevent messing with RULEs by changing identifiers around. I'm not sure if that is relevant here. No float-in makes sure we don't increase the size of things we are going to INLINE. This is important with UNROLL / PEEL for the same reason as above - another argument for this being an extension of INLINE so we inherit its semantics. I don't actually know why the no-float-out restriction exists (after all, it only makes the body smaller!) so I'm not sure what the right thing to do would be there.
- no worker/wrapper transform in strictness analyser: we do get the same effect from INLINE PEEL, so this should be okay, right?
Maybe I don't understand what you mean, but I don't think this is true. For example, w/w can unpack a strict argument of product type, but I don't think PEEL will let you achieve that. This restriction exists to prevent losing INLINE pragmas:

"""
Note [Don't w/w inline things]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's very important to refrain from w/w-ing an INLINE function.
If we do so by mistake we transform

  f = __inline (\x -> E)

into

  f = __inline (\x -> case x of (a,b) -> fw E)
  fw = \ab -> (__inline (\x -> E)) (a,b)

and the original __inline now vanishes, so E is no longer inside
its __inline wrapper. Death! Disaster!
"""

So we might want to prevent w/w-ing UNROLL/PEEL things for the same reasons.. but if we do the UNROLL/PEEL "pass" early enough (i.e. before strictness analysis - which is quite late in the pipeline) then this issue will go away.
- loop breakers: PEEL/UNROLL have their own limits, creating an intrinsic loop breaker when the counters run out
This might be easier to handle in your "unfolding as a separate core2core pass" scenario, where the pass might keep track of unfoldings already done (instead of trying to encode that information locally, in annotations).
I think that makes most sense. If we run it early enough we would be reasonably sure our program was close to what the user intended and hence could sidestep some of the restrictions on INLINE pragmas that are discussed above by simply not applying them to UNROLL/PEEL, though this doesn't feel like a terribly satisfactory solution. Cheers, Max

let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..
Probably what you intend here is that you create one copy of the definition every round rather than one per call site, is that right?
I don't think so - ultimately, the point of both peeling and unrolling is to unfold a definition into a use site, to enable further optimizations, not just to move from a recursive to a non-recursive definition. We could try to do it in two steps, as you suggest, but that would expose us to the heuristics of GHC inlining again (will or won't it inline the new shared definition?), in the middle of a user-annotation-based unfolding.

As for the remainder of your useful reply, I'll have to think more about how to make a local-rule-based approach work properly (without the hiccups of my first sketch) before I can think about the interactions. I still think it would be useful to have such a rule-based description, even if a monolithic core2core pass may be easier to implement: having two independent specs makes it easier to spot inconsistencies, and if the rule-based form doesn't get too complicated, it should be more suited for documentation.

Claus

2009/3/9 Claus Reinke
let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..
Probably what you intend here is that you create one copy of the definition every round rather than one per call site, is that right?
I don't think so - ultimately, the point of both peeling and unrolling is to unfold a definition into a use site, to enable further optimizations, not just to move from a recursive to a non-recursive definition. We could try to do it in two steps, as you suggest, but that would expose us to the heuristics of GHC inlining again (will or won't it inline the new shared definition?), in the middle of a user-annotation-based unfolding.
Ah - I was thinking of something a bit different, where:

* PEEL / UNROLL pragmas duplicate the method body once per level of peeling / unrolling and fix up the recursive calls as appropriate
* The user optionally adds an INLINE pragma to the function if he additionally wants to be SURE that those duplicates get inlined at the use sites

This means that PEEL / UNROLL represent nice logically-orthogonal bits of functionality to INLINE-ing. Furthermore, I'm not too keen on duplicating method bodies at call sites willy-nilly because it may lead to increased allocations (of the function closures) in inner loops. At least if you bind the duplicated methods at the same level as the thing you are duplicating you only increase the dynamic number of closures created by a constant factor!

I've actually been thinking about using a different strategy for case liberation (which duplicates method bodies at call sites) to make it more constructor-specialisation like (which duplicates method bodies at the definition site), partly for this reason.

Cheers, Max

let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..
Probably what you intend here is that you create one copy of the definition every round rather than one per call site, is that right?
I don't think so - ultimately, the point of both peeling and unrolling is to unfold a definition into a use site, to enable further optimizations, not just to move from a recursive to a non-recursive definition. We could try to do it in two steps, as you suggest, but that would expose us to the heuristics of GHC inlining again (will or won't it inline the new shared definition?), in the middle of a user-annotation-based unfolding.
Ah - I was thinking of something a bit different, where:
* PEEL / UNROLL pragmas duplicate the method body once per level of peeling / unrolling and fix up the recursive calls as appropriate
* The user optionally adds an INLINE pragma to the function if he additionally wants to be SURE that those duplicates get inlined at the use sites
Ok, I suspected as much. You'd need to make the 'INLINE f' apply to the generated 'fN', of course.
This means that PEEL / UNROLL represent nice logically-orthogonal bits of functionality to INLINE-ing.
Usually, I'm all for orthogonality, and for more knobs to allow hand-tuning of things that have no automatically reachable optimal solutions. In this case, however, I'm not sure anything would be gained. I recall that your hand-unrolled code was written in a similar style, and assumed that it was a question of style, which GHC would inline into the same code. But if you annotate all your unrolled and peeled new definitions as NOINLINE, do you still get the optimizations you want? There are probably a few GHC optimizations that can "look through" non-recursive lets, but RULES are not among those.

For loop-style recursion, there'd be only one use per definition, so inlining would be the default and there'd be no difference; but for non-loop-style recursion, inlining might not happen, and so no further optimizations would be enabled. Off the top of my head, I can't think of a case where that would lead to improved code, but as I'm discovering, I'm not very familiar with the details of what optimizations GHC is actually doing (though this is quite helpful: http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/HscMain ), so I might be missing something?
Furthermore, I'm not too keen on duplicating method bodies at call sites willy-nilly because it may lead to increased allocations (of the function closures) in inner loops. At least if you bind the duplicated methods at the same level as the thing you are duplicating you only increase the dynamic number of closures created by a constant factor!
Yes, every form of INLINE has its limits. But if users say they want inlining (or peeling, or unrolling, or any other form of unfolding), that's what they should get, including those worrisome duplications. The idea is to create lots of added code (in order to remove abstractions that might hide optimization opportunities), which will then be simplified to something smaller (or at least better performing) than what we started out with. Providing the means to fine-tune the amount of duplication might be useful, but preventing it entirely is not an option.

Claus

2009/3/9 Claus Reinke
NOINLINE, do you still get the optimizations you want? There are probably a few GHC optimizations that can "look through" non-recursive lets, but RULES are not among those.
The benefit that comes immediately to mind is extra freedom for the code generator. If we have several copies of the body of e.g. a loop it may be able to schedule instructions much better. This is why GCC unrolls loops, of course. Of course, Core may not be the best place to do this sort of unrolling as Roman pointed out earlier in the thread. But yeah, beyond this I don't /think/ that non-inlined duplications would help GHC at all (it might be a different story if we did partial inlining). All the best, Max

Claus, Max

| > My preferred spec would be roughly
| >
| > {-# NOINLINE f #-}
| >   as now
| >
| > {-# INLINE f #-}
| >   works as now, which is for non-recursive f only (might in future
| >   be taken as go-ahead for analysis-based recursion unfolding)
| >
| > {-# INLINE f PEEL n #-}
| >   inline calls *into* recursive f (called loop peeling for loops)
| >
| > {-# INLINE f UNROLL m #-}
| >   inline recursive calls to f *inside* f (called loop unrolling for loops)
| >
| > {-# INLINE f PEEL n UNROLL m #-}
| >   combine the previous two

Sounds as if you two are evolving a good design, thank you. I am not following the details closely, but I have the advantage of being able to chat to Max directly.

Suggestion: if after discussion you think this is a valuable thing to do, write a GHC-Trac-Wiki page describing the design as precisely as possible (eg with examples; I find the above one-liners hard to grok). Along with any major design alternatives. Ideally with a few indicative measurements gotten by by-hand transformations, that show there are real benefits to be had.

For implementation, there are two routes. Either totally built-in, or using a Core-to-Core plug-in. The thing I like about the latter is that it can be done without having GHC HQ in the critical path, because we (I) tend to slow things down, being a uniprocessor. We don't have the plug-in capability yet, but I'm encouraging Max to polish it up so that we do. I think it'd be a very valuable facility.

Simon

Hello Simon, Thursday, March 12, 2009, 1:29:56 AM, you wrote:
For implementation, there are two routes. Either totally built-in, or using a Core-to-Core plug-in. The thing I like about the latter is that it can be done without having GHC HQ in the critical path, because we (I) tend to slow things down, being a uniprocessor. We don't have the plug-in capability yet, but I'm encouraging Max to polish it up so that we do. I think it'd be a very valuable facility.
As GHC's popularity grows, it may become more profitable to spend resources on making it a more open, pluggable "compiler factory" rather than doing everything yourself.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 07/03/2009, at 09:26, Claus Reinke wrote:
My preferred spec would be roughly
{-# NOINLINE f #-} as now
{-# INLINE f #-} works as now, which is for non-recursive f only (might in future be taken as go-ahead for analysis-based recursion unfolding)
{-# INLINE f PEEL n #-}
  inline calls *into* recursive f (called loop peeling for loops)

{-# INLINE f UNROLL m #-}
  inline recursive calls to f *inside* f (called loop unrolling for loops)
{-# INLINE f PEEL n UNROLL m #-} combine the previous two
The problem here is that this only works for directly recursive functions which I, for instance, don't normally use in high-performance code. Most of my loops are pipelines of collective combinators like map, filter, fold etc. because these are the ones that can be fused automatically. Unless I'm misunderstanding something, this approach doesn't handle such cases.

Roman

2009/3/9 Roman Leshchinskiy
The problem here is that this only works for directly recursive functions which I, for instance, don't normally use in high-performance code. Most of my loops are pipelines of collective combinators like map, filter, fold etc. because these are the ones that can be fused automatically. Unless I'm misunderstanding something, this approach doesn't handle such cases.
Yep, I think this is an orthogonal piece of functionality. I believe Claus is concerned with getting the compiler to perform some of the transformations people currently might want to do for their directly recursive functions. Of course, you could still UNROLL your unstream definition, but that doesn't give the user any control over the amount of unrolling that takes place, which as you have pointed out earlier may not be a great idea! Cheers, Max

{-# INLINE f PEEL n #-}
inline calls *into* recursive f (called loop peeling for loops)

{-# INLINE f UNROLL m #-}
inline recursive calls to f *inside* f (called loop unrolling for loops)

{-# INLINE f PEEL n UNROLL m #-}
combine the previous two
The problem here is that this only works for directly recursive functions which I, for instance, don't normally use in high-performance code. Most of my loops are pipelines of collective combinators like map, filter, fold etc. because these are the ones that can be fused automatically. Unless I'm misunderstanding something, this approach doesn't handle such cases.
Actually, my first sketch had a problem in that it would work only too well for mutually recursive functions, making it necessary to use loop breakers in spite of the explicit limits (even if we limit unroll to direct recursion, as I intended originally, peeling would then apply to the calls into other functions in the recursion).

One way out would be to treat the whole mutual recursion as a single entity, either implicitly, as I indicated, or explicitly, as I interpret Brandon's somewhat ambiguous comment. In other words, the peel/unroll limits would apply to a whole group of mutually recursive definitions, ensuring termination of the inline process without additional loop breakers. If we do that, then it might make sense to talk about peeling/unrolling wrt the whole recursion group.

In any case, I need to refine my spec!-) But this discussion is very helpful in finding the issues that need to be addressed and clarified.

Another issue that I ran into in manual unrolling is that I sometimes want to unroll wrt a specific parameter of a multi-parameter function, usually because that parameter can only have a very small number of possible values, or just because the original function encodes multiple loops that I want to disentangle.

Claus
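P.S. A hand-made example of that last point (names entirely hypothetical): a function whose first parameter ranges over only two values is really two loops entangled on that parameter, and "unrolling wrt that parameter" splits them apart:
"""
-- Two loops entangled on the Bool 'mode': every iteration re-tests it.
step :: Bool -> Int -> Int -> Int
step _    acc 0 = acc
step mode acc n
  | mode      = step False (acc + n) (n - 1)
  | otherwise = step True  (acc * 2) (n - 1)

-- After unrolling wrt the first parameter: two mutually recursive
-- specialisations, with no test on 'mode' left in either body.
stepT, stepF :: Int -> Int -> Int
stepT acc 0 = acc
stepT acc n = stepF (acc + n) (n - 1)
stepF acc 0 = acc
stepF acc n = stepT (acc * 2) (n - 1)
"""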

On 2009 Mar 9, at 9:32, Claus Reinke wrote:
One way out would be to treat the whole mutual recursion as a single entity, either implicitly, as I indicated, or explicitly, as I interpret Brandon's somewhat ambiguous comment. In other words, the peel/unroll limits would apply to a whole group of mutually
Sorry, yes, I intended that the unrolling applied explicitly to a group of mutually recursive functions. I'm not sure if the unroll/peel counts should be multiplied by the number of functions, though.

-- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

{-# INLINE f PEEL n UNROLL m #-}
The problem here is that this only works for directly recursive functions which I, for instance, don't normally use in high-performance code. Most of my loops are pipelines of collective combinators like map, filter, fold etc. because these are the ones that can be fused automatically. Unless I'm misunderstanding something, this approach doesn't handle such cases.
If the map, filter, fold, etc. can be unrolled, then the unrolled definitions would be fused, right? So what is missing is fine control ("how much to unroll this particular call to map here").

Would it help to allow {-# INLINE map PEEL n UNROLL m #-} in the caller modules as well as the definition modules, with the latter providing a general case/upper limit, and the former providing finer control?

If you wanted even finer control, one would need a way to specify named copies of inlineable recursion combinators, with PEEL/UNROLL attached to the copies.. I see how this would need addressing, but I don't yet see a good way to specify call-site-specific PEEL/UNROLL for recursion combinators. Unless you want to control it by adding combinators for the purpose?-)

Claus
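P.S. A rough sketch of the "named copies" idea. The pragma uses the proposed syntax from this thread, so today's GHC would reject it, hence it is left in a comment; the names are hypothetical:
"""
-- A private, directly recursive copy of map that a caller could
-- annotate with its own unrolling behaviour.
mapU8 :: (a -> b) -> [a] -> [b]
mapU8 f = go
  where
    go []     = []
    go (x:xs) = f x : go xs
-- {-# INLINE mapU8 PEEL 1 UNROLL 8 #-}  -- proposed syntax, not valid in any released GHC

-- Callers then choose their unrolling by choosing the name:
hotPath :: [Int] -> [Int]
hotPath = mapU8 (* 2)
"""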

2009/3/19 Claus Reinke
If the map, filter, fold, etc can be unrolled, then the unrolled definitions would be fused, right? So what is missing is fine control ("how much to unroll this particular call to map here").
The issue is that in stream fusion the combinators like "map" are all non-recursive, and so unrolling/peeling doesn't make any sense. In fact, their being non-recursive is almost the whole point, because it lets GHC inline them like crazy and hence build a nice efficient fused pipeline of combinators eventually. The recursion is introduced purely in one place - unstream - and even then it doesn't go through unstream itself but through a locally recursive wrapper (so GHC can see the structure of the stream).

So, it might be sufficient if:

1) You changed stream fusion so unstream was directly recursive, but added an INLINE PEEL 1 annotation to it, so if the call site doesn't do any unrolling at least you will still be able to spot the structure of the stream

2) You could, at the call site, add an INLINE PEEL 1 UNROLL n annotation that took the /original/ RHS for unstream and unrolled it however many times the user specifies (you still need a PEEL 1 so you can spot the stream structure in the unrolled loop)

Unfortunately, this all feels quite fragile :-(

Max
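P.S. For anyone without the library source to hand, here is a minimal sketch of the structure I mean, simplified to lists - these are illustrative stand-ins, not uvector's actual definitions. Note that mapS is non-recursive; the only recursion in the whole pipeline is the local 'go' inside unstream, so any peeling/unrolling need only target that one loop:
"""
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Done | Skip s | Yield a s
data Stream a = forall s. Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs
{-# INLINE stream #-}

-- The single place where recursion enters: a locally recursive loop.
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'
{-# INLINE unstream #-}

-- A non-recursive combinator: just wraps a new stepper around the old one.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'
{-# INLINE mapS #-}
"""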

Recursion unfolding spec, 2nd attempt.

The main difference is to look at groups of mutually recursive definitions as a whole, rather than trying to think about individual definitions. That step actually seems sufficient to address most of the shortcomings raised so far, such as avoiding runaway INLINE or using PEEL/UNROLL also for mutually recursive definitions. I've also interpreted Max's comments as most of the existing INLINE restrictions still making sense for recursive INLINE, with small clarifications.

In the following, let REC({f g ..}) denote the set of all identifiers belonging to the recursion involving f, g, .. (f, g, .. in REC({f g ..}) or in {-# INLINE f g .. #-} are required to belong to the same recursion).

{-# NOINLINE f #-}
as now: no unfolding of f

{-# INLINE f #-}
as now: for non-recursive f only, unfold definition of f at call sites of f (might in future be taken as go-ahead for analysis-based recursion unfolding)

{-# INLINE f g .. PEEL n #-}
new: unfold definitions of the named identifiers at their call sites *outside* their recursion group REC({f g ..}). In other words, *entries into* REC({f g ..}) via f, g, .. are unfolded. (for the special case of loops this corresponds to loop peeling)

{-# INLINE f g .. UNROLL m #-}
new: unfold definitions of the named identifiers at their call sites *inside* their recursion group REC({f g ..}). In other words, *cross-references inside* REC({f g ..}) via f, g, .. are unfolded. (for the special case of loops this corresponds to loop unrolling)

{-# INLINE f g .. PEEL n UNROLL m #-}
combine the previous two

The numeric parameters are to be interpreted as if each call to f, g, .. was annotated with both PEEL and UNROLL limits for the whole recursion group REC({f g ..}), starting with the limits from the pragmas (write f_n_m for a call to f with PEEL limit n and UNROLL limit m), to be decreased for every PEEL or UNROLL action, as follows (REC({f g}) = {f g h}, in these examples):

1. let {-# INLINE f g PEEL n UNROLL m #-}
       f .. = .. f_?_? .. g_?_? .. h_0_0 ..
       g .. = .. f_?_? .. g_?_? .. h_0_0 ..
       h .. = .. f_?_? .. g_?_? .. h_0_0 ..
   in ..|f_n_m|..

   --PEEL-->

   let {-# INLINE f g PEEL n UNROLL m #-}
       f .. = .. f_?_? .. g_?_? .. h_0_0 ..
       g .. = .. f_?_? .. g_?_? .. h_0_0 ..
       h .. = .. f_?_? .. g_?_? .. h_0_0 ..
   in ..|.. f_(n-1)_0 .. g_(n-1)_0 .. h_0_0 ..|..

   Notes:
   - unfolding produces copies of definition bodies
   - the PEEL limit at the call site decides the PEEL limit for all calls to REC({f g}) in the inlined copy; this limit decreases with each PEEL step
   - since peeling unfolds code into call sites from outside the recursion, the UNROLL limits of calls to REC({f g}) are effectively 0 in the inlined copy
   - only calls to identifiers named in the INLINE pragma can be peeled (f and g here), calls to other members of the same recursion remain unaffected (h here), having effective limits of 0

2. let {-# INLINE f g PEEL n UNROLL m #-}
       f .. = .. f_0_m .. g_?_? .. h_0_0 ..
       g .. = .. f_?_? .. g_?_? .. h_0_0 ..
       h .. = .. f_?_? .. g_?_? .. h_0_0 ..
   in ..

   --UNROLL-->

   let {-# INLINE f g PEEL n UNROLL m #-}
       f .. = .. .. f_0_(m-1) .. g_0_(m-1) .. h_0_0 .. .. g_?_? .. h_0_0 ..
       g .. = .. f_?_? .. g_?_? .. h_0_0 ..
       h .. = .. f_?_? .. g_?_? .. h_0_0 ..
   in ..
   Notes:
   - unfolding produces copies of definition bodies
   - the UNROLL limit at the call site decides the UNROLL limit for all calls to REC({f g}) in the inlined copy; this limit decreases with each UNROLL step
   - peeling conceptually precedes unrolling (PEEL limit needs to reach 0 before unrolling commences), to avoid peeling unrolled definitions (this corresponds to an existing restriction of no inlining into definitions to be inlined)
   - unrolling unfolds copies of the original definitions, not the already unrolled ones, again corresponding to the existing inlining restriction (TODO: how to specify this avoidance of unrolling unrolled defs in this form of local rule spec?)
   - only calls to identifiers named in the INLINE pragma can be unrolled (f and g here), calls to other members of the same recursion remain unaffected (h here), having effective limits of 0

Peeling and unrolling stop when the respective count annotation has reached 0. Peeling precedes unrolling, to avoid ambiguities in the size of the peeled definitions. Note that mutual recursion is the domain of PEEL, while UNROLL only applies to (mutual) recursion.

{-# INLINE f PEEL n #-}, for n>0, corresponds to worker/wrapper transforms (previously done manually) + inline wrapper, and should therefore also be taken as a hint for the compiler to try the static argument transformation for f (the "worker").

Non-supporting implementations should treat these as INLINE pragmas (same warning/ignore or automatic unfold behaviour). This might be easier to accomplish if INLINE PEEL/UNROLL were implemented as separate pragmas, even though they are refinements of INLINE conceptually.

About the current side-conditions for INLINE pragmas:

- no functions inlined into f: still makes sense for PEEL, needs to be adapted with an exception for UNROLL, in that we want to be able to unroll into the function being unrolled, but we want to use the original body for the unrolling, not an already unrolled one (else unrolling would be exponential rather than linear); this appears to be in line with existing work on INLINE
- no float-in/float-out/cse: similar to existing INLINE
- no worker/wrapper transform in strictness analyser: similar to existing INLINE
- loop breakers: PEEL/UNROLL have their own limits, applicable to the whole recursion group, creating intrinsic loop breakers when the counters run out. Every PEEL or UNROLL action creates calls with smaller counters in the inlined copies, if the calls go into the same recursion.

If this is an improvement on the first version, and after correcting any obvious issues, I should put it on the ghc trac wiki somewhere, and create a feature request ticket.

Claus
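P.S. To make the counters concrete, here is roughly what one UNROLL step and one PEEL step might produce for a simple self-recursive loop, worked by hand under one plausible reading of the spec. All names are hypothetical, and the proposed pragma is left in a comment since no compiler accepts it:
"""
-- {-# INLINE count PEEL 1 UNROLL 1 #-}   -- proposed syntax
count :: Int -> Int -> Int
count acc 0 = acc
count acc n = count (acc + n) (n - 1)

-- After one UNROLL step: the body of the recursive call is inlined
-- once, so each trip through the loop now does two iterations' work.
countU :: Int -> Int -> Int
countU acc 0 = acc
countU acc n =
  let acc' = acc + n
      n'   = n - 1
  in case n' of
       0 -> acc'
       _ -> countU (acc' + n') (n' - 1)

-- After one PEEL step at a call site 'count 0 n': the entry into the
-- recursion is unfolded there, exposing the first iteration to the
-- surrounding context, with the remaining calls going to the worker.
countPeeled :: Int -> Int
countPeeled n = case n of
  0 -> 0
  _ -> countU (0 + n) (n - 1)
"""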

2009/3/19 Claus Reinke
Recursion unfolding spec, 2nd attempt.
....
If this is an improvement on the first version, and after correcting any obvious issues, I should put it on the ghc trac wiki somewhere, and create a feature request ticket.
I can't see any issues with this version of the spec. I think in the implementation it makes most sense to do this as a core2core pass at an early stage in the pipeline, probably via plugins (so it will have to wait until I get those into HEAD).

In the case of PEEL, we don't want to identify all call sites directly and do the substitution in the pass, so we should just output some bindings which will certainly be inlined into call sites later on. So, the transformation should be bottom up on the Core syntax tree, and when it meets a recursive group of bindings we should do something like this:

{-# INLINE f g PEEL 3 UNROLL 2 #-}
f = ... g ... f ... h ...
g = ... g ... f ... h ...
h = ... g ... f ... h ...

=(my pass)=>

-- Temporary copies of f and g - dead code
f_old = ... g_old ... f_old ... h ...
g_old = ... g_old ... f_old ... h ...

-- h unchanged for now, might get PEELed stuff inlined later
h = ... g ... f ... h ...

-- Top level unrolled definition - if we weren't doing peeling, these would be the new f and g
f_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...
g_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...

-- Unrolled iteration. Will get inlined into f_unrolled / g_unrolled soon
{-# INLINE f_unrolled_1 g_unrolled_1 #-}
f_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...
g_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...

-- One level of peeling
{-# INLINE f_1 g_1 #-}
f_1 = ... g_unrolled ... f_unrolled ... h ...
g_1 = ... g_unrolled ... f_unrolled ... h ...

-- Second level of peeling
{-# INLINE f_2 g_2 #-}
f_2 = ... g_1 ... f_1 ... h ...
g_2 = ... g_1 ... f_1 ... h ...

-- Final level of peeling and new definitions for f and g. Inline pragmas
-- make sure all of this gets inlined at the call site
{-# INLINE f g #-}
f = ... g_2 ... f_2 ... h ...
g = ... g_2 ... f_2 ... h ...

=(after the simplifier has run - effectively - there are a few harmless lies here)=>

-- NB: I haven't shown inlining of the new f and g here, but it /will/ happen
h = ... g ... f ... h ...

-- I've inlined the inner unrolled iteration at every /call site/ within the top level unrolled iteration, as per
-- the pragmas. No one actually calls this unrolled thing directly though, since we used PEEL as well
f_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...
g_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...

-- This huge chunk of code gets inlined at every call site, which in turn calls through to the unrolled bodies
{-# INLINE f g #-}
f = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... h ...
g = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... h ...

By ensuring that f and g are tagged INLINE we get the existing INLINE restrictions automatically in later Core passes. I think that this example transformation matches your spec - am I right?

Cheers, Max

I can't see any issues with this version of the spec.
Thanks. From the silence, we seem to have lost the innocent bystanders? Anyway, for those who haven't noticed, there is now a feature request ticket (for that good feeling of closing it when this is finally implemented;-) as well as a wiki page describing the issues, spec, and examples:

http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining
I think in the implementation it makes most sense to do this as a core2core pass at an early stage in the pipeline, probably via plugins (so will have to wait until I get those into HEAD).
What are the plans for plugin support? I do think plugins will be useful, but inlining is pretty central to the existing optimizer transformations, isn't it? Would the transformation code differ much between in-GHC and via-plugins? Perhaps the transformation pass could be implemented now, and later moved out into a plugin, possibly along with other passes. (A sketch of what I imagine follows below.)

I have also been wondering about the relation between rewrite RULES and plugins. Assuming we can find a more convenient syntax, aren't plugin+syb-based rewrites going to be more expressive, with more control, than RULES? Or is the syntactic/compile-time overhead going to remain so high that both RULES and plugins will be kept in GHC? (cf the recent thread on "optimization and rewrite rules questions" http://www.haskell.org/pipermail/glasgow-haskell-users/2009-February/016702.... )
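To make the question concrete, I imagine the pass would eventually hang off a plugin skeleton roughly like this. Every name below - the module, the record fields, the constructors - is an assumption about an API that doesn't exist yet, sketched after what Max has described; treat it as illustration only:
"""
-- Minimal sketch of a Core-to-Core plugin pass (assumed API; the
-- names below match what a plugins interface might export, not
-- anything available today).
module UnrollPlugin (plugin) where

import GHC.Plugins

plugin :: Plugin
plugin = defaultPlugin { installCoreToDos = install }

-- Run our pass before the existing core2core pipeline.
install :: [CommandLineOption] -> [CoreToDo] -> CoreM [CoreToDo]
install _ todos = return (CoreDoPluginPass "peel/unroll" pass : todos)

-- The real pass would rewrite the recursive binding groups in
-- mg_binds; this stub just passes the module through unchanged.
pass :: ModGuts -> CoreM ModGuts
pass guts = do
  putMsgS "peel/unroll: would transform recursive binding groups here"
  return guts
"""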
In the case of PEEL, we don't want to identify all call sites directly and do the substitution in the pass so we should just output some bindings which will certainly be inlined into call sites later on. So, the transformation should be bottom up on the Core syntax tree and when it meets a recursive group of bindings we should do something like this:
{-# INLINE f g PEEL 3 UNROLL 2 #-}
f = ... g ... f ... h ...
g = ... g ... f ... h ...
h = ... g ... f ... h ...
=(my pass)=>
-- Temporary copies of f and g - dead code
f_old = ... g_old ... f_old ... h ...
g_old = ... g_old ... f_old ... h ...

-- h unchanged for now, might get PEELed stuff inlined later
h = ... g ... f ... h ...
You mean UNROLLed stuff (PEEL is only for entries into the group).
-- Top level unrolled definition - if we weren't doing peeling, these would be the new f and g
f_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...
g_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...
-- Unrolled iteration. Will get inlined into f_unrolled / g_unrolled soon
{-# INLINE f_unrolled_1 g_unrolled_1 #-}
f_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...
g_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...
Ah, yes, we need to be unambiguous about the interpretation of the counters:-) I was thinking of n+1 (adding n copies to the original), you are thinking of n (adding copies until there are n).
-- One level of peeling
{-# INLINE f_1 g_1 #-}
f_1 = ... g_unrolled ... f_unrolled ... h ...
g_1 = ... g_unrolled ... f_unrolled ... h ...
-- Second level of peeling
{-# INLINE f_2 g_2 #-}
f_2 = ... g_1 ... f_1 ... h ...
g_2 = ... g_1 ... f_1 ... h ...
-- Final level of peeling and new definitions for f and g. Inline pragmas
-- make sure all of this gets inlined at the call site
{-# INLINE f g #-}
f = ... g_2 ... f_2 ... h ...
g = ... g_2 ... f_2 ... h ...
Wait, now you are counting to n+1 for PEEL and to n for UNROLL?
=(after the simplifier has run - effectively - there are a few harmless lies here)=>
-- NB: I haven't shown inlining of the new f and g here, but it /will/ happen
h = ... g ... f ... h ...
Since we are interpreting recursive groups as single entities, and there is usually no inlining into definitions that will get inlined, we will have to specify this carefully.
-- I've inlined the inner unrolled iteration at every /call site/ within the top level unrolled iteration, as per
-- the pragmas. No one actually calls this unrolled thing directly though, since we used PEEL as well
f_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...
g_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...
Again, we have to make sure of this interpretation.
-- This huge chunk of code gets inlined at every call site, which in turn calls through to the unrolled bodies
{-# INLINE f g #-}
f = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... h ...
g = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (... g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ... f_unrolled ... h ...) ... h ...) ... h ...
So this would be the result of inlining all the PEEL instances into 'f' and 'g'.
By ensuring that f and g are tagged INLINE we get the existing INLINE restrictions automatically in later Core passes.
So the INLINE gets added after your pass is through, so that it isn't affected, but later passes are. But what if there are multiple such PEEL/UNROLL definitions handled by the one pass? Since the pass doesn't do general INLINE, that is out of the way, but wouldn't it still PEEL stuff from one group into the definitions from another group, even if those definitions themselves are about to be PEELed/INLINEd? And do we want that or not?
I think that this example transformation matches your spec - am I right?
Looks mostly right, apart from the ambiguities I mentioned. Could you please add your implementation sketch to the wiki page? Claus
participants (9):
- Brandon S. Allbery KF8NH
- Bulat Ziganshin
- Claus Reinke
- Don Stewart
- Max Bolingbroke
- Roman Leshchinskiy
- Simon Marlow
- Simon Peyton-Jones
- Tyson Whitehead