Re: Loop unrolling + fusion ?

1 Mar 2009

      2009/3/1 Claus Reinke :
...
What is the issue with concatMap?
ConcatMap doesn't usually fuse under stream fusion - see
http://www.cse.unsw.edu.au/~dons/papers/stream-fusion.pdf for the gory
details.
...
It sounds like your specialization
is based on the recursion equivalent to loop peeling (unrolling the
"non-recursive" calls, the entry points into the recursion), a close variant
of loop unrolling (unrolling the recursive calls, inside the loop
body).
This sounds right - to get concatMap specialised I "unpeel" the
unstream loop (which has been slightly modified) 4 iterations, after
which unstream recurses back into itself in a tight loop. This lets
GHC specialise the first 3 iterations however it likes. This is
achieved by the spec4 combinator in the code I posted.
...
If followed by the static argument transformation, that might cover
the majority of hand-written worker-wrapper pairs (replacing manual by
compiler optimization is always nice).
Right. Since GHC is so blind to recursion at the moment this could be
a substantial win (though my gut tells me that SAT alone is a large
part of the win here).
...
So, instead of splitting recursive 'f' into 'fWorker' and 'fWrapper', with
an INLINE pragma for 'fWrapper', one might in future be able just to say
'{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling
is also desirable '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And GHC would do the
work, by unfolding the entry points once (the
inlining of the wrapper), unfolding the recursive calls 8 times (the
loop unrolling), and taking the INLINE PEEL pragma also as a hint to try the
static argument transformation.
Right, and INLINE PEEL might be a nice interface for the user. Of
course, we'd probably want an automated system for working out when
this is a good idea as well - in the same way that we have INLINE
pragmas and a load of inlining heuristics.
...
It seems that strength reduction could be seen as loop restructuring
in the small: a multiplication, if seen as a repeated addition, can be
unrolled, or the implicit adding loop can be fused with the explicit loop in
which multiplication is called on (eg figure 7 in the ACM survey paper I
mentioned).
I hadn't thought about it in quite those terms before - cute :-)
...
That way, no separate framework
would be needed for strength reduction. Btw, is that also what happened in
the -fvia-C path of Don's initial example in this thread (I don't know how
to read those leaqs, but the imulq is gone)?
I am no assembly guru and haven't seen that last form of leaq either,
but I'm going to guess that:

     leaq        (%rsi,%rsi,4), %rax

Says that rax is rsi * ((1 + 1) * 2 ^ 4) = rsi * 32

     leaq        0(,%rax,8), %rsi

And that this finishes it off by adding the final 8* to the mix. So it
makes the multiplication easier by breaking it into two
multiplications by powers of two. Smart, but you don't need any loop
unrolling tech to do it.
...
But all these follow-on optimizations enabled by unfolding recursive
definitions seem to require further thought and design, whereas
user-controlled recursion unfolding (both peel and unroll) seems
to offer immediate benefits. Is that part of your current work?
I hadn't actually considered a mechanism user-controlled peel/unroll
at all! I was totally focused on automatic transformations :-)
...
Do you forsee any problems with the implementation, or with
the API I suggested above (adding PEEL and UNROLL options
to INLINE pragmas, to make them effective on recursive
definitions as well)?
The implementation I'm thinking of is basically trivial. You just add
the information gathered from the pragmas onto the Ids, then have a
dedicated core pass that looks at the pragmas and does it's
worker/wrapper thing. The technology to do peeling/unrolling is
trivial and there already examples in the codebase (in case liberation
and SAT). If someone can spec out what they actually want and GHC HQ
give it the thumbs up I would be happy to do the grunt work on
implementing this feature.

I'm not so sure about the user interface - for the purposes of
compatibility with other compiler's notion of INLINE perhaps a
dedicated PEEL / UNROLL pragma is a good idea. This would be less
painful if we had a positional  notation for pragmas - which has been
mooted in the past wrt. the annotation system for compiler plugins
(which IS in HEAD). AFAIK the only reason we don't have this is that
we haven't had a discussion about how it should look. See "Future
work" on the page http://hackage.haskell.org/trac/ghc/wiki/Annotations

Incidentally, Simon PJ has just made GHC warn about INLINE pragmas on
recursive things (not something I totally sure is a good idea, since
the compiler can make things non-recursive behind your back) but which
you can justify by saying that /normally/ GHC won't INLINE recursive
things, so it's misleading to have INLINE pragmas on them accepted.
This can be taken as an argument against adding PEEL / UNROLL to
INLINE.

Cheers,
Max

Re: Loop unrolling + fusion ?

Max Bolingbroke