
I spent some time looking at the code generated for LLVM and the optimizations it can apply. There were quite a few details to examine, so I wrote it up as a blog post here: http://www.dmpots.com/blog/2010/11/05/optimizing-haskell-loops-with-llvm.htm....

To summarize, I found that it is possible to get LLVM to do this transformation through a combination of tail-call elimination, inlining, induction variable optimization, and global value numbering (a source-level sketch of the end result is at the end of this message). This works fine on x86_64, where we can pass parameters in registers, but it fails to fully fire in the i386 back end, where function parameters are passed on the stack and LLVM gets caught up by aliasing problems: the possible aliasing between the stack pointer (Sp) and the function argument pointer (R1) prevents the full transformation, although LLVM was still able to reduce the f loop to straight-line code.

Exploring the details of the code generation for Haskell loops was a useful exercise. I found several sources of problems for optimizing the generated code.

1. The ability of LLVM to optimize Haskell functions is limited by the calling convention. Particularly on i386, function arguments are passed on a stack that LLVM knows nothing about. The reads and writes to that stack look like arbitrary loads and stores; LLVM has no notion of popping elements from the stack, which makes it difficult to know when it is safe to eliminate stores to the stack.

2. The possible aliasing introduced by casting integer arguments (R1-R6) to pointers limits the effectiveness of LLVM's optimizations.

3. A single Haskell source function is broken up into many small functions in the back end: every evaluation of a case statement requires a new continuation point. These small functions kill the optimization context for LLVM. LLVM can recover some of the context by inlining calls to known functions, but the effectiveness of inlining is limited because it does not know that we are passing some parameters on the stack rather than through the actual function call.

4. The order of optimizations matters. We saw that just running `-O2` on the code may not be enough to get the full optimization effects. To get the full benefits of inlining in the x86_64 back end, we had to use the heavyweight sequence `-O2 -inline -std-compile-opts`.

I am interested in exploring several different opportunities.

* Make the Cmm more friendly to LLVM by inlining and building loops in Cmm.
  I think LLVM would benefit a lot from having a larger optimization context. We could relieve some of the burden on LLVM by doing some inlining and eliminating tail calls in the Cmm itself. GHC knows that it is passing arguments on the stack, so it should be able to inline and turn tail calls into loops much better than LLVM can.

* Different calling conventions.
  All the functions in the code generated for LLVM use the same calling convention fixed by GHC. It would be interesting to see whether we could generate LLVM code that passes all the arguments a function needs as actual arguments. We could then let LLVM do its optimizations and have a later pass spill the extra arguments to the stack and make the functions use the correct GHC calling convention.

* Specialization of code after a runtime alias check.
  We could specialize the code into two cases, one where some pointers may alias and one where they do not, and let LLVM fully optimize the version with no aliases. A check inserted at runtime would decide whether there are aliases and call the appropriate version.

* Optimization order matters.
  There are probably some wins to be had by choosing a good optimization sequence for the code generated from GHC, rather than just using `-O1`, `-O2`, etc. I believe it should be possible to find an optimization sequence that works well for Haskell code.
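For concreteness, here is a rough source-level sketch (mine, not generated code) of what that combination of optimizations amounts to for the call pattern f k k 0 0 from Christian's original message: since both counters start at the same value and step in lockstep, the second one can be eliminated, leaving a single induction variable.

{-# LANGUAGE BangPatterns #-}

-- Illustrative only: a Haskell rendering of the single-counter loop that
-- tail-call elimination, inlining, induction variable optimization, and
-- GVN can recover from  f k k 0 0  on x86_64.  LLVM performs the rewrite
-- on the generated IR, not on Haskell source; the names here are mine.
fMerged :: Int -> Int -> Int -> Int
fMerged !i !s !m
  | i == 0    = s + m
  | otherwise = fMerged (i - 1) (s + i + 1) (m + i * 5)

gMerged :: Int -> Int
gMerged !k = fMerged k 0 0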
-David

On Nov 4, 2010, at 5:29 AM, Christian Hoener zu Siederdissen wrote:
Here it is, feel free to change: http://hackage.haskell.org/trac/ghc/ticket/4470
I have added the core for the sub-optimal function 'f'. Criterion benchmarks are there, too. It doesn't make much of a difference for this case -- I'd guess because everything fits into registers here, anyway.
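To give an idea of the shape of the comparison, the harness looks roughly like this (only a rough sketch, not the exact code on the ticket; the input size and labels are arbitrary):

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Criterion.Main (bench, defaultMain, whnf)

-- f and ff repeated from the quoted message below so the sketch is
-- self-contained: the two-counter loop vs. the hand-merged one.
f :: Int -> Int -> Int -> Int -> Int
f !i !j !s !m
  | i == 0    = s + m
  | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)

ff :: Int -> Int -> Int -> Int
ff !i !s !m
  | i == 0    = s + m
  | otherwise = ff (i-1) (s + i+1) (m + i*5)

main :: IO ()
main = defaultMain
  [ bench "two counters (f)" $ whnf (\k -> f k k 0 0) 1000000
  , bench "one counter (ff)" $ whnf (\k -> ff k 0 0)  1000000
  ]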
Regards,
Christian
On 11/04/2010 09:42 AM, Simon Peyton-Jones wrote:
Interesting. What would it look like in Core? Anyone care to make a ticket?
S
| -----Original Message-----
| From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-bounces@haskell.org] On Behalf Of Roman Leshchinskiy
| Sent: 03 November 2010 10:55
| To: Christian Hoener zu Siederdissen
| Cc: glasgow-haskell-users@haskell.org
| Subject: Re: Loop optimisation with identical counters
|
| LLVM doesn't eliminate the counters. FWIW, fixing this would improve performance of
| stream fusion code quite a bit. It's very easy to do in Core.
|
| Roman
|
| On 3 Nov 2010, at 10:45, Christian Hoener zu Siederdissen
| wrote:
|
| > Thanks, I'll do some measurements on this with ghc7.
| >
| > Regards,
| > Christian
| >
| > On 11/02/2010 01:23 PM, Simon Marlow wrote:
| >> On 02/11/2010 08:17, Christian Höner zu Siederdissen wrote:
| >>> Hi,
| >>>
| >>> is the following problem a job for ghc or the code generation backend
| >>> (llvm)?
| >>>
| >>> We are given this program:
| >>>
| >>> {-# LANGUAGE BangPatterns #-}
| >>>
| >>> module Main where
| >>>
| >>> f :: Int -> Int -> Int -> Int -> Int
| >>> f !i !j !s !m
| >>>   | i == 0 = s+m
| >>>   | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)
| >>>
| >>> g :: Int -> Int
| >>> g !k = f k k 0 0
| >>>
| >>> ff :: Int -> Int -> Int -> Int
| >>> ff !i !s !m
| >>>   | i == 0 = s+m
| >>>   | otherwise = ff (i-1) (s + i+1) (m + i*5)
| >>>
| >>> gg :: Int -> Int
| >>> gg !k = ff k 0 0
| >>>
| >>> main = do
| >>>   print $ g 20
| >>>   print $ gg 20
| >>>
| >>> Here, 'f' and 'g' are a representation of the code I have. Both counters
| >>> 'i' and 'j' in 'f' count from the same value with the same step size and
| >>> terminate at the same time but are not reduced to just one counter. Can
| >>> I reasonably expect this to be done by the code generator?
| >>> 'ff' represents what I would like to see.
| >>
| >> GHC doesn't have any optimisations that would do this currently,
| >> although it's possible that LLVM's loop optimisations might do this on
| >> the generated code for f.
| >>
| >> Cheers,
| >> Simon
| >>
| >>> Btw. look at the core, to see that indeed 'f' keeps four arguments.
| >>> Functions like 'f' are a result of vector-fusion at work but can be
| >>> written by oneself as well. The point is that if 'f' gets reduced to
| >>> 'ff' then I can have this:
| >>>
| >>> fun k = zipWith (+) (map f1 $ mkIdxs k) (map f2 $ mkIdxs k)
| >>>
| >>> which makes for nicer code sometimes; but before rewriting I wanted to
| >>> ask if that kills performance.
| >>>
| >>> Thanks,
| >>> Christian

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users