
I spent some time looking at the code generated for LLVM and the optimizations it can apply. There were quite a few details to examine, so I wrote it up as a blog post here: http://www.dmpots.com/blog/2010/11/05/optimizing-haskell-loops-with-llvm.htm....

To summarize, I found that it is possible to get LLVM to do this transformation through a combination of tail-call elimination, inlining, induction variable optimization, and global value numbering (a source-level sketch of the end result is at the end of this message). This works fine on x86_64, where we can pass parameters in registers, but it fails to fully fire in the i386 back end, where function parameters are passed on the stack and LLVM gets caught up by aliasing problems: the possible aliasing between the stack pointer (Sp) and the function argument pointer (R1) prevents the full transformation, although LLVM was still able to reduce the f loop to straight-line code.

Exploring the details of the code generation for Haskell loops was a useful exercise. I found several sources of problems for optimizing the generated code.

1. The ability of LLVM to optimize Haskell functions is limited by the calling convention. Particularly on i386, function arguments are passed on a stack that LLVM knows nothing about. The reads and writes to that stack look like arbitrary loads and stores; LLVM has no notion of popping elements from the stack, which makes it difficult to know when it is safe to eliminate stores to the stack.

2. The possible aliasing introduced by casting integer arguments (R1-R6) to pointers limits the effectiveness of LLVM's optimizations.

3. A single Haskell source function is broken up into many small functions in the back end: every evaluation of a case statement requires a new continuation point. These small functions kill the optimization context for LLVM. LLVM can recover some of the context by inlining calls to known functions, but the effectiveness of inlining is limited because it does not know that we are passing some parameters on the stack rather than through the actual function call.

4. The order of optimizations matters. We saw that just running `-O2` on the code may not be enough to get the full optimization effects. To get the full benefits of inlining in the x86_64 back end, we had to use the heavyweight sequence `-O2 -inline -std-compile-opts`.

I am interested in exploring several different opportunities.

* Make the Cmm more friendly to LLVM by inlining and building loops in Cmm.
  I think LLVM would benefit a lot from having a larger optimization context. We could relieve some of the burden on LLVM by doing some inlining and eliminating tail calls in the Cmm itself. GHC knows that it is passing arguments on the stack, so it should be able to inline and turn tail calls into loops much better than LLVM can.

* Different calling conventions.
  All the functions in the code generated for LLVM use the same calling convention fixed by GHC. It would be interesting to see whether we could generate LLVM code that passes all the arguments a function needs as actual arguments. We could then let LLVM do its optimizations and have a later pass spill the extra arguments to the stack and make the functions use the correct GHC calling convention.

* Specialization of code after a runtime alias check.
  We could specialize the code into two cases, one where some pointers may alias and one where they do not, and let LLVM fully optimize the version with no aliases. A check inserted at runtime would decide whether there are aliases and call the appropriate version.

* Optimization order matters.
  There are probably some wins to be had by choosing a good optimization sequence for the code generated from GHC, rather than just using `-O1`, `-O2`, etc. I believe it should be possible to find an optimization sequence that works well for Haskell code.
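For concreteness, here is a rough source-level sketch (mine, not generated code) of what that combination of optimizations amounts to for the call pattern f k k 0 0 from Christian's original message: since both counters start at the same value and step in lockstep, the second one can be eliminated, leaving a single induction variable.

{-# LANGUAGE BangPatterns #-}

-- Illustrative only: a Haskell rendering of the single-counter loop that
-- tail-call elimination, inlining, induction variable optimization, and
-- GVN can recover from  f k k 0 0  on x86_64.  LLVM performs the rewrite
-- on the generated IR, not on Haskell source; the names here are mine.
fMerged :: Int -> Int -> Int -> Int
fMerged !i !s !m
  | i == 0    = s + m
  | otherwise = fMerged (i - 1) (s + i + 1) (m + i * 5)

gMerged :: Int -> Int
gMerged !k = fMerged k 0 0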
-David

On Nov 4, 2010, at 5:29 AM, Christian Hoener zu Siederdissen wrote:
Here it is, feel free to change: http://hackage.haskell.org/trac/ghc/ticket/4470
I have added the core for the sub-optimal function 'f'. Criterion benchmarks are there, too. It doesn't make much of a difference for this case -- I'd guess because everything fits into registers here, anyway.
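To give an idea of the shape of the comparison, the harness looks roughly like this (only a rough sketch, not the exact code on the ticket; the input size and labels are arbitrary):

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Criterion.Main (bench, defaultMain, whnf)

-- f and ff repeated from the quoted message below so the sketch is
-- self-contained: the two-counter loop vs. the hand-merged one.
f :: Int -> Int -> Int -> Int -> Int
f !i !j !s !m
  | i == 0    = s + m
  | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)

ff :: Int -> Int -> Int -> Int
ff !i !s !m
  | i == 0    = s + m
  | otherwise = ff (i-1) (s + i+1) (m + i*5)

main :: IO ()
main = defaultMain
  [ bench "two counters (f)" $ whnf (\k -> f k k 0 0) 1000000
  , bench "one counter (ff)" $ whnf (\k -> ff k 0 0)  1000000
  ]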
Regards,
Christian
On 11/04/2010 09:42 AM, Simon Peyton-Jones wrote:
Interesting. What would it look like in Core? Anyone care to make a ticket?
S
| -----Original Message-----
| From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-bounces@haskell.org] On Behalf Of Roman Leshchinskiy
| Sent: 03 November 2010 10:55
| To: Christian Hoener zu Siederdissen
| Cc: glasgow-haskell-users@haskell.org
| Subject: Re: Loop optimisation with identical counters
|
| LLVM doesn't eliminate the counters. FWIW, fixing this would improve performance of
| stream fusion code quite a bit. It's very easy to do in Core.
|
| Roman
|
| On 3 Nov 2010, at 10:45, Christian Hoener zu Siederdissen
| wrote:
|
| > Thanks, I'll do some measurements on this with ghc7.
| >
| > Regards,
| > Christian
| >
| > On 11/02/2010 01:23 PM, Simon Marlow wrote:
| >> On 02/11/2010 08:17, Christian Höner zu Siederdissen wrote:
| >>> Hi,
| >>>
| >>> is the following problem a job for ghc or the code generation backend
| >>> (llvm)?
| >>>
| >>> We are given this program:
| >>>
| >>> {-# LANGUAGE BangPatterns #-}
| >>>
| >>> module Main where
| >>>
| >>> f :: Int -> Int -> Int -> Int -> Int
| >>> f !i !j !s !m
| >>>   | i == 0 = s+m
| >>>   | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)
| >>>
| >>> g :: Int -> Int
| >>> g !k = f k k 0 0
| >>>
| >>> ff :: Int -> Int -> Int -> Int
| >>> ff !i !s !m
| >>>   | i == 0 = s+m
| >>>   | otherwise = ff (i-1) (s + i+1) (m + i*5)
| >>>
| >>> gg :: Int -> Int
| >>> gg !k = ff k 0 0
| >>>
| >>> main = do
| >>>   print $ g 20
| >>>   print $ gg 20
| >>>
| >>> Here, 'f' and 'g' are a representation of the code I have. Both counters
| >>> 'i' and 'j' in 'f' count from the same value with the same step size and
| >>> terminate at the same time but are not reduced to just one counter. Can
| >>> I reasonably expect this to be done by the code generator?
| >>> 'ff' represents what I would like to see.
| >>
| >> GHC doesn't have any optimisations that would do this currently,
| >> although it's possible that LLVM's loop optimisations might do this on
| >> the generated code for f.
| >>
| >> Cheers,
| >> Simon
| >>
| >>> Btw. look at the core, to see that indeed 'f' keeps four arguments.
| >>> Functions like 'f' are a result of vector-fusion at work but can be
| >>> written by oneself as well. The point is that if 'f' gets reduced to
| >>> 'ff' then I can have this:
| >>>
| >>> fun k = zipWith (+) (map f1 $ mkIdxs k) (map f2 $ mkIdxs k)
| >>>
| >>> which makes for nicer code sometimes; but before rewriting I wanted to
| >>> ask if that kills performance.
| >>>
| >>> Thanks,
| >>> Christian

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users