
Hi,

is the following problem a job for GHC or the code generation backend (LLVM)? We are given this program:

{-# LANGUAGE BangPatterns #-}

module Main where

f :: Int -> Int -> Int -> Int -> Int
f !i !j !s !m
  | i == 0    = s+m
  | otherwise = f (i-1) (j-1) (s + i+1) (m + j*5)

g :: Int -> Int
g !k = f k k 0 0

ff :: Int -> Int -> Int -> Int
ff !i !s !m
  | i == 0    = s+m
  | otherwise = ff (i-1) (s + i+1) (m + i*5)

gg :: Int -> Int
gg !k = ff k 0 0

main = do
  print $ g 20
  print $ gg 20

Here, 'f' and 'g' are representative of the code I have. Both counters 'i' and 'j' in 'f' count from the same value with the same step size and terminate at the same time, but are not reduced to just one counter. Can I reasonably expect this to be done by the code generator? 'ff' represents what I would like to see.

Btw. look at the Core to see that 'f' indeed keeps four arguments. Functions like 'f' are a result of vector-fusion at work, but can be written by hand as well. The point is that if 'f' gets reduced to 'ff', then I can have this:

fun k = zipWith (+) (map f1 $ mkIdxs k) (map f2 $ mkIdxs k)

which makes for nicer code sometimes; but before rewriting I wanted to ask whether that kills performance.

Thanks,
Christian
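For illustration, here is a minimal hand-fused sketch of how a two-counter loop of exactly this shape can arise from the zipWith/map pipeline mentioned at the end of the message (the names fuse, f1 and f2 are hypothetical, and this is not the output of any particular fusion framework):

{-# LANGUAGE BangPatterns #-}

-- Hand-fused form of: zipWith (+) (map f1 idxs) (map f2 idxs), where idxs
-- counts down from k to 1.  Both counters start at k and step by one, which
-- is exactly the shape of 'f' above.
fuse :: (Int -> Int) -> (Int -> Int) -> Int -> Int
fuse f1 f2 k = go k k 0
  where
    go !i !j !acc
      | i == 0    = acc
      | otherwise = go (i-1) (j-1) (acc + f1 i + f2 j)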

On 02/11/2010 08:17, Christian Höner zu Siederdissen wrote:
Here, 'f' and 'g' are a representation of the code I have. Both counters 'i' and 'j' in 'f' count from the same value with the same step size and terminate at the same time but are not reduced to just one counter. Can I reasonably expect this to be done by the code generator? 'ff' represents what I would like to see.
GHC doesn't have any optimisations that would do this currently, although it's possible that LLVM's loop optimisations might do this on the generated code for f.

Cheers,
Simon

Thanks, I'll do some measurements on this with ghc7.

Gruss,
Christian

On 11/02/2010 01:23 PM, Simon Marlow wrote:
GHC doesn't have any optimisations that would do this currently, although it's possible that LLVM's loop optimisations might do this on the generated code for f.
Cheers, Simon


LLVM doesn't eliminate the counters. FWIW, fixing this would improve performance of stream fusion code quite a bit. It's very easy to do in Core.
Roman
On 3 Nov 2010, at 10:45, Christian Hoener zu Siederdissen wrote:
Thanks, I'll do some measurements on this with ghc7.
Gruss, Christian

Is there a ticket for this (didn't find one)? Or should there be? For some reason, I'd like to see this in ghc ;-)

Gruss,
Christian

On 11/03/2010 11:54 AM, Roman Leshchinskiy wrote:
LLVM doesn't eliminate the counters. FWIW, fixing this would improve performance of stream fusion code quite a bit. It's very easy to do in Core.
Roman

Interesting. What would it look like in Core? Anyone care to make a ticket?
S
| -----Original Message-----
| From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-
| bounces@haskell.org] On Behalf Of Roman Leshchinskiy
| Sent: 03 November 2010 10:55
| To: Christian Hoener zu Siederdissen
| Cc: glasgow-haskell-users@haskell.org
| Subject: Re: Loop optimisation with identical counters
|
| LLVM doesn't eliminate the counters. FWIW, fixing this would improve performance of
| stream fusion code quite a bit. It's very easy to do in Core.
|
| Roman
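As a hedged sketch of what the check behind such a transformation might look like (written as plain Haskell over a made-up expression type rather than GHC's actual Core; every name below is invented for illustration, and this is not a GHC implementation): two parameters of a self-recursive function can be merged when their initial arguments coincide and the arguments they receive in the recursive call are identical up to renaming one parameter to the other.

data Expr
  = Var String
  | Lit Int
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- One entry per loop parameter: (name, initial argument, argument passed
-- in the single tail-recursive call).
type Param = (String, Expr, Expr)

-- Parameters p and q can be merged when they start from the same argument
-- and receive the same update expression, up to renaming q to p.
mergeablePairs :: [Param] -> [(String, String)]
mergeablePairs params =
  [ (p, q)
  | (p, initP, recP) <- params
  , (q, initQ, recQ) <- params
  , p < q                       -- consider each unordered pair once
  , initP == initQ              -- e.g. both start at k
  , rename q p recQ == recP     -- e.g. both are "previous value minus 1"
  ]
  where
    rename old new e = case e of
      Var v | v == old -> Var new
      Add a b          -> Add (rename old new a) (rename old new b)
      Sub a b          -> Sub (rename old new a) (rename old new b)
      Mul a b          -> Mul (rename old new a) (rename old new b)
      _                -> e

-- Applied to 'f' from the start of the thread: i and j both start at k and
-- are both decremented by one, so the pair ("i","j") is reported and j can
-- be replaced by i, giving 'ff'.
exampleF :: [(String, String)]
exampleF = mergeablePairs
  [ ("i", Var "k", Sub (Var "i") (Lit 1))
  , ("j", Var "k", Sub (Var "j") (Lit 1))
  , ("s", Lit 0,   Add (Add (Var "s") (Var "i")) (Lit 1))
  , ("m", Lit 0,   Add (Var "m") (Mul (Var "j") (Lit 5)))
  ]

A real pass would also have to rewrite occurrences of the dropped parameter and repeat the check until nothing more merges; the comparison above is only the core idea.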

Here it is, feel free to change: http://hackage.haskell.org/trac/ghc/ticket/4470

I have added the core for the sub-optimal function 'f'. Criterion benchmarks are there, too. It doesn't make much of a difference for this case -- I'd guess because everything fits into registers here, anyway.

Gruss,
Christian

On 11/04/2010 09:42 AM, Simon Peyton-Jones wrote:
Interesting. What would it look like in Core? Anyone care to make a ticket?
S

I spent some time looking at the code generated for llvm and the optimizations it can apply. There were quite a few details to examine and I wrote it up as a blog post here: http://www.dmpots.com/blog/2010/11/05/optimizing-haskell-loops-with-llvm.htm....

To summarize, I found that it is possible to get LLVM to do this transformation through a combination of tail-call elimination, inlining, induction variable optimization, and global value numbering. This works fine on x86_64, where we can pass parameters in registers, but fails to fully fire in the i386 back end because LLVM gets caught up by aliasing problems, since function parameters are passed on the stack. The possible aliasing between the stack pointer (Sp) and the function argument pointer (R1) prevented the full transformation, but it was still able to reduce the f loop to straight-line code.

Exploring the details of the code generation for Haskell loops was a useful exercise. I found several sources of problems for optimizing the generated code.

1. The ability of LLVM to optimize Haskell functions is limited by the calling convention. Particularly for i386, function arguments are passed on a stack that LLVM knows nothing about. The reads and writes to the stack look like arbitrary loads and stores. It has no notion of popping elements from the stack, which makes it difficult to know when it is ok to eliminate stores to the stack.

2. The possible aliasing introduced by casting integer arguments (R1-R6) to pointers limits the effectiveness of its optimizations.

3. A single Haskell source function is broken up into many small functions in the back end. Every evaluation of a case statement requires a new continuation point. These small functions kill the optimization context for LLVM. LLVM can recover some of the context by inlining calls to known functions, but the effectiveness of inlining is limited since it does not know that we are passing some parameters on the stack and not through the actual function call.

4. The order of optimizations matters. We saw that just running `-O2` on the code may not be enough to get the full optimization effects. To get the full benefits of inlining in the x86_64 backend, we had to use the heavyweight sequence `-O2 -inline -std-compile-opts`.

I am interested in exploring several different opportunities.

* Make the cmm more friendly to LLVM by inlining and making loops in cmm

I think LLVM would benefit a lot from having a larger optimization context. We could relieve some of the burden on LLVM by doing some inlining and eliminating tail calls in the cmm itself. GHC knows that it is passing arguments on the stack, so it should be able to inline and turn tail calls into loops much better than LLVM can.

* Different calling conventions

All the functions in the code generated for LLVM use the same calling convention fixed by GHC. It would be interesting to see if we could generate LLVM code where we pass all the arguments a function needs as actual arguments. We can then let LLVM do its optimizations and then have a later pass that spills extra arguments to the stack and makes our functions use the correct GHC calling convention.

* Specialization of code after a runtime alias check

We could specialize the code into two cases, one where some pointers may alias and one where they do not. We can then let LLVM fully optimize the code with no aliases. We would insert a check at runtime to see if there are aliases and then call the correct bit of code.

* Optimization order matters

Probably there are some wins to be had by choosing a good optimization sequence for the code generated from GHC, rather than just using `-O1`, `-O2`, etc. I believe it should be possible to find a good optimization sequence that would work well for Haskell code.

-David

On Nov 4, 2010, at 5:29 AM, Christian Hoener zu Siederdissen wrote:
Here it is, feel free to change: http://hackage.haskell.org/trac/ghc/ticket/4470
I have added the core for the sub-optimal function 'f'. Criterion benchmarks are there, too. It doesn't make much of a difference for this case -- I'd guess because everything fits into registers here, anyway.
Gruss, Christian

On 05/11/2010, at 23:22, David Peixotto wrote:
I spent some time looking at the code generated for llvm and the optimizations it can apply. There were quite a bit of details to examine and I wrote it up as blog post here: http://www.dmpots.com/blog/2010/11/05/optimizing-haskell-loops-with-llvm.htm....
Nice! Thanks a lot for doing that!
To summarize, I found that it is possible to get LLVM to do this transformation through a combination of tail-call elimination, inlining, induction variable optimization, and global value numbering. This works fine on x86_64 where we can pass parameters in registers, but fails to fully fire in i386 back end because LLVM gets caught up by aliasing problems because function parameters are passed on the stack. The possible aliasing between the stack pointer (Sp) and the function argument pointer (R1) prevented the full transformation, but it was still able to reduce the f loop to straight line code.
Hmm... IIRC we agreed that Sp is never aliased in GHC-generated code and David Terei (I'm cc'ing here, not sure if he reads the list) made sure to include appropriate annotations in Haskell code. In fact, in your post Sp is passed as i32* noalias nocapture %Sp_Arg. Isn't that enough for LLVM to know that Sp isn't aliased?
1. The ability of LLVM to optimize Haskell functions is limited by the calling convention. Particularly for i386, function arguments are passed on a stack that LLVM knows nothing about. The reads and writes to the stack look like arbitrary loads and stores. It has no notion of popping elements from the stack which makes it difficult to know when it is ok to eliminate stores to the stack.
But shouldn't it just promote stack locations to registers?
2. The possible aliasing introduced by casting integer arguments (R1-R6) to pointers limits the effectiveness of its optimizations.
Yes, that's a big problem. David tried to solve some of it by including noalias annotations but it's not clear what to do with, say, newly allocated ByteArrays which can't be aliased by anything.

Anyway, it's great to know that there are things we can improve to make LLVM optimise better.

Roman

Hi Roman,

On Nov 5, 2010, at 6:44 PM, Roman Leshchinskiy wrote:
On 05/11/2010, at 23:22, David Peixotto wrote:
I spent some time looking at the code generated for llvm and the optimizations it can apply. There were quite a bit of details to examine and I wrote it up as blog post here: http://www.dmpots.com/blog/2010/11/05/optimizing-haskell-loops-with-llvm.htm....
Nice! Thanks a lot for doing that!

My pleasure :)
To summarize, I found that it is possible to get LLVM to do this transformation through a combination of tail-call elimination, inlining, induction variable optimization, and global value numbering. This works fine on x86_64 where we can pass parameters in registers, but fails to fully fire in i386 back end because LLVM gets caught up by aliasing problems because function parameters are passed on the stack. The possible aliasing between the stack pointer (Sp) and the function argument pointer (R1) prevented the full transformation, but it was still able to reduce the f loop to straight line code.
Hmm... IIRC we agreed that Sp is never aliased in GHC-generated code and David Terei (I'm cc'ing here, not sure if he reads the list) made sure to include appropriate annotations in Haskell code. In fact, in your post Sp is passed as i32* noalias nocapture %Sp_Arg. Isn't that enough for LLVM to know that Sp isn't aliased?
Yes, the LLVM code has Sp, Hp, Base all annotated as noalias. I believe that Sp, Hp, and Base should never alias, but a (boxed) R1 should always alias with either Sp or Hp. I had a hard time determining exactly how LLVM uses the noalias annotation, but playing with opt -print-alias-sets I saw that Sp was a MayAlias with the pointers derived from R1. I would guess that casting an int to a pointer (like we do for R1) makes that pointer MayAlias with everything regardless of the noalias annotation.
1. The ability of LLVM to optimize Haskell functions is limited by the calling convention. Particularly for i386, function arguments are passed on a stack that LLVM knows nothing about. The reads and writes to the stack look like arbitrary loads and stores. It has no notion of popping elements from the stack which makes it difficult to know when it is ok to eliminate stores to the stack.
But shouldn't it just promote stack locations to registers?
Yes, LLVM can and will promote the stack locations to registers, but since it doesn't know that Sp is really a stack, it is difficult for it to tell when it can avoid the writes back to the stack even though *we* know they will not be visible once the function call returns.
2. The possible aliasing introduced by casting integer arguments (R1-R6) to pointers limits the effectiveness of its optimizations.
Yes, that's a big problem. David tried to solve some of it by including noalias annotations but it's not clear what to do with, say, newly allocated ByteArrays which can't be aliased by anything.
It may be profitable to write our own alias analysis pass for LLVM that encodes our knowledge of what can alias in the GHC world view. It wouldn't be useful for other LLVM clients, but might be a good option for us.
Anyway, it's great to know that there are things we can improve to make LLVM optimise better.
Yeah, I'm generally very impressed with what LLVM is able to do with the code from GHC. Any help we can give it will just make it that much better!

On 06/11/2010, at 00:28, David Peixotto wrote:
Yes, the LLVM code has Sp, Hp, Base all annotated as noalias. I believe that Sp, Hp, and Base should never alias, but a (boxed) R1 should always alias with either Sp or Hp. I had a hard time determining exactly how LLVM uses the noalias annotation, but playing with opt -print-alias-sets I saw that Sp was a MayAlias with the pointers derived from R1. I would guess that casting an int to a pointer (like we do for R1) makes that pointer MayAlias with everything regardless of the noalias annotation.
Are you sure about R1 aliasing Sp? AFAIK, R1 points to a closure on the heap, not to a stack location. That is, it can alias pointers on the stack or Hp but it can't alias the Sp itself. I don't think Sp can be aliased by anything outside of the garbage collector. Perhaps we shouldn't mark Hp as noalias, though.
But shouldn't it just promote stack locations to registers?
Yes, LLVM can and will promote the stack locations to registers, but since it doesn't know that Sp is really a stack, it is difficult for it to tell when it can avoid the writes back to the stack even though *we* know they will not be visible once the function call returns.
Right, I meant GHC stack locations. Let me rephrase my question: shouldn't it just promote array locations to registers?
It may be profitable to write our own alias analysis pass for LLVM that encodes our knowledge of what can alias in the GHC world view. It wouldn't be useful for other LLVM clients, but might be a good option for us.
Actually, I think our aliasing properties should be fairly close to those of, say, Java. I wonder how LLVM deals with those.
Yeah, I'm generally very impressed with what LLVM is able to do with the code from GHC. Any help we can give it will just make it that much better!
I have to say I'm slightly disappointed with what LLVM does with tight loops generated by GHC. That's not necessarily LLVM's fault; you are quite right that we should probably give it more information.

Roman

On Nov 5, 2010, at 7:55 PM, Roman Leshchinskiy wrote:
Are you sure about R1 aliasing Sp? AFAIK, R1 points to a closure on the heap, not to a stack location. That is, it can alias pointers on the stack or Hp but it can't alias the Sp itself. I don't think Sp can be aliased by anything outside of the garbage collector.
Perhaps we shouldn't mark Hp as noalias, though.
Well, I'm not sure about R1 aliasing with Sp. I thought that there could be some cases where closures are allocated on the stack, but I could be wrong. I think the stack should still be reachable by the garbage collector though. Can someone more familiar with GHC internals say whether R1 could point to the stack as well as the heap?
But shouldn't it just promote stack locations to registers?
Yes, LLVM can and will promote the stack locations to registers, but since it doesn't know that Sp is really a stack, it is difficult for it to tell when it can avoid the writes back to the stack even though *we* know they will not be visible once the function call returns.
Right, I meant GHC stack locations. Let me rephrase my question: shouldn't it just promote array locations to registers?
Yes, it should promote array locations to (virtual) registers. I was mentioning the stack because I was thinking of something like this:

x = Sp[0]
x = x + 1
Sp[0] = x
Sp = Sp - 4
return x

where x is a stack-allocated parameter. LLVM has no way to know that the write back to the stack (Sp[0] = x) is redundant, because it sees Sp as an arbitrary pointer. We know that write is redundant because the stack space is deallocated before returning x.
It may be profitable to write our own alias analysis pass for LLVM that encodes our knowledge of what can alias in the GHC world view. It wouldn't be useful for other LLVM clients, but might be a good option for us.
Actually, I think our aliasing properties should be fairly close to those of, say, Java. I wonder how LLVM deals with those.
That's a good question. I don't think LLVM supports type-based alias analysis, which is what makes it much easier to disambiguate pointers in the Java world. Perhaps type information could help the GHC back end with alias analysis as well.
Yeah, I'm generally very impressed with what LLVM is able to do with the code from GHC. Any help we can give it will just make it that much better!
I have to say I'm slightly disappointed with what LLVM does with tight loops generated by GHC. That's not necessarily LLVM's fault, you are quite right that we should probably give it more information.
Yes, the more that Haskell loops look like the kind of loops LLVM is accustomed to seeing, the better it should be at optimizing them.

-David

On 6 November 2010 04:47, David Peixotto wrote:
Are you sure about R1 aliasing Sp? AFAIK, R1 points to a closure on the heap, not to a stack location. That is, it can alias pointers on the stack or Hp but it can't alias the Sp itself. I don't think Sp can be aliased by anything outside of the garbage collector.
Perhaps we shouldn't mark Hp as noalias, though.
Well, I'm not sure about R1 aliasing with Sp. I thought that there could be some cases where closures are allocated on the stack, but I could be wrong. I think the stack should still be reachable by the garbage collector though. Can someone more familiar with GHC internals say whether R1 could point to the stack as well as the heap?
GHC marks some closures as "let-no-escapes", which means that they get stack allocated. So AFAIK R1 may alias pointers based on SP.

Cheers,
Max

On 08/11/2010 15:44, Max Bolingbroke wrote:
On 6 November 2010 04:47, David Peixotto wrote:
Are you sure about R1 aliasing Sp? AFAIK, R1 points to a closure on the heap, not to a stack location. That is, it can alias pointers on the stack or Hp but it can't alias the Sp itself. I don't think Sp can be aliased by anything outside of the garbage collector.
Perhaps we shouldn't mark Hp as noalias, though.
Well, I'm not sure about R1 aliasing with Sp. I thought that there could be some cases where closures are allocated on the stack, but I could be wrong. I think the stack should still be reachable by the garbage collector though. Can someone more familiar with GHC internals say whether R1 could point to the stack as well as the heap?
GHC marks some closures as "let-no-escapes" which means that they get stack allocated. So AFAIK R1 may alias pointers based on SP.
There isn't really a "stack allocated closure" for a let-no-escape; we just explain it like that because it is notionally what is going on. The implementation actually treats them more like case continuations where the free variables have been saved on the stack. R1 is used as an argument register for let-no-escapes, in fact. So I think the original conclusion was right - R1 doesn't alias with the stack.

Cheers,
Simon
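As a hedged aside (the function below is an invented example, not from the thread): a local binding typically becomes a let-no-escape when it is only ever called saturated and in tail position, so it can be compiled as a jump with its free variables on the stack rather than as a heap-allocated closure. For instance:

{-# LANGUAGE BangPatterns #-}

-- 'go' is called saturated and only in tail position, so GHC can compile it
-- as a let-no-escape (a join point in later GHCs): a labelled block that is
-- jumped to, with 'limit' living on the stack, rather than a closure
-- allocated on the heap.
sumTo :: Int -> Int
sumTo limit = go 0 0
  where
    go !i !acc
      | i > limit = acc
      | otherwise = go (i + 1) (acc + i)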

On 11/5/10 19:22, David Peixotto wrote:
Probably there are some wins to be had by choosing a good optimization sequence for the code generated from GHC, rather than just using `-O1`, `-O2`, etc. I believe It should be possible to find a good optimization sequence that would work well for Haskell codes.
Didn't someone (dons?) already make a start on this?

--
brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

On Nov 5, 2010, at 8:56 PM, Brandon S Allbery KF8NH wrote:
On 11/5/10 19:22 , David Peixotto wrote:
Probably there are some wins to be had by choosing a good optimization sequence for the code generated from GHC, rather than just using `-O1`, `-O2`, etc. I believe It should be possible to find a good optimization sequence that would work well for Haskell codes.
Didn't someone (dons?) already make a start on this?
Searching for good optimization sequences is certainly not a new idea. Most work that I know of focuses on finding good sequences for a particular program. I think an interesting opportunity here would be to search for good sequences that generally work well for Haskell programs, to replace the standard -O1, -O2 used by LLVM. I would think the code generated from GHC is different enough that we should be able to find standard sequences that we could use to replace the ones currently used by LLVM.

On Nov 6, 2010, at 8:22 AM, David Peixotto wrote:
To summarize, I found that it is possible to get LLVM to do this transformation through a combination of tail-call elimination, inlining, induction variable optimization, and global value numbering.
Interesting. This approach requires `f` to be inlined into its call site in order to eliminate the redundant argument. This is different from the proposal to provide a specialized version of `f` (where the arguments are combined) which could be reused at different call sites.

How many call sites with identical arguments are there in the generated code that triggered this discussion and in the stream fusion code that would benefit from this optimization?

Sebastian

On 06/11/2010, at 02:27, Sebastian Fischer wrote:
Interesting. This approach requires `f` to be inlined into its call site in order to eliminate the redundant argument. This is different from the proposal to provide a specialized version of `f` (where the arguments are combined) which could be reused at different call sites.
Which proposal do you mean? I'm not sure something like that is feasible without knowing the call sites. You have to know which arguments to combine.
How many call sites with identical arguments are there in the generated code that triggered this discussion and in the stream fusion code that would benefit from this optimization?
In stream fusion code, there is normally exactly one call site. I suspect that Christian's example has also been derived from stream fusion code.

Roman

Which proposal do you mean?
I referred to Christian's question whether it is possible to generate `ff` and `gg` from `f` and `g`. If `h` is similar to `g`, then `hh` could reuse `ff`, while with an inlining approach something like `ff` would be duplicated.
I'm not sure something like that is feasible without knowing the call sites.
I agree. One would need to generate variants for different call sites and reuse variants for similar call sites.

Sebastian
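To illustrate the reuse point with code (the second wrapper 'hh' and its starting value are hypothetical, not from the thread): once the two counters have been merged into 'ff', any caller with the same shape can share that single worker instead of ending up with its own inlined copy of the loop.

{-# LANGUAGE BangPatterns #-}

-- 'ff' and 'gg' are the definitions from the original program; 'hh' is a
-- hypothetical second caller that reuses the merged worker.
ff :: Int -> Int -> Int -> Int
ff !i !s !m
  | i == 0    = s+m
  | otherwise = ff (i-1) (s + i+1) (m + i*5)

gg, hh :: Int -> Int
gg !k = ff k 0 0
hh !k = ff (2*k) 0 0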

On 05/11/2010 23:22, David Peixotto wrote:
1. The ability of LLVM to optimize Haskell functions is limited by the calling convention. Particularly for i386, function arguments are passed on a stack that LLVM knows nothing about. The reads and writes to the stack look like arbitrary loads and stores. It has no notion of popping elements from the stack which makes it difficult to know when it is ok to eliminate stores to the stack.
Our longish-term plan is to make the x86 backend pass arguments in registers. We've been limited up to now by having to compile via C, where if you want to use specific registers they have to be reserved globally, but once that limitation is lifted we can start to use more registers for argument passing in the x86 backend. A few other things have to be in place for that to happen though, and I think we'll only really be able to do this in the "new backend", which is still at the prototype stage.
2. The possible aliasing introduced by casting integer arguments (R1-R6) to pointers limits the effectiveness of its optimizations.
3. A single Haskell source function is broken up into many small functions in the back end. Every evaluation of a case statement requires a new continuation point. These small functions kill the optimization context for LLVM. LLVM can recover some of the context by inlining calls to known functions, but the effectiveness of inlining is limited since it does not know that we are passing some parameters on the stack and not through the actual function call.
Again, this is something I think we'll be able to improve. Some of the splitting into small functions is due to the limitations of the C backend.
* Different calling conventions
All the functions in the code generated for LLVM use the same calling convention fixed by GHC. It would be interesting to see if we could generate LLVM code where we pass all the arguments a function needs as actual arguments. We can then let LLVM do its optimizations and then have a later pass that spills extra arguments to the stack and makes our functions use the correct GHC calling convention.
For functions local to a module it would certainly be possible to use different calling conventions, although there are restrictions on the use of the C stack (basically the C stack pointer cannot move while we're in Haskell code), so the C stack can't be used for argument passing. I don't think you'd want to do that anyway.

Cheers,
Simon
participants (9):

- Brandon S Allbery KF8NH
- Christian Hoener zu Siederdissen
- Christian Höner zu Siederdissen
- David Peixotto
- Max Bolingbroke
- Roman Leshchinskiy
- Sebastian Fischer
- Simon Marlow
- Simon Peyton-Jones