Optimizations for list comprehension

Hello Cafe, While investigating a performance problem I stumbled upon what I eventually reduced to the example below: module Main where import Data.Time.Clock outside :: Int -> Int outside n = sum [i + j | i <- range, j <- range] where range = [0..n-1] inside :: Int -> Int inside n = sum [i + j | i <- [0..n-1], j <- [0..n-1]] main :: IO () main = do t0 <- getCurrentTime print $ inside 10000 t1 <- getCurrentTime print $ outside 10000 t2 <-getCurrentTime print (diffUTCTime t1 t0) print (diffUTCTime t2 t1) Compiling with -O2, up to GHC 8.2.2, both `inside` and `outside` functions would take the same amount of time to execute. Somewhere between GHC 8.2.2 and 8.6.4 something changed (possibly some new optimization) making `inside` run ~4x faster on my machine. With LLVM the difference is even bigger. It is not that `outside` got slower, but that `inside` got much faster. I'm curious to what optimizations might be happening to the `inside` function that would not fire on the outside function. Any hints? Best regards, Emilio

I'm not sure which exact optimizations are responsible, but based on
--ddump-simple,
* "inside" is not allocating any lists at all. It's just a couple loops
over unboxed ints
* "outside" is actually allocating a (single) list data structure and has
an inner loop and an outer loop, both of which traverse the list
GHC seems to be too aggressive about sharing "range" in "outside". Adding a
unit argument to "range" makes both functions go fast.
On Mon, Aug 19, 2019 at 8:14 PM Emilio Francesquini
Hello Cafe,
While investigating a performance problem I stumbled upon what I eventually reduced to the example below:
module Main where
import Data.Time.Clock
outside :: Int -> Int outside n = sum [i + j | i <- range, j <- range] where range = [0..n-1]
inside :: Int -> Int inside n = sum [i + j | i <- [0..n-1], j <- [0..n-1]]
main :: IO () main = do t0 <- getCurrentTime print $ inside 10000 t1 <- getCurrentTime print $ outside 10000 t2 <-getCurrentTime
print (diffUTCTime t1 t0) print (diffUTCTime t2 t1)
Compiling with -O2, up to GHC 8.2.2, both `inside` and `outside` functions would take the same amount of time to execute. Somewhere between GHC 8.2.2 and 8.6.4 something changed (possibly some new optimization) making `inside` run ~4x faster on my machine. With LLVM the difference is even bigger.
It is not that `outside` got slower, but that `inside` got much faster. I'm curious to what optimizations might be happening to the `inside` function that would not fire on the outside function.
Any hints?
Best regards,
Emilio
_______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.

Thanks William!
For me it's quite unexpected to see a unit argument making that kind of
difference for the optimizer...
Anyway, one new trick added to the bag...
Tks.
On Mon, Aug 19, 2019 at 11:34 AM William Yager
I'm not sure which exact optimizations are responsible, but based on --ddump-simple,
* "inside" is not allocating any lists at all. It's just a couple loops over unboxed ints * "outside" is actually allocating a (single) list data structure and has an inner loop and an outer loop, both of which traverse the list
GHC seems to be too aggressive about sharing "range" in "outside". Adding a unit argument to "range" makes both functions go fast.
On Mon, Aug 19, 2019 at 8:14 PM Emilio Francesquini < francesquini@gmail.com> wrote:
Hello Cafe,
While investigating a performance problem I stumbled upon what I eventually reduced to the example below:
module Main where
import Data.Time.Clock
outside :: Int -> Int outside n = sum [i + j | i <- range, j <- range] where range = [0..n-1]
inside :: Int -> Int inside n = sum [i + j | i <- [0..n-1], j <- [0..n-1]]
main :: IO () main = do t0 <- getCurrentTime print $ inside 10000 t1 <- getCurrentTime print $ outside 10000 t2 <-getCurrentTime
print (diffUTCTime t1 t0) print (diffUTCTime t2 t1)
Compiling with -O2, up to GHC 8.2.2, both `inside` and `outside` functions would take the same amount of time to execute. Somewhere between GHC 8.2.2 and 8.6.4 something changed (possibly some new optimization) making `inside` run ~4x faster on my machine. With LLVM the difference is even bigger.
It is not that `outside` got slower, but that `inside` got much faster. I'm curious to what optimizations might be happening to the `inside` function that would not fire on the outside function.
Any hints?
Best regards,
Emilio
_______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.

Please file a GHC issue for this, to avoid that Haskell gets slower over time :) On 20/08/2019 21:12, Emilio Francesquini wrote:
For me it's quite unexpected to see a unit argument making that kind of difference for the optimizer...
Anyway, one new trick added to the bag...

Hi, On 20/08/2019 20:12, Emilio Francesquini wrote:
Thanks William!
For me it's quite unexpected to see a unit argument making that kind of difference for the optimizer...
Anyway, one new trick added to the bag...
Tks.
On Mon, Aug 19, 2019 at 11:34 AM William Yager
mailto:will.yager@gmail.com> wrote: I'm not sure which exact optimizations are responsible, but based on --ddump-simple,
* "inside" is not allocating any lists at all. It's just a couple loops over unboxed ints * "outside" is actually allocating a (single) list data structure and has an inner loop and an outer loop, both of which traverse the list
GHC seems to be too aggressive about sharing "range" in "outside". Adding a unit argument to "range" makes both functions go fast.
In "outside", "range" is a single (monomorphic) CAF binding, so I would expect it to be shared aggressively. Even if you do something like "inside" with explicit naming: inside2 :: Int -> Int inside2 n = sum [i + j | i <- range1, j <- range2] where range1 = [0..n-1] range2 = [0..n-1] then "range2" is shared between all iterations over "range1", which I expect to be allocated and traversed. I haven't tested to be sure. Historically I have used crass hacks to prevent sharing by introducing data dependencies: inside3 :: Int -> Int inside3 n = sum [i + j | i <- [0..n-1], j <- [i+0-i..n-1]] That "inside" can be optimized to avoid allocations of the inner loop is a pleasant surprise, I hope it doesn't lead to unpleasant surprises in other circumstances.
On Mon, Aug 19, 2019 at 8:14 PM Emilio Francesquini
mailto:francesquini@gmail.com> wrote: Hello Cafe,
While investigating a performance problem I stumbled upon what I eventually reduced to the example below:
module Main where
import Data.Time.Clock
outside :: Int -> Int outside n = sum [i + j | i <- range, j <- range] where range = [0..n-1]
inside :: Int -> Int inside n = sum [i + j | i <- [0..n-1], j <- [0..n-1]]
main :: IO () main = do t0 <- getCurrentTime print $ inside 10000 t1 <- getCurrentTime print $ outside 10000 t2 <-getCurrentTime
print (diffUTCTime t1 t0) print (diffUTCTime t2 t1)
Compiling with -O2, up to GHC 8.2.2, both `inside` and `outside` functions would take the same amount of time to execute. Somewhere between GHC 8.2.2 and 8.6.4 something changed (possibly some new optimization) making `inside` run ~4x faster on my machine. With LLVM the difference is even bigger.
It is not that `outside` got slower, but that `inside` got much faster. I'm curious to what optimizations might be happening to the `inside` function that would not fire on the outside function.
Any hints?
Claude -- https://mathr.co.uk

I think what is happening is that
* In “inside” there are two distinct lists [0..n-1], each used once; they are fused with their consumers.
* In “outside” there is one list [0..n-1], which is used twice. GHC is paranoid about duplicating work (and would be right to do so if producing the list was expensive), so it does not fuse the list with its two consumers.
In this case GHC’s paranoia is not justified. It’d be better to duplicate the production of [0..n-1] so that it can fuse with its consumers.
One way to address this problem might be “cheapBuild”. There’s a ticket about this: https://gitlab.haskell.org/ghc/ghc/issues/7206. I see that a year ago the TL;DR was “Conclusion: let's do it. Ensuring that a Note gives a clear explanation, and points to this ticket.” But no one yet has.
Maybe someone might see if the cheapBuild idea really does solve this particular case.
Simon
From: Haskell-Cafe
participants (5)
-
Claude Heiland-Allen
-
Emilio Francesquini
-
Niklas Hambüchen
-
Simon Peyton Jones
-
William Yager