
Hi, On 20/08/2019 20:12, Emilio Francesquini wrote:
Thanks William!
For me it's quite unexpected to see a unit argument making that kind of difference for the optimizer...
Anyway, one new trick added to the bag...
Tks.
On Mon, Aug 19, 2019 at 11:34 AM William Yager
mailto:will.yager@gmail.com> wrote: I'm not sure which exact optimizations are responsible, but based on --ddump-simple,
* "inside" is not allocating any lists at all. It's just a couple loops over unboxed ints * "outside" is actually allocating a (single) list data structure and has an inner loop and an outer loop, both of which traverse the list
GHC seems to be too aggressive about sharing "range" in "outside". Adding a unit argument to "range" makes both functions go fast.
In "outside", "range" is a single (monomorphic) CAF binding, so I would expect it to be shared aggressively. Even if you do something like "inside" with explicit naming: inside2 :: Int -> Int inside2 n = sum [i + j | i <- range1, j <- range2] where range1 = [0..n-1] range2 = [0..n-1] then "range2" is shared between all iterations over "range1", which I expect to be allocated and traversed. I haven't tested to be sure. Historically I have used crass hacks to prevent sharing by introducing data dependencies: inside3 :: Int -> Int inside3 n = sum [i + j | i <- [0..n-1], j <- [i+0-i..n-1]] That "inside" can be optimized to avoid allocations of the inner loop is a pleasant surprise, I hope it doesn't lead to unpleasant surprises in other circumstances.
On Mon, Aug 19, 2019 at 8:14 PM Emilio Francesquini
mailto:francesquini@gmail.com> wrote: Hello Cafe,
While investigating a performance problem I stumbled upon what I eventually reduced to the example below:
module Main where
import Data.Time.Clock
outside :: Int -> Int outside n = sum [i + j | i <- range, j <- range] where range = [0..n-1]
inside :: Int -> Int inside n = sum [i + j | i <- [0..n-1], j <- [0..n-1]]
main :: IO () main = do t0 <- getCurrentTime print $ inside 10000 t1 <- getCurrentTime print $ outside 10000 t2 <-getCurrentTime
print (diffUTCTime t1 t0) print (diffUTCTime t2 t1)
Compiling with -O2, up to GHC 8.2.2, both `inside` and `outside` functions would take the same amount of time to execute. Somewhere between GHC 8.2.2 and 8.6.4 something changed (possibly some new optimization) making `inside` run ~4x faster on my machine. With LLVM the difference is even bigger.
It is not that `outside` got slower, but that `inside` got much faster. I'm curious to what optimizations might be happening to the `inside` function that would not fire on the outside function.
Any hints?
Claude -- https://mathr.co.uk