Re: [Haskell-cafe] Optimizations for list comprehension

21 Aug 2019

      Hi,

On 20/08/2019 20:12, Emilio Francesquini wrote:
...
Thanks William!
For me it's quite unexpected to see a unit argument making that kind 
of difference for the optimizer...
Anyway, one new trick added to the bag...
Tks.
On Mon, Aug 19, 2019 at 11:34 AM William Yager mailto:will.yager@gmail.com> wrote:
I'm not sure which exact optimizations are responsible, but based
    on --ddump-simple,
* "inside" is not allocating any lists at all. It's just a couple
    loops over unboxed ints
    * "outside" is actually allocating a (single) list data structure
    and has an inner loop and an outer loop, both of which traverse
    the list
GHC seems to be too aggressive about sharing "range" in "outside".
    Adding a unit argument to "range" makes both functions go fast.
In "outside", "range" is a single (monomorphic) CAF binding, so I would 
expect it to be shared aggressively.

Even if you do something like "inside" with explicit naming:

inside2 :: Int -> Int
inside2 n =
   sum [i + j | i <- range1, j <- range2]
   where
     range1 = [0..n-1]
     range2 = [0..n-1]

then "range2" is shared between all iterations over "range1", which I 
expect to be allocated and traversed.  I haven't tested to be sure.

Historically I have used crass hacks to prevent sharing by introducing 
data dependencies:

inside3 :: Int -> Int
inside3 n =
   sum [i + j | i <- [0..n-1], j <- [i+0-i..n-1]]

That "inside" can be optimized to avoid allocations of the inner loop is 
a pleasant surprise, I hope it doesn't lead to unpleasant surprises in 
other circumstances.
...
On Mon, Aug 19, 2019 at 8:14 PM Emilio Francesquini
    mailto:francesquini@gmail.com> wrote:
Hello Cafe,
While investigating a performance problem I stumbled upon what
        I eventually reduced to the example below:
module Main where
import Data.Time.Clock
outside :: Int -> Int
        outside n =
          sum [i + j | i <- range, j <- range]
          where
            range = [0..n-1]
inside :: Int -> Int
        inside n =
          sum [i + j | i <- [0..n-1], j <- [0..n-1]]
main :: IO ()
        main = do
          t0 <- getCurrentTime
          print $ inside 10000
          t1 <- getCurrentTime
          print $ outside 10000
          t2 <-getCurrentTime
  print (diffUTCTime t1 t0)
          print (diffUTCTime t2 t1)
Compiling with -O2, up to GHC 8.2.2, both `inside` and
        `outside` functions would take the same amount of time to
        execute. Somewhere between GHC 8.2.2 and 8.6.4 something
        changed (possibly some new optimization) making `inside` run
        ~4x faster on my machine. With LLVM the difference is even bigger.
It is not that `outside` got slower, but that `inside` got
        much faster. I'm curious to what optimizations might be
        happening to the `inside` function that would not fire on the
        outside function.
Any hints?
Claude
-- 
https://mathr.co.uk

Re: [Haskell-cafe] Optimizations for list comprehension

Claude Heiland-Allen