Re: [GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best

27 Mar 2018

      #14208: Performance with O0 is much better than the default or with -O2, runghc
performs the best
-------------------------------------+-------------------------------------
        Reporter:  harendra          |                Owner:  osa1
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by osa1):

 I can somewhat reproduce this with HEAD.

 I'm currently focusing on the compiled code issues, ignoring GHCi.

 My setup: I have two files

 Main.hs:

 {{{#!haskell
 {-# LANGUAGE CPP #-}

 module Main where

 import Criterion.Main (defaultMain, bench, nfIO)

 -- Uncomment this to have all the code in one module
 -- #define SINGLE_MODULE
 #ifndef SINGLE_MODULE
 import List
 #else
 import Control.Monad  (liftM)

 data List a = Stop | Yield a (List a)

 instance Semigroup (List a) where
     x <> y =
       case x of
           Stop -> y
           Yield a r -> Yield a (mappend r y)

 instance Monoid (List a) where
     -- {-# INLINE mempty #-}
     mempty = Stop
     -- {-# INLINE mappend #-}
     mappend = (<>)

 -- {-# NOINLINE toList #-}
 toList :: Monad m => List a -> m [a]
 toList m =
     case m of
         Stop -> return []
         Yield a r -> liftM (a :) (toList r)

 #endif

 {-# NOINLINE len #-}
 len :: IO Int
 len = do
     xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop)
 [1..100000 :: Int])
     return (length xs)

 main :: IO ()
 main = defaultMain [ bench "len" $ nfIO len ]
 }}}

 When I'm measuring allocations I remove criterion imports and use this
 main:

 {{{
 main = len >>= print
 }}}

 Note that I have a `NOINLINE` on `len` to avoid optimising it in the
 benchmark site. The original report does not have this.

 List.hs:

 {{{#!haskell
 module List where

 import Control.Monad  (liftM)

 data List a = Stop | Yield a (List a)

 instance Semigroup (List a) where
     x <> y =
       case x of
           Stop -> y
           Yield a r -> Yield a (mappend r y)

 instance Monoid (List a) where
     mempty = Stop
     mappend = (<>)

 toList :: Monad m => List a -> m [a]
 toList m =
     case m of
         Stop -> return []
         Yield a r -> liftM (a :) (toList r)
 }}}

 I have three configurations:

 - -O0
 - -O1
 - -O2
 - -O0 -DSINGLE_MODULE
 - -O1 -DSINGLE_MODULE
 - -O2 -DSINGLE_MODULE

 I first run all these with `+RTS -s` using `main = len >>= print` as the
 main function.

 {{{
 ============ -O0
 ===============================================================
       49,723,096 bytes allocated in the heap
       25,729,264 bytes copied during GC
        6,576,744 bytes maximum residency (5 sample(s))
           29,152 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        41 colls,     0 par    0.011s   0.011s     0.0003s
 0.0008s
   Gen  1         5 colls,     0 par    0.010s   0.010s     0.0020s
 0.0047s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.011s  (  0.012s elapsed)
   GC      time    0.021s  (  0.021s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.032s  (  0.033s elapsed)

   %GC     time      64.0%  (63.8% elapsed)

   Alloc rate    4,366,732,069 bytes per MUT second

   Productivity  35.6% of total user, 35.9% of total elapsed

 ============ -O1
 ===============================================================
       28,922,528 bytes allocated in the heap
       18,195,344 bytes copied during GC
        4,066,200 bytes maximum residency (5 sample(s))
          562,280 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        22 colls,     0 par    0.008s   0.008s     0.0004s
 0.0016s
   Gen  1         5 colls,     0 par    0.008s   0.008s     0.0016s
 0.0029s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.009s  (  0.009s elapsed)
   GC      time    0.016s  (  0.016s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.025s  (  0.025s elapsed)

   %GC     time      63.8%  (63.9% elapsed)

   Alloc rate    3,262,174,222 bytes per MUT second

   Productivity  35.3% of total user, 35.3% of total elapsed

 ============ -O2
 ===============================================================
       28,922,528 bytes allocated in the heap
       18,195,344 bytes copied during GC
        4,066,200 bytes maximum residency (5 sample(s))
          562,280 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        22 colls,     0 par    0.008s   0.008s     0.0003s
 0.0008s
   Gen  1         5 colls,     0 par    0.008s   0.008s     0.0017s
 0.0029s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.008s  (  0.008s elapsed)
   GC      time    0.016s  (  0.016s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.024s  (  0.024s elapsed)

   %GC     time      66.6%  (66.6% elapsed)

   Alloc rate    3,714,684,268 bytes per MUT second

   Productivity  32.7% of total user, 32.7% of total elapsed

 ============ -O0 -DSINGLE_MODULE
 ===============================================
       49,723,032 bytes allocated in the heap
       25,729,184 bytes copied during GC
        6,576,728 bytes maximum residency (5 sample(s))
           29,152 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        41 colls,     0 par    0.010s   0.010s     0.0003s
 0.0008s
   Gen  1         5 colls,     0 par    0.010s   0.010s     0.0019s
 0.0042s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.011s  (  0.011s elapsed)
   GC      time    0.020s  (  0.020s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.031s  (  0.031s elapsed)

   %GC     time      65.0%  (65.0% elapsed)

   Alloc rate    4,609,752,610 bytes per MUT second

   Productivity  34.8% of total user, 34.8% of total elapsed

 ============ -O1 -DSINGLE_MODULE
 ===============================================
       16,122,496 bytes allocated in the heap
        7,392,664 bytes copied during GC
        3,438,424 bytes maximum residency (4 sample(s))
           55,464 bytes maximum slop
                7 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        10 colls,     0 par    0.004s   0.004s     0.0004s
 0.0008s
   Gen  1         4 colls,     0 par    0.005s   0.005s     0.0012s
 0.0019s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.004s  (  0.004s elapsed)
   GC      time    0.009s  (  0.009s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.014s  (  0.014s elapsed)

   %GC     time      66.5%  (66.6% elapsed)

   Alloc rate    3,663,260,346 bytes per MUT second

   Productivity  32.5% of total user, 32.5% of total elapsed

 ============ -O2 -DSINGLE_MODULE
 ===============================================
       13,722,496 bytes allocated in the heap
        6,798,640 bytes copied during GC
        2,158,376 bytes maximum residency (3 sample(s))
           33,248 bytes maximum slop
                7 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0         9 colls,     0 par    0.007s   0.007s     0.0008s
 0.0021s
   Gen  1         3 colls,     0 par    0.004s   0.005s     0.0015s
 0.0030s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.004s  (  0.004s elapsed)
   GC      time    0.012s  (  0.012s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.016s  (  0.016s elapsed)

   %GC     time      74.2%  (74.3% elapsed)

   Alloc rate    3,479,572,009 bytes per MUT second

   Productivity  25.2% of total user, 25.2% of total elapsed
 }}}

 Summary: allocations consistently reduce as optimisation level increases.

 Secondly I run criterion benchmark to measure runtime, using the same
 configurations:

 {{{
 ============ -O0
 ===============================================================
 benchmarking len
 time                 13.50 ms   (13.23 ms .. 13.71 ms)
                      0.998 R²   (0.997 R² .. 0.999 R²)
 mean                 13.55 ms   (13.35 ms .. 13.81 ms)
 std dev              613.5 μs   (424.7 μs .. 918.2 μs)
 variance introduced by outliers: 18% (moderately inflated)

 ============ -O1
 ===============================================================
 benchmarking len
 time                 15.83 ms   (15.62 ms .. 16.02 ms)
                      0.999 R²   (0.998 R² .. 0.999 R²)
 mean                 15.92 ms   (15.75 ms .. 16.10 ms)
 std dev              463.5 μs   (340.2 μs .. 669.1 μs)

 ============ -O2
 ===============================================================
 benchmarking len
 time                 15.70 ms   (15.51 ms .. 15.90 ms)
                      0.999 R²   (0.999 R² .. 1.000 R²)
 mean                 15.74 ms   (15.59 ms .. 15.87 ms)
 std dev              355.2 μs   (271.2 μs .. 470.7 μs)

 ============ -O0 -DSINGLE_MODULE
 ===============================================
 benchmarking len
 time                 14.85 ms   (13.81 ms .. 16.06 ms)
                      0.976 R²   (0.959 R² .. 0.997 R²)
 mean                 13.60 ms   (13.22 ms .. 14.14 ms)
 std dev              1.152 ms   (773.1 μs .. 1.614 ms)
 variance introduced by outliers: 41% (moderately inflated)

 ============ -O1 -DSINGLE_MODULE
 ===============================================
 benchmarking len
 time                 6.802 ms   (6.702 ms .. 6.922 ms)
                      0.997 R²   (0.994 R² .. 0.999 R²)
 mean                 6.845 ms   (6.765 ms .. 6.945 ms)
 std dev              261.8 μs   (201.3 μs .. 336.8 μs)
 variance introduced by outliers: 18% (moderately inflated)

 ============ -O2 -DSINGLE_MODULE
 ===============================================
 benchmarking len
 time                 6.614 ms   (6.501 ms .. 6.712 ms)
                      0.998 R²   (0.997 R² .. 0.999 R²)
 mean                 6.399 ms   (6.317 ms .. 6.472 ms)
 std dev              239.1 μs   (201.7 μs .. 292.5 μs)
 variance introduced by outliers: 18% (moderately inflated)
 }}}

 So;

 - Everything works as expected in single module case. Both runtime and
 allocations get lower as optimisation level increases.
 - In multi-module -O1 and -O2 produce identical outputs, runtime
 difference is just noise.
 - In multi-module we get better allocations with -O1 vs. -O0, but runtime
 gets somewhat worse. This is what we should investigate.

 To see why we allocate less in multi-module with -O1 I compared the STG
 outputs (multi-module -O0 vs. multi-module -O1), the answer is fusion
 kicking in with -O1. We have an intermediate function application for
 `foldr mappend mempty` in -O0 output which disappears with -O1.

 Why does the runtime get worse? I don't know but I suspect it's just
 noise. Really the code is better (as in, it does less work) with -O1 than
 with -O0.

 I also compared single-module -O1 with multi-module -O1, the reason why
 single module is better is becuase the `toList` function is not inlined
 cross-module but it's inlined within the module.

 So I think in the compiled case there are no problems. Only remaining
 question is why GHCi is faster than compiled code.

 I've attached a tarball with my setup + outputs. It includes Core/STG
 outputs of all 6 configurations and criterion and +RTS -s outputs as well.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:27
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler