
#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): I can somewhat reproduce this with HEAD. I'm currently focusing on the compiled code issues, ignoring GHCi. My setup: I have two files Main.hs: {{{#!haskell {-# LANGUAGE CPP #-} module Main where import Criterion.Main (defaultMain, bench, nfIO) -- Uncomment this to have all the code in one module -- #define SINGLE_MODULE #ifndef SINGLE_MODULE import List #else import Control.Monad (liftM) data List a = Stop | Yield a (List a) instance Semigroup (List a) where x <> y = case x of Stop -> y Yield a r -> Yield a (mappend r y) instance Monoid (List a) where -- {-# INLINE mempty #-} mempty = Stop -- {-# INLINE mappend #-} mappend = (<>) -- {-# NOINLINE toList #-} toList :: Monad m => List a -> m [a] toList m = case m of Stop -> return [] Yield a r -> liftM (a :) (toList r) #endif {-# NOINLINE len #-} len :: IO Int len = do xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop) [1..100000 :: Int]) return (length xs) main :: IO () main = defaultMain [ bench "len" $ nfIO len ] }}} When I'm measuring allocations I remove criterion imports and use this main: {{{ main = len >>= print }}} Note that I have a `NOINLINE` on `len` to avoid optimising it in the benchmark site. The original report does not have this. List.hs: {{{#!haskell module List where import Control.Monad (liftM) data List a = Stop | Yield a (List a) instance Semigroup (List a) where x <> y = case x of Stop -> y Yield a r -> Yield a (mappend r y) instance Monoid (List a) where mempty = Stop mappend = (<>) toList :: Monad m => List a -> m [a] toList m = case m of Stop -> return [] Yield a r -> liftM (a :) (toList r) }}} I have three configurations: - -O0 - -O1 - -O2 - -O0 -DSINGLE_MODULE - -O1 -DSINGLE_MODULE - -O2 -DSINGLE_MODULE I first run all these with `+RTS -s` using `main = len >>= print` as the main function. {{{ ============ -O0 =============================================================== 49,723,096 bytes allocated in the heap 25,729,264 bytes copied during GC 6,576,744 bytes maximum residency (5 sample(s)) 29,152 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 41 colls, 0 par 0.011s 0.011s 0.0003s 0.0008s Gen 1 5 colls, 0 par 0.010s 0.010s 0.0020s 0.0047s INIT time 0.000s ( 0.000s elapsed) MUT time 0.011s ( 0.012s elapsed) GC time 0.021s ( 0.021s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.032s ( 0.033s elapsed) %GC time 64.0% (63.8% elapsed) Alloc rate 4,366,732,069 bytes per MUT second Productivity 35.6% of total user, 35.9% of total elapsed ============ -O1 =============================================================== 28,922,528 bytes allocated in the heap 18,195,344 bytes copied during GC 4,066,200 bytes maximum residency (5 sample(s)) 562,280 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 22 colls, 0 par 0.008s 0.008s 0.0004s 0.0016s Gen 1 5 colls, 0 par 0.008s 0.008s 0.0016s 0.0029s INIT time 0.000s ( 0.000s elapsed) MUT time 0.009s ( 0.009s elapsed) GC time 0.016s ( 0.016s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.025s ( 0.025s elapsed) %GC time 63.8% (63.9% elapsed) Alloc rate 3,262,174,222 bytes per MUT second Productivity 35.3% of total user, 35.3% of total elapsed ============ -O2 =============================================================== 28,922,528 bytes allocated in the heap 18,195,344 bytes copied during GC 4,066,200 bytes maximum residency (5 sample(s)) 562,280 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 22 colls, 0 par 0.008s 0.008s 0.0003s 0.0008s Gen 1 5 colls, 0 par 0.008s 0.008s 0.0017s 0.0029s INIT time 0.000s ( 0.000s elapsed) MUT time 0.008s ( 0.008s elapsed) GC time 0.016s ( 0.016s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.024s ( 0.024s elapsed) %GC time 66.6% (66.6% elapsed) Alloc rate 3,714,684,268 bytes per MUT second Productivity 32.7% of total user, 32.7% of total elapsed ============ -O0 -DSINGLE_MODULE =============================================== 49,723,032 bytes allocated in the heap 25,729,184 bytes copied during GC 6,576,728 bytes maximum residency (5 sample(s)) 29,152 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 41 colls, 0 par 0.010s 0.010s 0.0003s 0.0008s Gen 1 5 colls, 0 par 0.010s 0.010s 0.0019s 0.0042s INIT time 0.000s ( 0.000s elapsed) MUT time 0.011s ( 0.011s elapsed) GC time 0.020s ( 0.020s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.031s ( 0.031s elapsed) %GC time 65.0% (65.0% elapsed) Alloc rate 4,609,752,610 bytes per MUT second Productivity 34.8% of total user, 34.8% of total elapsed ============ -O1 -DSINGLE_MODULE =============================================== 16,122,496 bytes allocated in the heap 7,392,664 bytes copied during GC 3,438,424 bytes maximum residency (4 sample(s)) 55,464 bytes maximum slop 7 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 10 colls, 0 par 0.004s 0.004s 0.0004s 0.0008s Gen 1 4 colls, 0 par 0.005s 0.005s 0.0012s 0.0019s INIT time 0.000s ( 0.000s elapsed) MUT time 0.004s ( 0.004s elapsed) GC time 0.009s ( 0.009s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.014s ( 0.014s elapsed) %GC time 66.5% (66.6% elapsed) Alloc rate 3,663,260,346 bytes per MUT second Productivity 32.5% of total user, 32.5% of total elapsed ============ -O2 -DSINGLE_MODULE =============================================== 13,722,496 bytes allocated in the heap 6,798,640 bytes copied during GC 2,158,376 bytes maximum residency (3 sample(s)) 33,248 bytes maximum slop 7 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 9 colls, 0 par 0.007s 0.007s 0.0008s 0.0021s Gen 1 3 colls, 0 par 0.004s 0.005s 0.0015s 0.0030s INIT time 0.000s ( 0.000s elapsed) MUT time 0.004s ( 0.004s elapsed) GC time 0.012s ( 0.012s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.016s ( 0.016s elapsed) %GC time 74.2% (74.3% elapsed) Alloc rate 3,479,572,009 bytes per MUT second Productivity 25.2% of total user, 25.2% of total elapsed }}} Summary: allocations consistently reduce as optimisation level increases. Secondly I run criterion benchmark to measure runtime, using the same configurations: {{{ ============ -O0 =============================================================== benchmarking len time 13.50 ms (13.23 ms .. 13.71 ms) 0.998 R² (0.997 R² .. 0.999 R²) mean 13.55 ms (13.35 ms .. 13.81 ms) std dev 613.5 μs (424.7 μs .. 918.2 μs) variance introduced by outliers: 18% (moderately inflated) ============ -O1 =============================================================== benchmarking len time 15.83 ms (15.62 ms .. 16.02 ms) 0.999 R² (0.998 R² .. 0.999 R²) mean 15.92 ms (15.75 ms .. 16.10 ms) std dev 463.5 μs (340.2 μs .. 669.1 μs) ============ -O2 =============================================================== benchmarking len time 15.70 ms (15.51 ms .. 15.90 ms) 0.999 R² (0.999 R² .. 1.000 R²) mean 15.74 ms (15.59 ms .. 15.87 ms) std dev 355.2 μs (271.2 μs .. 470.7 μs) ============ -O0 -DSINGLE_MODULE =============================================== benchmarking len time 14.85 ms (13.81 ms .. 16.06 ms) 0.976 R² (0.959 R² .. 0.997 R²) mean 13.60 ms (13.22 ms .. 14.14 ms) std dev 1.152 ms (773.1 μs .. 1.614 ms) variance introduced by outliers: 41% (moderately inflated) ============ -O1 -DSINGLE_MODULE =============================================== benchmarking len time 6.802 ms (6.702 ms .. 6.922 ms) 0.997 R² (0.994 R² .. 0.999 R²) mean 6.845 ms (6.765 ms .. 6.945 ms) std dev 261.8 μs (201.3 μs .. 336.8 μs) variance introduced by outliers: 18% (moderately inflated) ============ -O2 -DSINGLE_MODULE =============================================== benchmarking len time 6.614 ms (6.501 ms .. 6.712 ms) 0.998 R² (0.997 R² .. 0.999 R²) mean 6.399 ms (6.317 ms .. 6.472 ms) std dev 239.1 μs (201.7 μs .. 292.5 μs) variance introduced by outliers: 18% (moderately inflated) }}} So; - Everything works as expected in single module case. Both runtime and allocations get lower as optimisation level increases. - In multi-module -O1 and -O2 produce identical outputs, runtime difference is just noise. - In multi-module we get better allocations with -O1 vs. -O0, but runtime gets somewhat worse. This is what we should investigate. To see why we allocate less in multi-module with -O1 I compared the STG outputs (multi-module -O0 vs. multi-module -O1), the answer is fusion kicking in with -O1. We have an intermediate function application for `foldr mappend mempty` in -O0 output which disappears with -O1. Why does the runtime get worse? I don't know but I suspect it's just noise. Really the code is better (as in, it does less work) with -O1 than with -O0. I also compared single-module -O1 with multi-module -O1, the reason why single module is better is becuase the `toList` function is not inlined cross-module but it's inlined within the module. So I think in the compiled case there are no problems. Only remaining question is why GHCi is faster than compiled code. I've attached a tarball with my setup + outputs. It includes Core/STG outputs of all 6 configurations and criterion and +RTS -s outputs as well. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler