[GHC] #14208: Performance with -O2 is worse than with -O0 which is worse than runghc

GHC

9 Sep

1:39 p.m.

#14208: Performance with -O2 is worse than with -O0 which is worse than runghc
-------------------------------------+-------------------------------------
        Reporter:  harendra          |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by harendra):

 I added a branch named `does-not-occur` in the repo with a simpler case where the problem does not occur even when the code is split across modules.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:1
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

GHC

11 Sep

2:44 a.m.

Comment (by harendra):

 Earlier I thought the problem occurred with `-O2`, but that is not the case: `-O2` is irrelevant. `-O2` gives the same performance as the default, i.e. without any optimization flags. The difference is between `-O0` and the absence of it. Adding `-O0` improves performance drastically. I have updated the GitHub repo and removed `-O2`.

 This is such an ironic case: `runghc` has the best performance, the next best is `-O0`, and any optimization is the worst! Maybe we should just reverse the flags :-)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:3

GHC

3:23 a.m.

New subject: [GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best

Comment (by harendra):

 It was hard to find in the manual the difference between O0 and the default. The manual says "O0 is the default", which seems to be incorrect (NEED TO BE FIXED). So I had to turn to the GHC code.

 Alright, so O0 seems to ignore or omit "interface-pragmas", whatever the heck they are. From DynFlags.hs:

 {{{#!hs
     , ([0],     Opt_IgnoreInterfacePragmas)
     , ([0],     Opt_OmitInterfacePragmas)
 }}}

 When I used "-fignore-interface-pragmas" I got the same improvement in performance. The GHC manual documents this flag but says nothing about what it really means (NEED TO BE FIXED). I can only guess. Since this has something to do with the interface, it also explains why the performance is good when the whole code is in the same module and bad when the code is split into two modules.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:5
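One way to read the `DynFlags.hs` excerpt quoted above: each entry pairs a flag with the list of optimisation levels at which that flag is switched on. The following is a simplified toy model of that table, not GHC's actual code. The `Flag` type, the `Specialise` entry and the `flagsAt` helper are hypothetical names introduced only for illustration.

```haskell
module Main where

-- Hypothetical stand-ins for a few of GHC's general flags.
data Flag
  = IgnoreInterfacePragmas
  | OmitInterfacePragmas
  | Specialise -- illustrative entry, assumed here to be on at -O1 and -O2
  deriving (Eq, Show)

-- Each entry lists the optimisation levels at which the flag is switched on.
-- The first two rows mirror the excerpt quoted in the comment above.
optLevelFlags :: [([Int], Flag)]
optLevelFlags =
  [ ([0],    IgnoreInterfacePragmas)
  , ([0],    OmitInterfacePragmas)
  , ([1, 2], Specialise)
  ]

-- Flags implied by a given -O<n> in this toy model.
flagsAt :: Int -> [Flag]
flagsAt n = [f | (levels, f) <- optLevelFlags, n `elem` levels]

main :: IO ()
main = mapM_ report [0, 1, 2]
  where
    report n = putStrLn ("-O" ++ show n ++ ": " ++ show (flagsAt n))
```

In this model the two interface-pragma flags are implied only by level 0, which matches the observation that passing `-fignore-interface-pragmas` by hand reproduced the `-O0` behaviour.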

GHC

6:22 a.m.

Changes (by harendra):

 * cc: simonpj (added)

Comment:

 The pre-inlining flag maps to `pre_inline_unconditionally` in `SimplUtils.hs`. Ccing SPJ, who seems to have touched it last via commit 33452dfc6cf.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:7

GHC

7:22 a.m.

Comment (by harendra):

 I measured runghc vs ghc performance for this test case on 7.10.3, 8.0.2 and 8.2.1. It seems runghc has always been faster. The difference was small in 7.10.3, but runghc seems to have improved a lot in 8.0, performing better than ghc.

 {{{
 7.10.3 ghc      : time 11.43 ms (11.05 ms .. 11.75 ms)
 7.10.3 runghc   : time 10.55 ms (9.461 ms .. 11.46 ms)
 8.0.2  ghc      : time 11.00 ms (10.64 ms .. 11.38 ms)
 8.0.2  runghc   : time 6.441 ms (6.025 ms .. 6.790 ms)
 8.2.1  ghc (O0) : time 8.986 ms (8.728 ms .. 9.313 ms)
 8.2.1  runghc   : time 4.598 ms (4.350 ms .. 4.890 ms)
 }}}

 It will be awesome if ghc can be as good as runghc.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:9
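To make the gap in these timings easier to compare across versions, the compiled-to-runghc ratio can be computed directly from the point estimates. This is only a small arithmetic sketch: the numbers are copied from the measurements above, while the names `timings` and `slowdown` are made up for the example.

```haskell
module Main where

import Text.Printf (printf)

-- (GHC version, compiled time in ms, runghc time in ms), point estimates
-- copied from the measurements above. The 8.2.1 compiled figure is the -O0 run.
timings :: [(String, Double, Double)]
timings =
  [ ("7.10.3", 11.43, 10.55)
  , ("8.0.2",  11.00, 6.441)
  , ("8.2.1",  8.986, 4.598)
  ]

-- How many times slower the compiled program is than runghc.
slowdown :: Double -> Double -> Double
slowdown compiled interpreted = compiled / interpreted

main :: IO ()
main = mapM_ report timings
  where
    report (ver, c, i) =
      printf "%-7s compiled/runghc = %.2f\n" ver (slowdown c i)
```

The ratio grows with each release in this data set, which is the trend the comment describes.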

GHC

2:29 p.m.

Comment (by bgamari):

 > It was hard to find in the manual the difference between `O0` and the default. The manual says "`O0` is the default", which seems to be incorrect (NEED TO BE FIXED).

 I've opened #14214 to track this.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:11

GHC

11:39 a.m.

Comment (by harendra):

 I added a much simpler example on the "simplified" branch in the same repo. I can paste it here as well:

 Main.hs
 {{{#!hs
 import List
 ...
 len :: IO Int
 len = do
     xs <- toList $ (foldr (<>) mempty $ map (\x -> Yield x Stop) [1..100000 :: Int])
     return (length xs)
 }}}

 List.hs
 {{{#!hs
 module List where

 import Control.Monad (liftM)

 data List a = Stop | Yield a (List a)

 instance Monoid (List a) where
     mempty = Stop
     mappend x y =
         case x of
             Stop -> y
             Yield a r -> Yield a (mappend r y)

 toList :: Monad m => List a -> m [a]
 toList m =
     case m of
         Stop -> return []
         Yield a r -> liftM (a :) (toList r)
 }}}

 It essentially generates a custom list in the main module and calls the `toList` function from another module to convert it into a Haskell list. The perf difference is not as dramatic as in the previous example, but it is significant.

 All in the same module:
 {{{
 -O0              : 14ms
 -O1              : 8ms
 -fno-pre-inlining: 4ms
 }}}

 Different modules:
 {{{
 -O0              : 8ms
 -O1              : 14ms
 -fno-pre-inlining: 8ms
 INLINE toList    : 4ms
 }}}

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:13

GHC

3:09 p.m.

Comment (by harendra):

 Another question that I am seeking an answer for: is there a combination of options that makes a multi-module program behave, from the performance perspective, as if all the code were in a single module?

 I expected `-fexpose-all-unfoldings` to do that for me, but it does not seem to be equivalent. I thought it was equivalent to marking all functions INLINABLE, but it does not seem to do that: even with this flag I need to mark a function INLINABLE to get it inlined. What exactly does this flag do?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:15

GHC

5:29 p.m.

Comment (by harendra):

 I guess some function getting inlined too early is preventing list fusion.

 The combination of `-fexpose-all-unfoldings` and `-fspecialise-aggressively` is not "exactly" equivalent to putting everything in the same module. O1 with everything in the same module finishes in 8ms, while the combination of these two finishes in 4ms. So they do something more. I guess the added effect is that they make everything INLINABLE.

 When everything is in the same module and `toList` is marked NOINLINE, it takes 14ms (i.e. the worst case) irrespective of whether the monoid functions are marked INLINE or not.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:17
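For readers who want to try the flag combination discussed here without touching the build configuration, the two flags can also be set per module with an `OPTIONS_GHC` pragma. This is a minimal sketch, assuming the `List.hs` module from the simplified example; whether it reproduces the 4ms figure will depend on the GHC version and on the optimisation level used for the rest of the build.

```haskell
{-# OPTIONS_GHC -fexpose-all-unfoldings -fspecialise-aggressively #-}
-- The pragma above applies the two flags from the discussion to this module
-- only. Everything below is the library module from the simplified example.
module List (List (..), toList) where

import Control.Monad (liftM)

data List a = Stop | Yield a (List a)

-- Deliberately left without an INLINE or INLINABLE pragma, so that any
-- cross-module specialisation comes from the flags rather than a pragma.
toList :: Monad m => List a -> m [a]
toList m = case m of
  Stop      -> return []
  Yield a r -> liftM (a :) (toList r)
```

Setting the flags per module keeps the experiment local, which makes it easier to compare against a build where only the pragma line is removed.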

GHC

6:54 p.m.

Comment (by MikolajKonarski):

 Replying to [comment:17 harendra]:
 > The combination of `-fexpose-all-unfoldings` and `-fspecialise-aggressively` is not "exactly" equivalent to putting everything in the same module. O1 with everything in the same module finishes in 8ms while with the combination of these two finishes in 4ms. So they do something more. I guess the added effect is that they make everything INLINABLE.

 Yep, forgot that bit. That's exactly what I use the two options for: to be able to split things among modules and to avoid INLINABLE for every polymorphic function. With this, I only ever need an occasional INLINE in random places, but then it's not for specialization, but real inlining.

 > When everything is in the same module and `toList` marked NOINLINE then it takes 14ms (i.e. the worst case) irrespective of the monoid functions being marked INLINE or not.

 And what if they are marked NOINLINE? In any case, that means we now have an example of failed fusion that fits in one module. And additionally, we know that GHC can effectively generate such an example from an innocent-looking set of modules, by automatically inlining too much (or not enough).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:19

GHC

19 Sep

10:53 a.m.

Comment (by harendra):

 A similar issue has been reported in this Stack Overflow question as well: https://stackoverflow.com/questions/46296919/haskell-webframeworks-speed-ghci-vs-compiled . There is an appalling difference between the ghc-compiled code and the ghci code; in the case of snap it is a 50x difference! That sounds unacceptable, whatever the reason may be.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:21

GHC

20 Sep

8:57 a.m.

Changes (by sibi):

 * cc: sibi (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:23

GHC

3 Mar

12:15 p.m.

Changes (by lelf):

 * cc: lelf (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:25

GHC

27 Mar

7:10 a.m.

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best
-------------------------------------+-------------------------------------
        Reporter:  harendra          |                Owner:  osa1
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by osa1):

 I can somewhat reproduce this with HEAD. I'm currently focusing on the compiled code issues, ignoring GHCi.

 My setup: I have two files.

 Main.hs:
 {{{#!haskell
 {-# LANGUAGE CPP #-}

 module Main where

 import Criterion.Main (defaultMain, bench, nfIO)

 -- Uncomment this to have all the code in one module
 -- #define SINGLE_MODULE

 #ifndef SINGLE_MODULE

 import List

 #else

 import Control.Monad (liftM)

 data List a = Stop | Yield a (List a)

 instance Semigroup (List a) where
   x <> y =
     case x of
       Stop -> y
       Yield a r -> Yield a (mappend r y)

 instance Monoid (List a) where
   -- {-# INLINE mempty #-}
   mempty = Stop
   -- {-# INLINE mappend #-}
   mappend = (<>)

 -- {-# NOINLINE toList #-}
 toList :: Monad m => List a -> m [a]
 toList m = case m of
   Stop -> return []
   Yield a r -> liftM (a :) (toList r)

 #endif

 {-# NOINLINE len #-}
 len :: IO Int
 len = do
   xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop) [1..100000 :: Int])
   return (length xs)

 main :: IO ()
 main = defaultMain [ bench "len" $ nfIO len ]
 }}}

 When I'm measuring allocations I remove the criterion imports and use this main:
 {{{
 main = len >>= print
 }}}

 Note that I have a `NOINLINE` on `len` to avoid optimising it in the benchmark site. The original report does not have this.

 List.hs:
 {{{#!haskell
 module List where

 import Control.Monad (liftM)

 data List a = Stop | Yield a (List a)

 instance Semigroup (List a) where
   x <> y =
     case x of
       Stop -> y
       Yield a r -> Yield a (mappend r y)

 instance Monoid (List a) where
   mempty = Stop
   mappend = (<>)

 toList :: Monad m => List a -> m [a]
 toList m = case m of
   Stop -> return []
   Yield a r -> liftM (a :) (toList r)
 }}}

 I have six configurations:

 - -O0
 - -O1
 - -O2
 - -O0 -DSINGLE_MODULE
 - -O1 -DSINGLE_MODULE
 - -O2 -DSINGLE_MODULE

 I first run all these with `+RTS -s`, using `main = len >>= print` as the main function.

 {{{
 ============ -O0 ===============================================================
       49,723,096 bytes allocated in the heap
       25,729,264 bytes copied during GC
        6,576,744 bytes maximum residency (5 sample(s))
           29,152 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        41 colls,     0 par    0.011s   0.011s     0.0003s    0.0008s
   Gen  1         5 colls,     0 par    0.010s   0.010s     0.0020s    0.0047s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.011s  (  0.012s elapsed)
   GC      time    0.021s  (  0.021s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.032s  (  0.033s elapsed)

   %GC     time      64.0%  (63.8% elapsed)

   Alloc rate    4,366,732,069 bytes per MUT second

   Productivity  35.6% of total user, 35.9% of total elapsed

 ============ -O1 ===============================================================
       28,922,528 bytes allocated in the heap
       18,195,344 bytes copied during GC
        4,066,200 bytes maximum residency (5 sample(s))
          562,280 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        22 colls,     0 par    0.008s   0.008s     0.0004s    0.0016s
   Gen  1         5 colls,     0 par    0.008s   0.008s     0.0016s    0.0029s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.009s  (  0.009s elapsed)
   GC      time    0.016s  (  0.016s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.025s  (  0.025s elapsed)

   %GC     time      63.8%  (63.9% elapsed)

   Alloc rate    3,262,174,222 bytes per MUT second

   Productivity  35.3% of total user, 35.3% of total elapsed

 ============ -O2 ===============================================================
       28,922,528 bytes allocated in the heap
       18,195,344 bytes copied during GC
        4,066,200 bytes maximum residency (5 sample(s))
          562,280 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        22 colls,     0 par    0.008s   0.008s     0.0003s    0.0008s
   Gen  1         5 colls,     0 par    0.008s   0.008s     0.0017s    0.0029s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.008s  (  0.008s elapsed)
   GC      time    0.016s  (  0.016s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.024s  (  0.024s elapsed)

   %GC     time      66.6%  (66.6% elapsed)

   Alloc rate    3,714,684,268 bytes per MUT second

   Productivity  32.7% of total user, 32.7% of total elapsed

 ============ -O0 -DSINGLE_MODULE ===============================================
       49,723,032 bytes allocated in the heap
       25,729,184 bytes copied during GC
        6,576,728 bytes maximum residency (5 sample(s))
           29,152 bytes maximum slop
               13 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        41 colls,     0 par    0.010s   0.010s     0.0003s    0.0008s
   Gen  1         5 colls,     0 par    0.010s   0.010s     0.0019s    0.0042s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.011s  (  0.011s elapsed)
   GC      time    0.020s  (  0.020s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.031s  (  0.031s elapsed)

   %GC     time      65.0%  (65.0% elapsed)

   Alloc rate    4,609,752,610 bytes per MUT second

   Productivity  34.8% of total user, 34.8% of total elapsed

 ============ -O1 -DSINGLE_MODULE ===============================================
       16,122,496 bytes allocated in the heap
        7,392,664 bytes copied during GC
        3,438,424 bytes maximum residency (4 sample(s))
           55,464 bytes maximum slop
                7 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        10 colls,     0 par    0.004s   0.004s     0.0004s    0.0008s
   Gen  1         4 colls,     0 par    0.005s   0.005s     0.0012s    0.0019s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.004s  (  0.004s elapsed)
   GC      time    0.009s  (  0.009s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.014s  (  0.014s elapsed)

   %GC     time      66.5%  (66.6% elapsed)

   Alloc rate    3,663,260,346 bytes per MUT second

   Productivity  32.5% of total user, 32.5% of total elapsed

 ============ -O2 -DSINGLE_MODULE ===============================================
       13,722,496 bytes allocated in the heap
        6,798,640 bytes copied during GC
        2,158,376 bytes maximum residency (3 sample(s))
           33,248 bytes maximum slop
                7 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0         9 colls,     0 par    0.007s   0.007s     0.0008s    0.0021s
   Gen  1         3 colls,     0 par    0.004s   0.005s     0.0015s    0.0030s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.004s  (  0.004s elapsed)
   GC      time    0.012s  (  0.012s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.016s  (  0.016s elapsed)

   %GC     time      74.2%  (74.3% elapsed)

   Alloc rate    3,479,572,009 bytes per MUT second

   Productivity  25.2% of total user, 25.2% of total elapsed
 }}}

 Summary: allocations consistently reduce as optimisation level increases.

 Secondly, I run the criterion benchmark to measure runtime, using the same configurations:

 {{{
 ============ -O0 ===============================================================
 benchmarking len
 time                 13.50 ms   (13.23 ms .. 13.71 ms)
                      0.998 R²   (0.997 R² .. 0.999 R²)
 mean                 13.55 ms   (13.35 ms .. 13.81 ms)
 std dev              613.5 μs   (424.7 μs .. 918.2 μs)
 variance introduced by outliers: 18% (moderately inflated)

 ============ -O1 ===============================================================
 benchmarking len
 time                 15.83 ms   (15.62 ms .. 16.02 ms)
                      0.999 R²   (0.998 R² .. 0.999 R²)
 mean                 15.92 ms   (15.75 ms .. 16.10 ms)
 std dev              463.5 μs   (340.2 μs .. 669.1 μs)

 ============ -O2 ===============================================================
 benchmarking len
 time                 15.70 ms   (15.51 ms .. 15.90 ms)
                      0.999 R²   (0.999 R² .. 1.000 R²)
 mean                 15.74 ms   (15.59 ms .. 15.87 ms)
 std dev              355.2 μs   (271.2 μs .. 470.7 μs)

 ============ -O0 -DSINGLE_MODULE ===============================================
 benchmarking len
 time                 14.85 ms   (13.81 ms .. 16.06 ms)
                      0.976 R²   (0.959 R² .. 0.997 R²)
 mean                 13.60 ms   (13.22 ms .. 14.14 ms)
 std dev              1.152 ms   (773.1 μs .. 1.614 ms)
 variance introduced by outliers: 41% (moderately inflated)

 ============ -O1 -DSINGLE_MODULE ===============================================
 benchmarking len
 time                 6.802 ms   (6.702 ms .. 6.922 ms)
                      0.997 R²   (0.994 R² .. 0.999 R²)
 mean                 6.845 ms   (6.765 ms .. 6.945 ms)
 std dev              261.8 μs   (201.3 μs .. 336.8 μs)
 variance introduced by outliers: 18% (moderately inflated)

 ============ -O2 -DSINGLE_MODULE ===============================================
 benchmarking len
 time                 6.614 ms   (6.501 ms .. 6.712 ms)
                      0.998 R²   (0.997 R² .. 0.999 R²)
 mean                 6.399 ms   (6.317 ms .. 6.472 ms)
 std dev              239.1 μs   (201.7 μs .. 292.5 μs)
 variance introduced by outliers: 18% (moderately inflated)
 }}}

 So:

 - Everything works as expected in the single-module case. Both runtime and allocations get lower as the optimisation level increases.
 - In multi-module, -O1 and -O2 produce identical outputs; the runtime difference is just noise.
 - In multi-module we get better allocations with -O1 vs. -O0, but runtime gets somewhat worse.

 This is what we should investigate. To see why we allocate less in multi-module with -O1, I compared the STG outputs (multi-module -O0 vs. multi-module -O1). The answer is fusion kicking in with -O1: we have an intermediate function application for `foldr mappend mempty` in the -O0 output which disappears with -O1.

 Why does the runtime get worse? I don't know, but I suspect it's just noise. Really the code is better (as in, it does less work) with -O1 than with -O0.

 I also compared single-module -O1 with multi-module -O1. The reason why single module is better is because the `toList` function is not inlined cross-module but is inlined within the module.

 So I think in the compiled case there are no problems.

 The only remaining question is why GHCi is faster than compiled code.

 I've attached a tarball with my setup and outputs. It includes the Core/STG outputs of all 6 configurations, as well as the criterion and `+RTS -s` outputs.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:27
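As a complement to reading `+RTS -s` summaries by hand, allocation for a single action can also be sampled from inside the program through `GHC.Stats`. The sketch below is not part of the attached setup. The helper name `allocatedBy` is made up, and it assumes (to the best of my understanding) that the `allocated_bytes` counter is refreshed at the end of a garbage collection, which is why a collection is forced before each reading.

```haskell
module Main where

import Control.Exception (evaluate)
import Data.Word (Word64)
import GHC.Stats (allocated_bytes, getRTSStats, getRTSStatsEnabled)
import System.Mem (performGC)

-- Bytes allocated while running an action, or Nothing when RTS statistics
-- are not enabled for this run (they are switched on with +RTS -T).
allocatedBy :: IO a -> IO (a, Maybe Word64)
allocatedBy act = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then do
      r <- act
      return (r, Nothing)
    else do
      -- Assumption: the counter is only updated when a GC finishes,
      -- so force a collection before each reading.
      performGC
      before <- allocated_bytes <$> getRTSStats
      r <- act
      performGC
      after <- allocated_bytes <$> getRTSStats
      return (r, Just (after - before))

main :: IO ()
main = do
  -- Stand-in workload; in the ticket's setup this would be `len`.
  (n, bytes) <- allocatedBy (evaluate (length [1 .. 100000 :: Int]))
  putStrLn ("result: " ++ show n)
  putStrLn ("allocated: " ++ maybe "unknown (run with +RTS -T)" show bytes)
```

Depending on how the executable is linked, passing `+RTS -T` may require building with `-rtsopts`. Without it the helper still runs the action and simply reports that no figure is available.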

GHC

8:07 a.m.

Comment (by simonpj):

 > So I think in the compiled case there are no problems.

 OK good; that's reassuring. Do you know why the single-module case gets better? I suspect it may be that `toList` is specialised. If you add `{-# INLINABLE toList #-}` does the difference go away?

 Perhaps this isn't a big deal -- it's reasonable for single module to be faster -- but GHC does make real efforts NOT to penalise you for multi-module, so I'm curious.

 > Only remaining question is why GHCi is faster than compiled code.

 Can you reproduce this difference? It is indeed puzzling!

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:28

GHC

2:30 p.m.

Comment (by osa1):

 Just updated the previous comment: `toList` is never inlined, but when it's in the same module as the using code or marked as `INLINABLE` it gets specialized.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:30
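The two routes to specialisation mentioned in this exchange can be written down as pragmas on `toList`. This is a minimal single-file sketch rather than the ticket's actual setup: `fromHaskellList` is a hypothetical helper added so the example runs on its own, and the explicit `SPECIALIZE` line is an extra illustration that was not suggested in the thread.

```haskell
module Main where

import Control.Monad (liftM)

data List a = Stop | Yield a (List a)

-- INLINABLE keeps the definition available to importing modules, so GHC
-- can create a copy specialised to the monad used at the call site.
{-# INLINABLE toList #-}
-- Alternatively, a specialised copy can be requested up front for one type.
{-# SPECIALIZE toList :: List a -> IO [a] #-}
toList :: Monad m => List a -> m [a]
toList m = case m of
  Stop      -> return []
  Yield a r -> liftM (a :) (toList r)

-- Hypothetical helper, not part of the ticket's code.
fromHaskellList :: [a] -> List a
fromHaskellList = foldr Yield Stop

main :: IO ()
main = do
  xs <- toList (fromHaskellList [1 .. 5 :: Int])
  print xs -- prints [1,2,3,4,5]
```

In the two-module layout, the pragmas would sit next to `toList` in `List.hs`, and the effect can be checked by looking for a specialised copy of `toList` in the Core output of `Main`.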

GHC

10:06 p.m.

Comment (by mpickering):

 `fmap` in that module doesn't have an `INLINE` pragma on it? Should it?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:32


GHC

10:38 p.m.

New subject: [GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best

Comment (by mpickering):

 I can reproduce this when all the dependencies are installed: building with `ghc benchmarks/Main.hs -isrc/ -O2` gives the slow program, compared with `ghc benchmarks/Main.hs -isrc`.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:34


GHC

11:59 p.m.

New subject: [GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best

Comment (by mpickering):

 Another possible answer is that your library has a lot of recursive functions in it and the base types are written in CPS, which means things don't optimise too well. Again, this is not an answer as to why the optimiser makes the program slower.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:36
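To show what "base types written in CPS" plus recursive functions can look like, here is a small sketch. It assumes a stream type encoded with a stop continuation and a yield continuation; the names `Stream`, `nil`, `cons` and `fromList` are illustrative, and the library's real definitions do not appear in this thread.

```haskell
{-# LANGUAGE RankNTypes #-}
module Main (main) where

-- Hypothetical CPS-encoded stream: the consumer supplies what to do at
-- the end of the stream (stop) and what to do with each element (yield).
newtype Stream a = Stream
  { runStream :: forall r. r -> (a -> Stream a -> r) -> r }

nil :: Stream a
nil = Stream (\stop _ -> stop)

cons :: a -> Stream a -> Stream a
cons x xs = Stream (\_ yield -> yield x xs)

fromList :: [a] -> Stream a
fromList = foldr cons nil

-- Self-recursive, so GHC treats it as a loop breaker and does not inline it.
toList :: Stream a -> [a]
toList s = runStream s [] (\x rest -> x : toList rest)

instance Functor Stream where
  -- Also recursive: every element goes through a fresh closure.
  fmap f s = Stream (\stop yield ->
    runStream s stop (\x rest -> yield (f x) (fmap f rest)))

main :: IO ()
main = print (toList (fmap (+ 1) (fromList [1, 2, 3 :: Int])))
```

Because every operation here is a recursive function over closures, the optimiser has few opportunities for inlining or fusion, which is the concern raised in the comment. It does not by itself explain why optimised code would be slower than unoptimised code.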


GHC

7:44 a.m.

New subject: [GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best

Comment (by simonpj):

> If I change the optimization flags to -O0 for benchmark stanza in cabal file I can get close to ghci performance.

That contradicts what Omer found in comment:27. Nevertheless, if what you say is true, it'd be easier to debug with -O0 than GHCi (which brings the bytecode generator into the picture).

> GHCi is 6x faster than my regular compiled code

This is totally bonkers and we MUST find out what is happening :-).

I suggest not getting diverted into speculation about CPS. We have a repro case; let's just dig into it and find out what is going on.

My suggestions

 * In comment:31 Does the same thing happen with -O0 vs -O, or only with GHCi vs -O?

 * In all repros, do the huge differences also show up in the bytes-allocated numbers? (If so, we don't need the Criterion apparatus.)

 * I notice that in comment:27, in the 2-module case, comparing -O0 and -O1:
   * Allocation is about halved in -O1
   * But runtime actually increases
   That is most peculiar.

 * Matthew says in comment:34 "I can reproduce this..". That's great. But what is "this" precisely? Which version of GHC? What timing data? What happened to allocation and GC numbers?

Somehow a 6x increase in execution time ought not to be hard to find!

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:38
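The bytes-allocated comparison suggested above can be done without Criterion. The simplest route is to run the benchmark executable with `+RTS -s`, which prints total allocation and GC figures on exit. For an in-process number, the sketch below uses `GHC.Stats`; `allocatedDuring` is a hypothetical helper, and the `performGC` calls are an assumption made so that the counters are read right after a collection. The program needs to be built with `-rtsopts` and started with `+RTS -T` for the statistics to be available.

```haskell
module Main (main) where

import Control.Exception (evaluate)
import Data.Word (Word64)
import GHC.Stats (RTSStats (allocated_bytes), getRTSStats, getRTSStatsEnabled)
import System.Mem (performGC)

-- Hypothetical helper: approximate bytes allocated while running an action.
-- Returns Nothing when RTS statistics are not enabled (no `+RTS -T`).
allocatedDuring :: IO a -> IO (Maybe Word64)
allocatedDuring act = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then act >> pure Nothing
    else do
      -- Assumption: force a collection so the counters are up to date.
      performGC
      before <- allocated_bytes <$> getRTSStats
      _ <- act
      performGC
      after <- allocated_bytes <$> getRTSStats
      pure (Just (after - before))

main :: IO ()
main = do
  r <- allocatedDuring (evaluate (sum [1 .. 100000 :: Int]))
  putStrLn (maybe "RTS stats disabled; rerun with +RTS -T"
                  (\n -> show n ++ " bytes allocated") r)
```

Running the same measurement on binaries built with `-O0` and `-O` would show whether the runtime gap is mirrored in allocation, which is the question asked in the second suggestion.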

