[GHC] #14208: Performance with -O2 is worse than with -O0 which is worse than runghc

#14208: Performance with -O2 is worse than with -O0 which is worse than runghc -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- In this particular case `-O2` is 2x slower than `-O0` and `-O0` is 2x slower than `runghc`. Please see the github repo: https://github.com /harendra-kumar/ghc-perf to reproduce the issue. Readme file in the repo has instructions to reproduce. The issue seems to occur when the code is placed in a different module. When all the code is in the same module the problem does not occur. In that case `-O2` is faster than `-O0`. However, when the code is split into two modules the performance gets inverted. Also, it does not occur always, when I tried to change the code to make it simpler for repro the problem did not occur. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with -O2 is worse than with -O0 which is worse than runghc -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): I added a branch named `does-not-occur` in the repo with a simpler case where the problem does not occur even when the code is split across modules. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with -O2 is worse than with -O0 which is worse than runghc -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I can reproduce this. Thanks for the excellent test case. You can also build everything with `cabal new-build benchmarks`. Adding an `{- INLINABLE toList -}` improves the situation a lot but it is still slower than with -O0. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with -O2 is worse than with -O0 which is worse than runghc -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): Earlier I thought the problem occurs with `-O2` but that is not the case, `-O2` is irrelevant. `-O2` gives the same performance as the default i.e. without any optimization flags. The difference is between `-O0` and the absence of it. Adding `-O0` improves performance drastically. I have updated the github repo and removed `-O2`. This is such an ironic case, `runghc` has the best performance, the next best is `-O0` and any optimization is the worst! Maybe we should just reverse the flags :-) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by harendra: Old description:
In this particular case `-O2` is 2x slower than `-O0` and `-O0` is 2x slower than `runghc`. Please see the github repo: https://github.com /harendra-kumar/ghc-perf to reproduce the issue. Readme file in the repo has instructions to reproduce.
The issue seems to occur when the code is placed in a different module. When all the code is in the same module the problem does not occur. In that case `-O2` is faster than `-O0`. However, when the code is split into two modules the performance gets inverted.
Also, it does not occur always, when I tried to change the code to make it simpler for repro the problem did not occur.
New description: In this particular case `-O2` or the default is 2x slower than `-O0` and `-O0` is 2x slower than `runghc`. Please see the github repo: https://github.com/harendra-kumar/ghc-perf to reproduce the issue. Readme file in the repo has instructions to reproduce. The issue seems to occur when the code is placed in a different module. When all the code is in the same module the problem does not occur. In that case `-O2` or the default is faster than `-O0`. However, when the code is split into two modules the performance gets inverted. Also, it does not occur always, when I tried to change the code to make it simpler for repro the problem did not occur. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): It was hard to find in the manual the difference between O0 and the default. The manual says "O0 is the default", which seems to be incorrect (NEED TO BE FIXED). So I had to turn to the GHC code. Alright, so O0 seems to ignore or omit "interface-pragmas" whatever the heck they are, from DynFlags.hs: {{{#!hs , ([0], Opt_IgnoreInterfacePragmas) , ([0], Opt_OmitInterfacePragmas) }}} When I used "-fignore-interface-pragmas" I got the same improvement in performance. The GHC manual documents this flag but says nothing about what this really means (NEED TO BE FIXED). I can only guess. Since this has to do something with the interface, this also explains why the performance is good when the whole code is in the same module and bad when the code is split into two modules. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): Alright, I guessed that ignoring the interface disables inlining so the problem has something to do with inlining. I looked at all the inlining options available and narrowed down the problem to `-fno-pre-inlining`. Just using `-fno-pre-inlining` brings us back to the performance that we are seeing with -O0. The fine manual does not say what pre-inlining is so I have to look at the GHC code again. I am confused by the fine manual. It says: {{{ -fno-pre-inlining Default: off }}} Does that mean pre-inlining is on by default? Can we make this clearer, explicit? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by harendra): * cc: simonpj (added) Comment: The pre-inlining flag maps to `pre_inline_unconditionally` in `SimplUtils.hs`. Ccing SPJ who seems to have touched it last via commit 33452dfc6cf. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): I can confirm that this problem does not manifest in GHC 8.0.2. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): I measured runghc vs ghc performance for this test case on 7.10.3, 8.0.2 and 8.2.1. It seems runghc has always been faster, though the difference was not much in 7.10.3, runghc seems to have improved a lot in 8.0 performing better than ghc. {{{ 7.10.3 ghc : time 11.43 ms (11.05 ms .. 11.75 ms) 7.10.3 runghc : time 10.55 ms (9.461 ms .. 11.46 ms) 8.0.2 ghc : time 11.00 ms (10.64 ms .. 11.38 ms) 8.0.2 runghc : time 6.441 ms (6.025 ms .. 6.790 ms) 8.2.1 ghc (O0) : time 8.986 ms (8.728 ms .. 9.313 ms) 8.2.1 runghc : time 4.598 ms (4.350 ms .. 4.890 ms) }}} It will be awesome if ghc can be as good as runghc. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by MikolajKonarski): Interesting. From what you write, my guess would be that when all functions are inlined (e.g., everything is placed in the same module) or when none get inlined, performance is good, but some combination of inlined and not inlined functions causes the slowdown (such things do happen and they are impossible to avoid). If I was to try and minimize this example, I'd put everything in the same module and then explore all combinations of INLINE and NOINLINE pragmas on functions. BTW, does the slowdown vanish if you enable -fexpose-all-unfoldings (and -fspecialise- aggressively for a good measure)? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari):
It was hard to find in the manual the difference between `O0` and the default. The manual says "`O0` is the default", which seems to be incorrect (NEED TO BE FIXED).
I've opened #14214 to track this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra):
my guess would be that when all functions are inlined (e.g., everything is placed in the same module) or when none get inlined, performance is good,
That seems to be a correct observation. When everything is in the same module the simplifier can choose to inline any of the functions while when they are in different modules only those functions that are marked INLINE/INLINABLE get inlined. This is supported by the fact that when we mark the `toList` function INLINE the difference is eliminated to a large extent. Also, it does not go away with `-fexpose-all-unfoldings` but it goes away with `-fspecialise-aggressively`. I also suspect that there is an interaction of `foldr/build` fusion with the simplifier. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): I added a much simpler example on the "simplified" branch in the same repo. I can paste it here as well: Main.hs {{{#!hs import List ... len :: IO Int len = do xs <- toList $ (foldr (<>) mempty $ map (\x -> Yield x Stop) [1..100000 :: Int]) return (length xs) }}} List.hs {{{#!hs module List where import Control.Monad (liftM) data List a = Stop | Yield a (List a) instance Monoid (List a) where mempty = Stop mappend x y = case x of Stop -> y Yield a r -> Yield a (mappend r y) toList :: Monad m => List a -> m [a] toList m = case m of Stop -> return [] Yield a r -> liftM (a :) (toList r) }}} It essentially generates a custom list in the main module and calls `toList` function from another module to covert it into a Haskell list. The perf difference is not as dramatic as the previous example but it is significant. All in the same module: {{{ -O0 : 14ms -O1 : 8ms -fno-pre-inlining: 4ms }}} Different modules: {{{ -O0 : 8ms -O1 : 14ms -fno-pre-inlining: 8ms INLINE toList : 4ms }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): When I look at `-ddump-rewrite-rules` output the list fusion rules seem to get fired in the `-fno-pre-inlining` case but not in the `-O1` case. Specifically I do not see the "map" rule (and subsequent other rules triggered by it) in the `-O1` output. As long as we understand why it is happening and there is no easy way to fix it, I guess it is fine. But I hope this can be fixed so that we do not spend time wondering about and figuring out such things and manually tweaking options and pragmas to get to the right combination. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): Another question that I am seeking an answer for - is there a combination of options to make a multi-module program behave the same way as if all the code is in a single module, from the performance perspective? I expected that `-fexpose-all-unfoldings` will do that for me but it does not seem to be equivalent. I thought it is equivalent to making all functions INLINABLE but it does not seem to be doing that. Even when using this flag I need to mark a function INLINABLE to get it inlined. What exactly does this flag do? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by MikolajKonarski): Great job simplifying the example! Every bit of code eliminated helps GHC hackers immensely to tackle this (which may take a while anyway, especially given the busy time of year). Even if the contrived example looks nonsensical and silly. There is less noise, fewer suspects and the Core is less overwhelming. The combination of `-fexpose-all-unfoldings` and `-fspecialise- aggressively` is close to (or even equivalent) to putting everything in a single module. See https://ghc.haskell.org/trac/ghc/ticket/12463#comment:19 and other comments in that thread. The only drawback is that GHC takes much more memory. A hack around that (at least before 8.2.1) is to restart GHC during compilation when it hogs too much memory (or travis_retry after out-of-memory segfault in travis scripts). If you see worse fusion behaviour `-O1` than `-O0`, I guess here is hope it can be fine-tuned in that particular case or even that it's a bug. I wonder who is hacking on the fusion machinery these days... But in general, inlining (especially of only a subset of functions) that makes performance worse is a fact of life, though GHC strives hard not to _automatically_ inline in such suspect cases. I wonder, if you marked `toList` NOINLINE and the Monoid methods INLINE, but put everything in the same module, would you still have the bad behaviour? That would hint that the (partial) inlining inhibits fusion and would show which combination of inline decisions is responsible (so that GHC may be improved for that combination or may be prevented from automatically generating such partial inlining). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): I guess some function getting inlined too early is preventing list fusion. The combination of `-fexpose-all-unfoldings` and `-fspecialise- aggressively` is not "exactly" equivalent to putting everything in the same module. O1 with everything in the same module finishes in 8ms while with the combination of these two finishes in 4ms. So they do something more. I guess the added effect is that they make everything INLINEABLE. When everything is in the same module and `toList` marked NOINLINE then it takes 14ms (i.e. the worst case) irrespective of the monoid functions being marked INLINE or not. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): Harendra, have you read my post about how this part of GHC works? It might be useful for you or at least useful to point other people to. Your tickets have good examples in, I will try to get back to them once I am back from holiday. http://mpickering.github.io/posts/2017-03-20-inlining-and- specialisation.html -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by MikolajKonarski): Replying to [comment:17 harendra]:
The combination of `-fexpose-all-unfoldings` and `-fspecialise- aggressively` is not "exactly" equivalent to putting everything in the same module. O1 with everything in the same module finishes in 8ms while with the combination of these two finishes in 4ms. So they do something more. I guess the added effect is that they make everything INLINEABLE.
Yep, forgot that bit. That's exactly what I use the two options for: to be able to split things among modules and to avoid INLINEABLE for every polymorphic function. With this, I only ever need an occasional INLINE in random places, but then it's not for specialization, but real inlining.
When everything is in the same module and `toList` marked NOINLINE then it takes 14ms (i.e. the worst case) irrespective of the monoid functions being marked INLINE or not.
And what if they are marked NOINLINE? In any case, that means we now have an example of failed fusion that fits in one module. And additionally, we know that GHC can effectively generate such an example from innocently looking set of modules, by automatically inlining too much (or not enough). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): Thanks Matthew, that post is a good reference and it confirmed a few things I was guessing about. Some of this can perhaps go into the GHC manual. Mikolaj, there is no difference when the Monoid instance functions are also marked NOINLINE in addition to `toList`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): A similar issue has been reported in this stack overflow question as well: https://stackoverflow.com/questions/46296919/haskell-webframeworks-speed- ghci-vs-compiled . There is an appalling difference in the ghc compiled code vs ghci code, in case of snap it is a 50x difference! That sounds unacceptable whatever the reason maybe. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): This is odd. I can't reproduce what you are seeing. I get this, measuring allocation {{{ HEAD, separate modules, but with {-# INLINABLE #-} on toList -O0 496M -O1 160M -O1 -fno-pre-inlining 312M HEAD, One module -O0 496M -O1 160M -O1 -fno-pre-inlining 312M GHC-8.2-branch, one module -O0 496M -O1 160M -O1 -fno-pre-inlining 312M }}} Here's the code I'm using (I had to add `instance Semigroup (List a)`: {{{ module Main where import Control.Monad (liftM) import Data.Semigroup (Semigroup(..)) -- import T14208a ------------------------------- data List a = Stop | Yield a (List a) instance Monoid (List a) where mempty = Stop mappend x y = case x of Stop -> y Yield a r -> Yield a (mappend r y) instance Semigroup (List a) where (<>) = mappend toList :: Monad m => List a -> m [a] toList m = case m of Stop -> return [] Yield a r -> liftM (a :) (toList r) -------------------------------- len :: IO Int len = do xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop) [1..1000000 :: Int]) return (length xs) main = do { x <- len; print x } }}} With "two files" I pushed the stuff between the lines into a separate file. I did an an INLINABLE pragma for `toList`. And I multiplied the limit count by 10. I did not use Criterion (less to depend on). Can you say more precisely how to reproduce the problem? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sibi): * cc: sibi (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): It seems, in my repo the cabal file is building the library without any optimization flag and in the different benchmark runs the flag is being changed only on the benchmark module and not on the library. So there is a mixup of optimization flags. Here is the new matrix taking that into account: {{{ Main.hs List.hs INLINE Default ---------------------------------------------------- Identical Flags ---------------------------------------------------- -O1 -O1 4.6 ms 14.2 ms -O0 -O0 14.2 ms 14.2 ms -fno-pre-inlining -fno-pre-inlining 4.6 ms 9.9 ms ---------------------------------------------------- Mixed Flags ---------------------------------------------------- -fno-pre-inlining -O1 4.6 ms 8.8 ms -O0 -O1 8.8 ms 8.8 ms ---------------------------------------------------- runghc ---------------------------------------------------- runghc -O0 5.2 ms 5.2 ms runghc -O1 4.7 ms 4.7 ms }}} Observations: 1. When `toList` is INLINEd the results are more or less expected. Simon, what you are seeing is the INLINE column with identical flags. 2. In the default case (no pragmas are used) `-fno-pre-inlining` does better than `-O1` and runghc seems to be doing well irrespective of the flag used to build the library (i.e. List.hs). Does it mean that `-O1` can also do better in this case? 3. Mixing up the optimization flags brings one more variable in the picture. I would like to ignore those cases. What does GHC recommend? If this is not recommended, is there a way to warn the user when the flags are mixed up? If not, will it be possible to implement something like that? 4. In my original package I am still seeing `-O0` as well as `runghc` doing much better than `-O2`, even when using INLINE pragmas and identical optimization options for all code. I guess, I need to work again to get a simplified example keeping these in mind. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by lelf): * cc: lelf (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * owner: (none) => osa1 Comment: Omer will look. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): I can somewhat reproduce this with HEAD. I'm currently focusing on the compiled code issues, ignoring GHCi. My setup: I have two files Main.hs: {{{#!haskell {-# LANGUAGE CPP #-} module Main where import Criterion.Main (defaultMain, bench, nfIO) -- Uncomment this to have all the code in one module -- #define SINGLE_MODULE #ifndef SINGLE_MODULE import List #else import Control.Monad (liftM) data List a = Stop | Yield a (List a) instance Semigroup (List a) where x <> y = case x of Stop -> y Yield a r -> Yield a (mappend r y) instance Monoid (List a) where -- {-# INLINE mempty #-} mempty = Stop -- {-# INLINE mappend #-} mappend = (<>) -- {-# NOINLINE toList #-} toList :: Monad m => List a -> m [a] toList m = case m of Stop -> return [] Yield a r -> liftM (a :) (toList r) #endif {-# NOINLINE len #-} len :: IO Int len = do xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop) [1..100000 :: Int]) return (length xs) main :: IO () main = defaultMain [ bench "len" $ nfIO len ] }}} When I'm measuring allocations I remove criterion imports and use this main: {{{ main = len >>= print }}} Note that I have a `NOINLINE` on `len` to avoid optimising it in the benchmark site. The original report does not have this. List.hs: {{{#!haskell module List where import Control.Monad (liftM) data List a = Stop | Yield a (List a) instance Semigroup (List a) where x <> y = case x of Stop -> y Yield a r -> Yield a (mappend r y) instance Monoid (List a) where mempty = Stop mappend = (<>) toList :: Monad m => List a -> m [a] toList m = case m of Stop -> return [] Yield a r -> liftM (a :) (toList r) }}} I have three configurations: - -O0 - -O1 - -O2 - -O0 -DSINGLE_MODULE - -O1 -DSINGLE_MODULE - -O2 -DSINGLE_MODULE I first run all these with `+RTS -s` using `main = len >>= print` as the main function. {{{ ============ -O0 =============================================================== 49,723,096 bytes allocated in the heap 25,729,264 bytes copied during GC 6,576,744 bytes maximum residency (5 sample(s)) 29,152 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 41 colls, 0 par 0.011s 0.011s 0.0003s 0.0008s Gen 1 5 colls, 0 par 0.010s 0.010s 0.0020s 0.0047s INIT time 0.000s ( 0.000s elapsed) MUT time 0.011s ( 0.012s elapsed) GC time 0.021s ( 0.021s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.032s ( 0.033s elapsed) %GC time 64.0% (63.8% elapsed) Alloc rate 4,366,732,069 bytes per MUT second Productivity 35.6% of total user, 35.9% of total elapsed ============ -O1 =============================================================== 28,922,528 bytes allocated in the heap 18,195,344 bytes copied during GC 4,066,200 bytes maximum residency (5 sample(s)) 562,280 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 22 colls, 0 par 0.008s 0.008s 0.0004s 0.0016s Gen 1 5 colls, 0 par 0.008s 0.008s 0.0016s 0.0029s INIT time 0.000s ( 0.000s elapsed) MUT time 0.009s ( 0.009s elapsed) GC time 0.016s ( 0.016s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.025s ( 0.025s elapsed) %GC time 63.8% (63.9% elapsed) Alloc rate 3,262,174,222 bytes per MUT second Productivity 35.3% of total user, 35.3% of total elapsed ============ -O2 =============================================================== 28,922,528 bytes allocated in the heap 18,195,344 bytes copied during GC 4,066,200 bytes maximum residency (5 sample(s)) 562,280 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 22 colls, 0 par 0.008s 0.008s 0.0003s 0.0008s Gen 1 5 colls, 0 par 0.008s 0.008s 0.0017s 0.0029s INIT time 0.000s ( 0.000s elapsed) MUT time 0.008s ( 0.008s elapsed) GC time 0.016s ( 0.016s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.024s ( 0.024s elapsed) %GC time 66.6% (66.6% elapsed) Alloc rate 3,714,684,268 bytes per MUT second Productivity 32.7% of total user, 32.7% of total elapsed ============ -O0 -DSINGLE_MODULE =============================================== 49,723,032 bytes allocated in the heap 25,729,184 bytes copied during GC 6,576,728 bytes maximum residency (5 sample(s)) 29,152 bytes maximum slop 13 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 41 colls, 0 par 0.010s 0.010s 0.0003s 0.0008s Gen 1 5 colls, 0 par 0.010s 0.010s 0.0019s 0.0042s INIT time 0.000s ( 0.000s elapsed) MUT time 0.011s ( 0.011s elapsed) GC time 0.020s ( 0.020s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.031s ( 0.031s elapsed) %GC time 65.0% (65.0% elapsed) Alloc rate 4,609,752,610 bytes per MUT second Productivity 34.8% of total user, 34.8% of total elapsed ============ -O1 -DSINGLE_MODULE =============================================== 16,122,496 bytes allocated in the heap 7,392,664 bytes copied during GC 3,438,424 bytes maximum residency (4 sample(s)) 55,464 bytes maximum slop 7 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 10 colls, 0 par 0.004s 0.004s 0.0004s 0.0008s Gen 1 4 colls, 0 par 0.005s 0.005s 0.0012s 0.0019s INIT time 0.000s ( 0.000s elapsed) MUT time 0.004s ( 0.004s elapsed) GC time 0.009s ( 0.009s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.014s ( 0.014s elapsed) %GC time 66.5% (66.6% elapsed) Alloc rate 3,663,260,346 bytes per MUT second Productivity 32.5% of total user, 32.5% of total elapsed ============ -O2 -DSINGLE_MODULE =============================================== 13,722,496 bytes allocated in the heap 6,798,640 bytes copied during GC 2,158,376 bytes maximum residency (3 sample(s)) 33,248 bytes maximum slop 7 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 9 colls, 0 par 0.007s 0.007s 0.0008s 0.0021s Gen 1 3 colls, 0 par 0.004s 0.005s 0.0015s 0.0030s INIT time 0.000s ( 0.000s elapsed) MUT time 0.004s ( 0.004s elapsed) GC time 0.012s ( 0.012s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.016s ( 0.016s elapsed) %GC time 74.2% (74.3% elapsed) Alloc rate 3,479,572,009 bytes per MUT second Productivity 25.2% of total user, 25.2% of total elapsed }}} Summary: allocations consistently reduce as optimisation level increases. Secondly I run criterion benchmark to measure runtime, using the same configurations: {{{ ============ -O0 =============================================================== benchmarking len time 13.50 ms (13.23 ms .. 13.71 ms) 0.998 R² (0.997 R² .. 0.999 R²) mean 13.55 ms (13.35 ms .. 13.81 ms) std dev 613.5 μs (424.7 μs .. 918.2 μs) variance introduced by outliers: 18% (moderately inflated) ============ -O1 =============================================================== benchmarking len time 15.83 ms (15.62 ms .. 16.02 ms) 0.999 R² (0.998 R² .. 0.999 R²) mean 15.92 ms (15.75 ms .. 16.10 ms) std dev 463.5 μs (340.2 μs .. 669.1 μs) ============ -O2 =============================================================== benchmarking len time 15.70 ms (15.51 ms .. 15.90 ms) 0.999 R² (0.999 R² .. 1.000 R²) mean 15.74 ms (15.59 ms .. 15.87 ms) std dev 355.2 μs (271.2 μs .. 470.7 μs) ============ -O0 -DSINGLE_MODULE =============================================== benchmarking len time 14.85 ms (13.81 ms .. 16.06 ms) 0.976 R² (0.959 R² .. 0.997 R²) mean 13.60 ms (13.22 ms .. 14.14 ms) std dev 1.152 ms (773.1 μs .. 1.614 ms) variance introduced by outliers: 41% (moderately inflated) ============ -O1 -DSINGLE_MODULE =============================================== benchmarking len time 6.802 ms (6.702 ms .. 6.922 ms) 0.997 R² (0.994 R² .. 0.999 R²) mean 6.845 ms (6.765 ms .. 6.945 ms) std dev 261.8 μs (201.3 μs .. 336.8 μs) variance introduced by outliers: 18% (moderately inflated) ============ -O2 -DSINGLE_MODULE =============================================== benchmarking len time 6.614 ms (6.501 ms .. 6.712 ms) 0.998 R² (0.997 R² .. 0.999 R²) mean 6.399 ms (6.317 ms .. 6.472 ms) std dev 239.1 μs (201.7 μs .. 292.5 μs) variance introduced by outliers: 18% (moderately inflated) }}} So; - Everything works as expected in single module case. Both runtime and allocations get lower as optimisation level increases. - In multi-module -O1 and -O2 produce identical outputs, runtime difference is just noise. - In multi-module we get better allocations with -O1 vs. -O0, but runtime gets somewhat worse. This is what we should investigate. To see why we allocate less in multi-module with -O1 I compared the STG outputs (multi-module -O0 vs. multi-module -O1), the answer is fusion kicking in with -O1. We have an intermediate function application for `foldr mappend mempty` in -O0 output which disappears with -O1. Why does the runtime get worse? I don't know but I suspect it's just noise. Really the code is better (as in, it does less work) with -O1 than with -O0. I also compared single-module -O1 with multi-module -O1, the reason why single module is better is becuase the `toList` function is not inlined cross-module but it's inlined within the module. So I think in the compiled case there are no problems. Only remaining question is why GHCi is faster than compiled code. I've attached a tarball with my setup + outputs. It includes Core/STG outputs of all 6 configurations and criterion and +RTS -s outputs as well. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * Attachment "t14208_.tar.xz" added. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
So I think in the compiled case there are no problems.
OK good; that's reassuring. Do you know why the single-module case gets better? I suspect it may be that `toList` is specialised. If you add `{-# INLINABLE toList #-}` does the difference go away? Perhaps this isn't a big deal -- it's reasonable for single module to be faster -- but GHC does make real efforts NOT to penalise you for multi- module, so I'm curious.
Only remaining question is why GHCi is faster than compiled code.
Can you reproduce this difference? It is indeed puzzling! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Do you know why the single-module case gets better? I suspect it may be
#14208: Performance with O0 is much better than the default or with -O2, runghc
performs the best
-------------------------------------+-------------------------------------
Reporter: harendra | Owner: osa1
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 8.2.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by osa1):
that toList is specialised.
Yes, as also said in my previous comment, the reason is `toList` being
specialised only when it's in the same module.
Multi-module
{{{
Main.len1
:: GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=1, Str=, Unf=OtherCon []] =
[] \r [s_s5yc]
case Main.len_go 1# of sat_s5yd {
__DEFAULT ->
case List.toList GHC.Base.$fMonadIO sat_s5yd s_s5yc of {
(#,#) ipv_s5yf [Occ=Once] ipv1_s5yg [Occ=Once] ->
let {
sat_s5yi [Occ=Once] :: GHC.Types.Int
[LclId] =
[ipv1_s5yg] \u []
case GHC.List.$wlenAcc ipv1_s5yg 0# of
ww2_s5yh {
__DEFAULT -> GHC.Types.I# [ww2_s5yh];
};
} in (#,#) [ipv_s5yf sat_s5yi];
};
};
-- toList in another module
List.toList [Occ=LoopBreaker]
:: forall (m :: * -> *) a. GHC.Base.Monad m => List.List a -> m [a]
[GblId,
Arity=2,
Caf=NoCafRefs,
Str=,
Unf=OtherCon []] =
[] \r [$dMonad_s36Z m1_s370]
case m1_s370 of {
List.Stop -> GHC.Base.return $dMonad_s36Z GHC.Types.[];
List.Yield a1_s372 [Occ=OnceL] r_s373 [Occ=Once] ->
let {
sat_s377 [Occ=Once] :: [a_a1fh] -> m_a1fg [a_a1fh]
[LclId] =
[$dMonad_s36Z a1_s372] \r [x1_s375]
let {
sat_s376 [Occ=Once] :: [a_a1fh]
[LclId] =
CCCS :! [a1_s372 x1_s375];
} in GHC.Base.return $dMonad_s36Z sat_s376; } in
let {
sat_s374 [Occ=Once] :: m_a1fg [a_a1fh]
[LclId] =
[$dMonad_s36Z r_s373] \u [] List.toList $dMonad_s36Z
r_s373;
} in GHC.Base.>>= $dMonad_s36Z sat_s374 sat_s377;
};
}}}
Single module:
{{{
Main.len1
:: GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=1, Caf=NoCafRefs, Str=, Unf=OtherCon []] =
[] \r [s_s5uU]
case Main.len_go 1# of sat_s5uV {
__DEFAULT ->
case Main.len2 sat_s5uV s_s5uU of {
(#,#) ipv_s5uX [Occ=Once] ipv1_s5uY [Occ=Once] ->
let {
sat_s5v0 [Occ=Once] :: GHC.Types.Int
[LclId] =
[ipv1_s5uY] \u []
case GHC.List.$wlenAcc ipv1_s5uY 0# of
ww2_s5uZ {
__DEFAULT -> GHC.Types.I# [ww2_s5uZ];
};
} in (#,#) [ipv_s5uX sat_s5v0];
};
};
Main.len2 [Occ=LoopBreaker]
:: forall a.
Main.List a
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, [a] #)
[GblId,
Arity=2,
Caf=NoCafRefs,
Str=,
Unf=OtherCon []] =
[] \r [m_s5uw eta_s5ux]
case m_s5uw of {
Main.Stop -> (#,#) [eta_s5ux GHC.Types.[]];
Main.Yield a1_s5uz [Occ=Once] r_s5uA [Occ=Once] ->
case Main.len2 r_s5uA eta_s5ux of {
(#,#) ipv_s5uC [Occ=Once] ipv1_s5uD [Occ=Once] ->
let {
sat_s5uE [Occ=Once] :: [a_X1Bw]
[LclId] =
CCCS :! [a1_s5uz ipv1_s5uD];
} in (#,#) [ipv_s5uC sat_s5uE];
};
};
}}}
If you add {-# INLINABLE toList #-} does the difference go away?
With `INLINE toList` or `INLINABLE toList` multi-module and single-module become identical with -O1 and -O2. (there are some small changes in the STG outputs but nothing that changes runtime or allocations)
Can you reproduce this difference? It is indeed puzzling!
I can, by running the criterion benchmark in GHCi: {{{ $ ghc-stage2 --interactive Main.hs GHCi, version 8.5.20180322: http://www.haskell.org/ghc/ :? for help Loaded GHCi configuration from /home/omer/rcbackup/.ghci [1 of 2] Compiling List ( List.hs, interpreted ) [2 of 2] Compiling Main ( Main.hs, interpreted ) Ok, two modules loaded. λ:1> main benchmarking len time 10.79 ms (10.68 ms .. 10.98 ms) 0.993 R² (0.979 R² .. 1.000 R²) mean 10.66 ms (10.54 ms .. 10.98 ms) std dev 480.8 μs (147.7 μs .. 952.1 μs) variance introduced by outliers: 20% (moderately inflated) λ:2> Leaving GHCi. $ ghc-stage2 --interactive Main.hs -DSINGLE_MODULE GHCi, version 8.5.20180322: http://www.haskell.org/ghc/ :? for help Loaded GHCi configuration from /home/omer/rcbackup/.ghci [1 of 1] Compiling Main ( Main.hs, interpreted ) Ok, one module loaded. λ:1> main benchmarking len time 11.30 ms (11.20 ms .. 11.42 ms) 0.999 R² (0.998 R² .. 0.999 R²) mean 10.77 ms (10.64 ms .. 10.90 ms) std dev 346.5 μs (317.7 μs .. 398.6 μs) }}} `-O1` and `-O2` with `-DSINGLE_MODULE` is faster than GHCi, but otherwise GHCi is faster than the other 4 configurations. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): Just updated the previous comment: `toList` is never inlined, but when it's in the same module as the using code or marked as `INLINABLE` it gets specialized. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): In case you need another data point, my original streaming library that made me file this issue still exhibits the same behavior. GHCi is 6x faster than my regular compiled code. I tried even compiling everything including all dependencies with exactly the same optimization flags to make sure there is no funny business due to mixing of opt flags. You can see the behavior in the tree available on github here: https://github.com/composewell/streamly/tree/199e20dd4b62ac2dafea0a40dc2ce3d... You can clone the repo and run the experiment like this: {{{ $ stack bench benchmarked streaming ops time 34.39 ms (32.99 ms .. 35.67 ms) 0.995 R² (0.991 R² .. 0.998 R²) mean 33.97 ms (33.24 ms .. 35.43 ms) $ stack runghc benchmark/Main.hs benchmarked streaming ops time 6.215 ms (5.684 ms .. 6.860 ms) 0.945 R² (0.896 R² .. 0.978 R²) mean 6.610 ms (6.333 ms .. 6.991 ms) }}} If I change the optimization flags to -O0 for benchmark stanza in cabal file I can get close to ghci performance. The code that I am benchmarking is like this: {{{ {-# INLINE streamlyOp #-} streamlyOp :: IO Int streamlyOp = do xs <- S.toList $ S.serially $ S.each [1..100000 :: Int] & fmap (+1) & fmap (+1) & fmap (+1) & fmap (+1) return (Prelude.length xs) }}} It seems the problem is with the `fmap` operation (I may be wrong), it is 6 times slower in case of GHC, and every other fmap I add, the benchmark timings increase but the ratio remains the same. I tried using an INLINE on fmap, I also tried to SPECIALIZE it to IO and INT type but no change. The `fmap` op is defined in `src/Streamly/Streams.hs` file like this: {{{ instance Monad m => Functor (StreamT m) where fmap f (StreamT (Stream m)) = StreamT $ Stream $ \_ stp yld -> let yield a Nothing = yld (f a) Nothing yield a (Just r) = yld (f a) (Just (getStreamT (fmap f (StreamT r)))) in m Nothing stp yield }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): `fmap` in that module doesn't have an `INLINE` pragma on it? Should it? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): I tried using INLINE, and I tried SPECIALIZE as well but it made no difference. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I can reproduce this when all the dependencies are installed with `ghc benchmarks/Main.hs -isrc/ -O2` which is slow and `ghc benchmarks/Main.hs -isrc`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I noticed two things from a brief look. 1. `fmap` is not SATed. Which means the static `f` won't be eliminated. 2. A bad(?) loopbreaker is chosen in `Prelude.foldrM`, `yield` rather than `go`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): Another possible answer is that your library has a lot of recursive functions in it and the base types are written in CPS which means things don't optimise too well. Again, this is not an answer as to why the optimiser makes the program slower. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by harendra): @mpickering is there something to read about how things do and don't get optimized with CPS especially with GHC? What are the pitfalls and why it does not optimize well? I am very much interested in this, any pointers will be appreciated. I was using the direct style before and I had some trouble with it, I had to use a lot of SPECIALIZE to IO types to extract decent performance wherever I was using monad polymorphic instances. The performance did not seem to be much better compared to CPS. Maybe I am wrongly attributing the problem to CPS-vs-Direct style and it might have been something stupid that I did, at that time. But I never had any similar problem in CPS style and then I never went back to direct. Though I want to try and see if direct would be any better but it will take some time and effort to do that. However, there is some proof that direct style is not very much different in that the "streaming" library performance is more or less the same as this library and "streaming" is actually more or less the same thing but written in direct style. See the performance comparison between "streaming" and "streamly" here: https://github.com/composewell/streaming- benchmarks/blob/master/charts-1/CheaperOperations.svg -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
If I change the optimization flags to -O0 for benchmark stanza in cabal file I can get close to ghci performance.
That contradicts what Omer found in comment:27. Nevertheless, if what you say is true, it'd be easier to debug with -O0 than GHCi (which brings the bytecode generator into the picture).
GHCi is 6x faster than my regular compiled code
This is totally bonkers and we MUST find out what is happening :-). I suggest not getting diverted into speculation about CPS. We have a repro case; let's just dig into it and find out what is going on. My suggestions * In comment:31 Does the same thing happen with -O0 vs -O, or only with GHCi vs -O? * In all repros, do the huge differences also show up in the bytes- allocated numbers? (If so, we don't need the Criterion apparatus.) * I notice that in comment:27, in the 2-module case, comparing -O0 and -O1: * Allocation is about halved in -O1 * But runtime actually increases That is most peculiar. * Matthew says in comment:34 "I can reproduce this..". That's great. But what is "this" precisely? Which version of GHC? What timing data? What happened to allocation and GC numbers? Somehow a 6x increase in execution time ought not to be hard to find! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:38 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14208: Performance with O0 is much better than the default or with -O2, runghc performs the best -------------------------------------+------------------------------------- Reporter: harendra | Owner: osa1 Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I cloned the repo - https://github.com/composewell/streamly/tree/199e20dd4b62ac2dafea0a40dc2ce3d... Checked out commit - 199e20dd4b62ac2dafea0a40dc2ce3d97c307542 I could build all the dependencies with ghc-8.2.2. Then I ran in in the root of the repo {{{ ghc benchmarks/Main.hs -isrc/ -O2 }}} and then ran the executable. The benchmark numbers were slower than doing the same without `-O2`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14208#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC