[GHC] #8763: forM_ [1..N] does not get fused (10 times slower than go function)

#8763: forM_ [1..N] does not get fused (10 times slower than go function) ------------------------------+-------------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime performance bug Unknown/Multiple | Test Case: Difficulty: Unknown | Blocking: Blocked By: | Related Tickets: | ------------------------------+-------------------------------------------- Apparently idiomatic code like {{{forM_ [1.._N]}}} does not get fused away. This can give serious performance problems when unnoticed. {{{ -- Slow: forM_ [0.._N-1] $ \i -> do ... -- Around 10 times faster: loop _N $ \i -> do ... {-# INLINE loop #-} loop :: (Monad m) => Int -> (Int -> m ()) -> m () loop bex f = go 0 where go !n | n == bex = return () | otherwise = f n >> go (n+1) }}} Full code example: https://gist.github.com/nh2/8905997 - the relevant alternatives are commented. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): People on #ghc think that it is because the {{{[1.._N]}}} gets floated out as a top-level constant expression: ---- '''nomeata:''' nh2: I’d consider that a bug '''nomeata:''' I believe the problem is that [0..512] does not depend on any local values '''nomeata:''' so it is floated out as a top-level value '''nomeata:''' and there it is not matched by any rules '''thoughtpolice:''' let floating strikes again '''thoughtpolice:''' (well, floating-out being a non-optimization, anyway.) '''Fuuzetsu:''' does this mean that if I use [0 .. 512] twice in a program in different places, it will only be computed once? '''hvr:''' Fuuzetsu: there's a chance, yes '''Fuuzetsu:''' neat '''thoughtpolice:''' well, not so neat. in cases like this you really don't want to float out some things, because it hurts later opportunities to optimize sometimes (e.g. float out a binding that otherwise would have triggered a RULE or fusion, perhaps) '''thoughtpolice:''' unfortunately floating like this is one of the harder things to 'fight against' when you don't want it, from what i've seen. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by darchon): I believe let-floating happens due to the full-laziness transform. Do you get better performance with -fno-full-laziness ? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nomeata): A comment in `SetLevels` (which I just came across) in the code indicates that this problem should have been taken care of: {{{ -- We are keen to float something to the top level, even if it does not -- escape a lambda, because then it needs no allocation. But it's controlled -- by a flag, because doing this too early loses opportunities for RULES -- which (needless to say) are important in some nofib programs -- (gcd is an example). }}} So either my assumption is wrong, or this does not work as desired. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nomeata): It turns out that this comment is obsolete; the flag is never set. I quote from `SimplCore` {{{ -- Was: gentleFloatOutSwitches -- -- I have no idea why, but not floating constants to -- top level is very bad in some cases. -- -- Notably: p_ident in spectral/rewrite -- Changing from "gentle" to "constantsOnly" -- improved rewrite's allocation by 19%, and -- made 0.0% difference to any other nofib -- benchmark }}} This comment was introduced in eaeca51efc0be3ff865c4530137bfbe9f8553549 (2009) by SPJ. Maybe rules matching should look though unfoldings more easily (at the risk of losing sharing)? There is no point in worrying about sharing `[0..N]` in a rule application whose purpose is to eliminate that list. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): @nomeata Regarding your suspicion that it gets floated out as a constant, I don't see an improvement when getting the upper bound m of `[1..m]` from an IO action. What do you think? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nomeata): It might still get floated out of some local function. Have you tried coming up with a minimal example? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): @nomeata There is an example I made for this, mentioned in the bug description. The performance I measure for that is: * using `forM_` with `ghc -O`: 2.0 s * using `loop ` with `ghc -O`: 1.6 s * using `forM_` with `ghc -O2`: 0.9 s * using `loop ` with `ghc -O2`: 0.3 s * using `forM_` with `ghc -O2 -fllvm`: 0.75 s * using `loop ` with `ghc -O2 -fllvm`: 0.15 s I tried to make an even smaller benchmark (https://gist.github.com/nh2/11333427) but the performance is identical there although the same thing changes as before. Could you try my two benchmarks and see if you get the same behaviour? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between `forM_` and `loop` is big) and the simple example (where no difference can be measured). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by daniel.is.fischer): Replying to [comment:8 nh2]:
I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between `forM_` and `loop` is big) and the simple example (where no difference can be measured).
The simple example doesn't use the same list in different places, so GHC is capable of eliminating it and giving you a loop on unboxed `Int#`s, at least with `-O2`. In the matmult example, you need to conceal the fact that both lists are the same from GHC to get a loop on unboxed `Int#`s. So in principle, GHC can do the desired thing, just the sharing gets in the way. Can somebody think of a situation where sharing is beneficial for `forM_ [a .. b] $ \n -> do ...` code? If not, perhaps special-casing `enumFromTo` arguments for `forM_` etc. is, at least for standard integer types, something to be considered. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): Preventing it from sharing sounds sensible for me: If the `[a .. b]` was something expensive to compute (a list of primes, say), I suspect any sane person would easily share it manually by declaring it top-level. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): Replying to [comment:9 daniel.is.fischer]:
The simple example doesn't use the same list in different places
Unfortunately, I still get no difference in performance even if I duplicate the loops to forM_ [1.._MAX] $ \i -> do UM.unsafeWrite v 0 i forM_ [1.._MAX] $ \i -> do UM.unsafeWrite v 0 i Any idea? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by pivo): * cc: pivo@… (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonpj): I've got lost in this thread. If someone thinks there is a real bug here, in GHC 7.8, could you re-articulate what you think it is? Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by nh2): @simonpj: Summary: 1) I reported that my manually written loop is much faster than `forM_ [1..n]` in some cases, suggesting that in some cases optimizing the list away doesn't work well. 2) nomeata said some technical things that are a bit beyond me. 3) I submit two benchmarks in the gist at https://gist.github.com/nh2/11333427, a "matmult" benchmark where there is a big difference between `forM_` and the hand-written `loop`, and a "simple" benchmark where they are equally fast. 4) Daniel suspects the slow case comes from using the same syntactical list twice, and that in this case GHC floats it out to share it, which breaks eliminating it. He suggests we might special-case `enumFromTo` when used with `forM_` to prevent it. 5) I give a counter example for his suspicion, by changing my "simple" benchmark, where using the same list twice gives the same good performance as using it once. I get the same behaviour for 7.6 and 7.8. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by daniel.is.fischer): Slight correction, @Niklas, it's not a suspicion that it's the floating out of the list to the top-level and consequently the use of the list for looping instead of unboxed `Int#`s, that is direct from the core (`-ddump- simpl` is your friend in all matters related to performance). The question is under which exact circumstances GHC floats the list out. To answer that, you need somebody knowing how GHC works. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonpj): I've tried with 7.8.2. I get the same performance for `matmultForM_` and `matMultLoop`. I'm not using criterion; just using `+RTS -s`, with `_SIZE = 300`. So I can't reproduce the problem. Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) --------------------------------------------+------------------------------ Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by daniel.is.fischer): Replying to [comment:16 simonpj]:
I've tried with 7.8.2. I get the same performance for `matmultForM_` and `matMultLoop`.
Are you compiling with `-O`? You need `-O2` for a significant difference to appear (with only `-O`, both versions are slow). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by George): With 7.10.1 I see a factor of 2 difference in performance: benchmarking matmultForM_ time 10.90 μs (10.89 μs .. 10.91 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 10.89 μs (10.89 μs .. 10.91 μs) std dev 32.72 ns (18.98 ns .. 65.42 ns) benchmarking matmultLoop time 5.404 μs (5.387 μs .. 5.419 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 5.409 μs (5.398 μs .. 5.420 μs) std dev 37.99 ns (33.64 ns .. 44.26 ns) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by George): * milestone: => 7.12.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by George): * cc: george (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): seeing same issue and performance on 7.10.3. This is on MacOS but I believe the problem occurs on multiple platforms -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.0.2 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by George): * milestone: => 8.0.2 Comment: still seeing the same problem on 8.0.1. Any chance of this getting fixed in 8.0.2? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * milestone: 8.0.2 => 8.2.1 Comment: I'm afraid 8.0.2 is out of the question. However, 8.2.1 is a possibility. Of course, any help that you, the reader, can offer would expedite the process. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by akio): * cc: akio (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by akio): * Attachment "rep.hs" added. Test case -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by akio): Here is a test case that does not require criterion. With `ghc-8.0.1 -O2 -ddump-simpl`, you can see that `foo` is compiled into code that uses a top-level CAF containing `GHC.Enum.eftInt 0# 799#`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: @@ -6,1 +6,1 @@ - {{{ + {{{#!hs @@ -16,3 +16,3 @@ - where - go !n | n == bex = return () - | otherwise = f n >> go (n+1) + where + go !n | n == bex = return () + | otherwise = f n >> go (n+1) New description: Apparently idiomatic code like {{{forM_ [1.._N]}}} does not get fused away. This can give serious performance problems when unnoticed. {{{#!hs -- Slow: forM_ [0.._N-1] $ \i -> do ... -- Around 10 times faster: loop _N $ \i -> do ... {-# INLINE loop #-} loop :: (Monad m) => Int -> (Int -> m ()) -> m () loop bex f = go 0 where go !n | n == bex = return () | otherwise = f n >> go (n+1) }}} Full code example: https://gist.github.com/nh2/8905997 - the relevant alternatives are commented. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by akio): It looks like this is the same issue as described in #7206. The branch mentioned in https://ghc.haskell.org/trac/ghc/ticket/7206/#comment:13 fixes my test case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Good to know that's it. The program in rep.hs is rather unusual because it has a ''constant'' list, not dependent on input data. Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks. It would be great if someone felt able to dig into #7206 and figure out why it's not a straight win Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): Replying to [comment:30 simonpj]:
Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.
Just wanted to chime in that I found this problem in a real-world program, where I looped over all 32-bit integers. Similarly, these known-at- compile-time iteration lists may appear in inner loops of algorithms or data structures (like B-trees with fixed B, or image convolution kernels with fixed size, e.g. 3x3 image templates). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * related: => #7206 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): I confirmed that #7206 fixes this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * status: new => infoneeded Comment: Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): If the problem is gone in 8.2.2, do we know what commit fixed it? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by akio): I haven't confirmed, but it looks like 2effe18ab51d66474724d38b20e49cc1b8738f60 may have fixed this. Now fusion happens during the "gentle" phase of simplifier, which happens before any of the float-out passes. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:38 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes
#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): Replying to [comment:36 osa1]: this. Perhaps this can be closed or we may need a new reproducer. I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1):
I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?
What is the first argument of `forM_` in you program? This ticket (and #7206) is specifically about `forM_`/`mapM_` when the foldable argument is of form `[n .. m]`. If your program is different than perhaps a new ticket would be better. In any case, it's not a big deal if you paste your program here, we can move it to a new ticket if it turns out to be something different than what we try to improve here. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:40 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by George): * Attachment "fuse.hs" added. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by George): * cc: george (removed) * cc: George (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:41 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: infoneeded Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): Attached file fuse.hs , similar to the first file but with nested forM_, both versions run in the same time but the forM_ version allocates about 50% more {{{ ghc -O2 fuse.hs +RTS [1 of 1] Compiling Main ( fuse.hs, fuse.o ) Linking fuse ... bash-3.2$ ./fuse 1 +RTS -s 486341683267690 320,057,352 bytes allocated in the heap 8,232 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 29,152 bytes maximum slop 308 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.025s 0.0254s 0.0254s INIT time 0.000s ( 0.003s elapsed) MUT time 5.432s ( 5.634s elapsed) GC time 0.000s ( 0.025s elapsed) EXIT time 0.000s ( 0.005s elapsed) Total time 5.433s ( 5.667s elapsed) %GC time 0.0% (0.4% elapsed) Alloc rate 58,918,474 bytes per MUT second Productivity 100.0% of total user, 99.5% of total elapsed bash-3.2$ ./fuse 2 +RTS -s 486341683267690 560,057,328 bytes allocated in the heap 15,992 bytes copied during GC 320,028,576 bytes maximum residency (2 sample(s)) 868,448 bytes maximum slop 308 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 228 colls, 0 par 0.001s 0.002s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.000s 0.026s 0.0128s 0.0254s INIT time 0.000s ( 0.003s elapsed) MUT time 5.453s ( 5.630s elapsed) GC time 0.002s ( 0.027s elapsed) EXIT time 0.000s ( 0.008s elapsed) Total time 5.455s ( 5.667s elapsed) %GC time 0.0% (0.5% elapsed) Alloc rate 102,698,216 bytes per MUT second Productivity 100.0% of total user, 99.5% of total elapsed }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:42 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by George): * status: infoneeded => new Comment: Hoping my example is the new reproducer. If not this can be set back to infoneeded or closed. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:43 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Here is a smaller example that highlights the problem without vectors. The only difference between the two functions is the use of `[2,3..n]` instead of `[2..n]`, which desugar to different functions. This results in a difference in a huge difference in allocation as well as runtime: {{{ $ ./repro 2 +RTS -s # [2,3..n] () 960,056,856 bytes allocated in the heap 21,536 bytes copied during GC 44,576 bytes maximum residency (2 sample(s)) 29,152 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 918 colls, 0 par 0.005s 0.003s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s INIT time 0.000s ( 0.000s elapsed) MUT time 0.123s ( 0.125s elapsed) GC time 0.005s ( 0.003s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.129s ( 0.129s elapsed) %GC time 4.1% (2.5% elapsed) Alloc rate 7,778,808,106 bytes per MUT second Productivity 95.8% of total user, 97.4% of total elapsed }}} {{{ $ ./repro 1 +RTS -s # [2..n] () 56,872 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s INIT time 0.000s ( 0.000s elapsed) MUT time 0.048s ( 0.048s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.048s ( 0.048s elapsed) %GC time 0.2% (0.2% elapsed) Alloc rate 1,188,432 bytes per MUT second Productivity 99.6% of total user, 99.6% of total elapsed }}} This happens in `ST`, but not in `IO`, so probably related to some hack. Also the difference vanishes when we allow the functions to inline. Here's some Core for `g` (the offending function): {{{ -- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13} $wg $wg = \ @ s ww w -> let { $wc = <huge> } in case <# ww 3# of { __DEFAULT -> let { y' y' = -# ww 1# } in letrec { go_up go_up = \ x eta -> case ># x y' of { __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` Co:4) eta; 1# -> $wc x (lvl `cast` Co:4) eta }; } in $wc 2# ((go_up 3#) `cast` Co:4) w; 1# -> case <# ww 2# of { __DEFAULT -> $wc 2# (lvl `cast` Co:4) w; 1# -> (# w, () #) } } }}} From my understanding of join points, `$wc` is only nearly a join point, because `go_up` with its transitive tail call to `$wc` appears in argument position. It would be great if we could get rid of this! The `IO` variant (`g 40000000 >>= print`) doesn't have this weakness, it's all join points there. Hence my suspicion about some `IO` hack that let's GHC eta-expand stuff more aggresively, but I'm not sure how that's helping. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "repro.hs" added. Reproduction for comment:44 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): It seems that for `IO`, GHC decides that it's OK to inline `c` from the [https://hackage.haskell.org/package/base-4.11.0.0/docs/src/GHC.Enum.html#efd... fusion helper of enumFromThenTo], but not so for `ST s`. For our case, `c` is the `<huge>` computation (see the worker `$wc` in comment:44) performed for each outer list element and would be duplicated by inlining: It's mentioned thrice in the definition of `efdtIntUpFB`. Consequently, `c` has almost always `Guidance=NEVER`, except in the `IO` case, where it miraculously gets `Guidance=IF_ARGS [20 420 0] 674 0` just when it is inlined. Not sure what this decision is based on. The inlining decision for `eftIntFB` is much easier: `c` [https://hackage.haskell.org/package/base-4.11.0.0/docs/src/GHC.Enum.html#eft... only happens once there]. I'm not sure if `IO` gets special treatment by the inliner, but I see a few ways out: * Do the same hacks for `ST`, if there are any which apply (ugly) * Reduce the number of calls to `c` in the implementation of `efdtIntUpFB`, probably for worse branch prediction * Figure out why the floated out expression of `\x -> (nop x *>)` occuring in `forM_ nop = flip mapM_ nop = foldr ((>>) . nop) (return ())` doesn't get eta-expanded in the `ST` case, whereas the relevant `IO` code is. I hope that by fixing this, the `c` expression inlines again. Here's how it inlines for `IO`: {{{ (>>) . nop = \x -> (nop x >>) = \x -> (nop x *>) -- notice how it's no different than ST up until here = \x -> (thenIO (nop x)) }}} The inliner probably stops here, but because of eta-expansion modulo coercions to `\x k s -> thenIO (nop x) k s`, we can inline [https://hackage.haskell.org/package/base-4.11.0.0/docs/src/GHC.Base.html#the... thenIO]: {{{ \x k s -> thenIO (nop x) y s = \x k s -> case nop x s of (# new_s, _ #) -> k new_s) }}} which is much better and probably more keenly inlined than `\x -> (nop x *>)` in the `ST` case. What makes GHC eta-expand one, but not the other? This is just a wild guess and the only real difference I could make out in diffs. Maybe someone with actual insights into the simplifier can comment on this claim (that the inliner gives up on `c` due to the missed eta- expansion and inlining)? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:45 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sgraf): * cc: sgraf (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:46 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): Your example seems to be different than mine in that when I compile yours with ghc 8.4.1 and -O there is no difference in allocation unlike mine. What flags did you compile with? {{{ ghc -O repro.hs +RTS [1 of 1] Compiling Main ( repro.hs, repro.o ) Linking repro ... bash-3.2$ ./repro 1 +RTS -s () 56,880 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s INIT time 0.000s ( 0.002s elapsed) MUT time 0.051s ( 0.052s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.003s elapsed) Total time 0.051s ( 0.057s elapsed) %GC time 0.3% (0.4% elapsed) Alloc rate 1,118,716 bytes per MUT second Productivity 99.3% of total user, 95.6% of total elapsed bash-3.2$ ./repro 2 +RTS -s () 56,880 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s INIT time 0.000s ( 0.002s elapsed) MUT time 0.051s ( 0.051s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.005s elapsed) Total time 0.051s ( 0.059s elapsed) %GC time 0.3% (0.3% elapsed) Alloc rate 1,120,655 bytes per MUT second Productivity 99.3% of total user, 95.7% of total elapsed }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:47 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Yuck! I was under the impression I used GHC 8.4.1 via a `nix-shell` when in reality I was using another locally installed 8.2.2. So, back to minimization, I guess. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:48 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): It seems I uploaded the variant where I used `IO` instead of `ST`, where things still inline. When you substitute `ST s` for `IO` and use `print $ runST $ ...` instead of `... >>= print`, it should reproduce with 8.4.1. Now here's the funny part: I managed to make this reproduce even for `IO` by duplicating the call to `nop`. So it seems like `c` really just hits the threshold where the inliner gives up. The only solution I can think of is what I described in my second point above: Implement `efdtIntUpFB` in a way that doesn't duplicate `c`. In general we should avoid to call `c` in `build`s more than once because of scenarios like this. Huge `c`s aren't uncommon at all (do blocks in `forM_` bodies, the functions passed as first argument to `foldr`, etc.) and otherwise we can't guarantee that everything inlines. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:49 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): I can't attach the fixed reproduction, so use this gist instead: https://gist.github.com/sgraf812/6089d81fbc95af9c5f817ff9dc417401 I'll see if I can craft a variant of `efdtInt*` that doesn't duplicate `c`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:50 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nomeata):
In general we should avoid to call c in builds more than once because of scenarios like this.
I vaguely having seen a Note about this somewhere, but I cannot find it right now. But yes, a single occurrence of `c` is beneficial. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:51 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): This definition of `efdtIntUpFB` only has a single occurence of `c` and `n` and consequently fixes th issue. But this probably doesn't have the same performance for the non-fused case. {{{ data CounterState = More | Last | End -- Requires x2 >= x1 {-# INLINE [0] efdtIntUpFB #-} -- See Note [Inline FB functions] in GHC.List efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r efdtIntUpFB c n x1 x2 y = -- Be careful about overflow! let !first_state | isTrue# (y <# x2) = if isTrue# (y <# x1) then End else Last | otherwise = More -- Common case: x1 <= x2 <= y !delta = x2 -# x1 -- >= 0 !y' = y -# delta -- x1 <= y' <= y; hence y' is representable next_state End _ = End next_state Last _ = End next_state More x | isTrue# (x ># y') = Last | otherwise = More -- Invariant: x <= y -- Note that: z <= y' => z + delta won't overflow -- so we are guaranteed not to overflow if/when we recurse emit End _ = n emit st x | let x' = x +# delta = I# x `c` emit (next_state st x') x' in emit first_state x1 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:52 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): @nomeata: Ah, right before my nose. Very reassuring. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:53 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): How about this (untested), which seems a lot simpler {{{ efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r efdtIntUpFB c n x1 x2 y -- Be careful about overflow! | isTrue# (y <# x1) = n | otherwise = go_up x1 -- Common case: x1 <= y where !delta = x2 -# x1 -- >= 0 !y' = y -# delta -- x1 <= y' <= y; hence y' is representable -- Invariant: x <= y -- Note that: x <= y' => x + delta won't overflow -- so we are guaranteed not to overflow if/when we recurse go_up x = I# x `c` if isTrue# (x ># y') then n else go_up (x +# delta) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:54 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): I just ran some quick `quickCheck $ withMaxSuccess 100000 $ \i j k -> take 1000 [i,j..k] == take 1000 (efdtInt (unI# i) (unI# j) (unI# k))` tests and both versions pass. Given that simonpj's is much more to the point, let's run with that one! Although the duplicate `n` has potential to cause pain... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:55 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Nofib suggests that this regresses allocations in `integer` by 6.0% and counted instructions by 0.1%. I had a look at the simplified Core and it seems that it's entirely due to the new definition, although I'm not sure where exactly this allocates more. Maybe it's due to an increase in closure size of `go_up` because of `single`?. Here's the [https://www.diffchecker.com/FrxIUoRQ Core diff] and the [https://github.com/sgraf812/ghc/blob/cf4c1a52916fbf1b6acadd9a2477672b876a860... new definition of efdtIntUpFB for reference]. It seems that `c` is still not inlined, even with the new definition. I assume that's because there are multiple occurences of `c` which were probably duplicated before the inliner had a chance to inline the argument `c`. It better had introduced a join point before. Maybe loopification helps here? Indeed [https://ghc.haskell.org/trac/ghc/ticket/14068#comment:47 #14068] suggests that something beneficial happens, maybe more so with this patch. Or we could introduce some kind of annotation mechanism to tell GHC to be careful not to duplicate occurences of certain parameters that occur once (`f {-# HUGE #-} c n = ...`). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:56 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): I ''really'' don't want to add new annotations! Would it be possible to distil out a small example of what goes wrong with `integer`? It may not be fixable, but it's always worth a try. I suppose the gain here is important... it's a fairly common case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:57 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): I dug through `dump-simpl-iterations` and noticed that the duplication happens through `efdtIntFB`s unfolding, which mentions `efdtIntUpFB` ''and'' `efdtIntDnFB`, which both want to inline their `c` later on. {{{ efdtIntFB = \ (@ r_a3cz) (c_a2ho :: Int -> r_a3cz -> r_a3cz) (n_a2hp :: r_a3cz) (x1_a2hq :: Int#) (x2_a2hr :: Int#) (y_a2hs :: Int#) -> case >=# x2_a2hr x1_a2hq of { __DEFAULT -> efdtIntDnFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs; 1# -> efdtIntUpFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs } }}} I'll see what we can do about that tomorrow... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:58 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Note that's not a problem for `[x..y]` (`eftInt`), because that doesn't need to consider counting down. It's not an issue for `[x,y..]` (`edtInt`), because although it calls `efdtInt{Up,Dn}` internally, it doesn't take part in fusion at all (is that an oversight or by design?). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:59 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Here's an implementation of `efdtIntFB` that fits our requirements: {{{ data Direction = Up | Dn deriving Eq direction :: Int# -> Int# -> Direction direction from to | isTrue# (to >=# from) = Up | otherwise = Dn efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r efdtIntFB c n x1 x2 y = emit first x1 where -- We can safely emit the first element if an iteration -- 'moves closer' to @y@. That's exactly the case when -- @dir_x2@ coincides with @dir_y@. !first = dir_x2 == dir_y !dir_x2 = direction x1 x2 !dir_y = direction x1 y -- We need the overflow flag in 'emit'. (# delta, delta_ovf #) = x2 `subIntC#` x1 -- | Think of @emit :: Maybe Int -> [Int]@, only unboxed. -- If the argument is 'Nothing', we reached the end of the list. -- If the argument is 'Just', we emit an element, compute -- the next candidate, validate it and recurse. emit False _ = n emit True x = I# x `c` emit next_ok next where -- Check that @next@ didn't move past @y@. -- Also, overflow is only allowed iff the computation for -- @delta@ overflowed. (# next, next_ovf #) = addIntC# x delta !next_ok = isTrue# (next_ovf ==# delta_ovf) && direction next y == dir_y -- TODO: evaluate strict && for branchless code {-# INLINE[0] efdtIntFB #-} }}} Some pros: - I find this much easier to understand. No complicated invariants, etc. - No Up/Dn variants to maintain. Still, if the direction is statically known, constant folding and inlining will simplify stuff to the equivalent code. - As a result, no more duplication of `c` occurrences - Also no more duplication of `n` occurrences Cons: - `emit`s closure is 4 words big (2 words bigger than the closure of the original `go_up` helper) in the most general form. It's unfortunate that we can't pack together `dir_y` and `delta_ovf` into a single word without confusing constant folding. This would need either some kind of constant propagation through bit fields (out of scope for GHC, I think) or a smarter closure allocation theme that packs together non-pointer payloads. - We pay for the generalisation of Up/Dn variants by having to compare with `dir_y` all the time. - `base` lacks `addWordC#` primitives, which I'll probably add now -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:60 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Looks plausible to me, but needs a careful Note to explain the issues. But before we go too far with this, I'd like to point to [wiki:LateLamLift late lambda lifting]. In the core reported in comment:44, all we need to do is lambda-lift `$wc` and `go_up` to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do. To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming. I wonder if, before going ahead with this somewhat-delicate `efdtIntFB` business, it might be fun to re-awaken LLF? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:61 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Looks plausible to me, but needs a careful Note to explain the issues.
But before we go too far with this, I'd like to point to [wiki:LateLamLift late lambda lifting]. In the core reported in comment:44, all we need to do is lambda-lift `$wc` and `go_up` to top level, and all will be well, I claim. And that is precisely what late- lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.
To me LLF is low-hanging fruit. There are promising results described on
#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Replying to [comment:61 simonpj]: the wiki page, and the whole join-point business eliminates its principal shortcoming.
I wonder if, before going ahead with this somewhat-delicate `efdtIntFB`
business, it might be fun to re-awaken LLF? I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all `c`s inline would be more beneficial in this scenario. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:62 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.
In general inlining is good. But in the case on this thread, I think it's the non-allocation of a function closure that saves all the work, rather than any optimisations that happen after inlining c. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:63 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): I rebased `wip/llf` onto a recent HEAD commit. You can find my progress here: https://github.com/sgraf812/ghc/tree/llf. Currently, it doesn't pass the testsuite (even in default mode, which doesn't do any new lambda lifting), probably because I introduced some errors during the merge. I think we should continue the discussion in #9476. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:64 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function)
-------------------------------------+-------------------------------------
Reporter: nh2 | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.6.1
Component: Compiler | Version: 7.6.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #7206 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): As this is now 4 years old, does it make sense to open a new issue and close this one? My problem of 3 years ago is basically fixed. In any case I have uploaded a fixed version, reprof.hs of repro,hs which was modifed per the comments in 49 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:67 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by George): * Attachment "reprof.hs" added. fixed version of repro, modified per comment 49 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): updated summary to match current state of this ticket. The speed issue seems to have been resolved. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:68 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Thanks for the curation, George! Now that I spent some time with late lambda lifting and have an implementation at hand, I revisited this issue. Indeed, the original `$wg` binding from comment:44 gets lifted to top- level, but there is no reduction in allocation to be had (it's just swapping each mention of the local binding in closures for its single free variable, after all) and instructions executed are roughly the same. That's because part of the original problem persists: The hot loop `go_up` is still not a join point. I still think the implementation from comment:60 is the way to go. With the improvements to constant folding in Phab:D4605, the mentioned increase in closure size should be constant-folded away in the majority of cases. I'll conduct some benchmarks next week. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:69 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by George): That sounds great! Thanks to you an everybody else who has improved this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:70 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): It turns out that the implementation from comment:60 doesn't get rid of the problem. It seems that the hard earned single call to `c` in `emit True x = I# x \`c\` emit next_ok next` gets duplicated because of case-of- case. The relevant Core began as this expression: {{{ case (case ==# next_ovf_a3hz delta_ovf_a3h0 of lwild_s4fL { __DEFAULT -> GHC.Types.False 1# -> b_a2jS }) of next_ok_a2k8 { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx) } }}} Now case-of-case comes along and immediately simplifies this to {{{ case ==# next_ovf_a3hz delta_ovf_a3h0 of { __DEFAULT -> case b_a2jS of { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf GHC.Types.False next_a3hx) }; 1# -> case b_a2jS of next_ok_a2k8 { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx) } } }}} I'm not sure if the intermediate join point is never generated or is just inlined immediately, but I'd very much like this not to duplicate the call to `c`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:71 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): This happens even in `-O0`. Not sure how to turn case-of-case of (for debugging reasons), `-fno-case-of-case` dosn't seem to be a thing anymore. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:72 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): What, exactly is "the problem"? You say in comment:69 that something is not a join point, and LLF helps a bit but not enough. Can you give a from-scratch explanation of what the problem is, perhaps assuming that LLF is available if necessary? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:73 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Sure will do. Let's begin with the original problem. There's two functions in [[attachment:reprof.hs]]. One is `f`, having a loop of the `forM_ [2..n] body` form, the other is `g` which has the `forM_ [2,3..n] body` form. The former doesn't allocate (56kB in total), whereas the latter allocates quite a lot (960MB). Why is that? The arithmetic sequences desugar, rewrite and specialise to `build (\c t -> eftIntFB c z 2 n)` and `build (\c z -> efdtIntFB c z 2 3 n)`, respectively. That cancels away with `forM_`'s implementation in terms of `foldr`: {{{ forM_= flip mapM_ mapM_ body = {- definition -} foldr ((>>) . body) (return ()) = {- eta expand the section -} foldr (\x k -> body x >> k) (return ()) = {- (>>) of ST, written as `s -> (a, s)` for lighter syntax -} foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ()) }}} Note how `k` occurs in tail position within the lambda. Now, this cancels with the definition of `ef(d)tIntFB`: {{{ foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ()) . build (\c z -> efdtIntFB c z 2 3 n) = {- foldr/build -} efdtIntFB (\x k s -> case (body x s) of (_, s') -> k s') (return ()) 2 3 n }}} So, that lambda is what is bound to `c` when `ef(d)tIntFB` gets inlined. Now, the implementation of `eftIntFB` only has a single occurrence of `c`, so it will inline properly and bind its `k` parameter (the continuation in tail position) to the recursive `go` call here [[https://github.com/ghc/ghc/blob/fa3143c76ac77ee96fd89559cacc089205abaa20/lib...]]. The call to `k` turns into a tail call within `go`s body: {{{ go x = I# x `c` if isTrue# (x ==# y) then n else go (x +# 1#) ==> c = \x k s -> case (body x s) of (_, s') -> k s' go x s = case (body (I# x) s) of (_, s') -> if isTrue# (x ==# y) then n s' else go (x +# 1#) s' }}} And `go` can be made a join point. The same isn't possible in the current `efdtIntFB`, because it duplicates `c` by branching on whether to count up or down (and also within the loop itself, anyway): {{{ efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r efdtIntFB c n x1 x2 y | isTrue# (x2 >=# x1) = efdtIntUpFB c n x1 x2 y | otherwise = efdtIntDnFB c n x1 x2 y }}} Now, when `efdtInt{Up,Dn}FB` gets inlined, we end up with a let-bound `c` that still tail-calls its argument `k`, and is even tail-called itself within `go_up`/`go_dn` (here https://github.com/ghc/ghc/blob/fa3143c76ac77ee96fd89559cacc089205abaa20/lib...): {{{ let c x k s = case (body x s) of (_, s') -> k s' in ... let go_up x | isTrue# (x ># y') = I# x `c` n | otherwise = I# x `c` go_up (x +# delta) }}} Note that `go_up` tail calls `c` and passes itself as the `k` parameter. If `c` was inlined, all would be fine and `go_up` would turn into a join point. That's not the case because `c` is duplicated in `efdtIntFB` and then one more time in `efdtInt{Up,Dn}FB`. My first implementation in comment:52 (for which you provided a simplification in comment:54) dealt with the latter, while the idea in comment:60 is supposed to deal with the former. Sadly, case-of-case seems to undo the painful de-duplication of the `c` parameter (that's what comment:71 is about). Why doesn't LLF help here? Well, lifting out `c` to top-level gets rid of allocations for `c`, but there's still at least the allocation for the thunk for `go_up (x+1)` (the `Int` box goes away because of strictness). Also, the call to `go_up` is still an unknown call, as opposed to the simple join call we would get by inlining `c`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:74 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): ---- I can think of another transformation that would save the day here: We just have to put `c` in the same recursive group as `go_up` and recognize the mutual recursive join point. Seems like a better way than to mess with the simplifier and we don't even risk duplication of any code! So {{{ let c x k s = case (body x s) of (_, s') -> k s' in ... let go_up x | isTrue# (x ># y') = I# x `c` n | otherwise = I# x `c` go_up (x +# delta) ==> float `c` inwards into the same recursive group, specialise it for `go_up (x+#delta)` and `n` as `k` (SpecConstr? Would entail seeing tail- calls as kind-of a pattern match for functions) join go_up x | isTrue# (x ># y') = jump c1 (I# x) | otherwise = c2 (I# x) go_up (x +# delta) c1 x s = -- k = n case (body x s) of (_, s') -> n s' c2 x s = -- k = go_up (x +# delta) case (body x s) of (_, s') -> go_up (x +# delta) s' }}} Well, it's probably not SpecConstr that will do the specialisation. Also, why specialise when we could just inline `c`? Seems like we risk duplication of the potentially huge `body` after all. Although the same weakness doesn't apply to the situation in comment:71: It's enough to specialise for the `emit` argument (which serves a similar role as `go_up`) without any specific arguments to see that it's tail called: {{{ let c_a2jU x k s = case (body x s) of (_, s') -> k s' in join emit_a4hf next_ok next = ... case ==# next_ovf_a3hz delta_ovf_a3h0 of { __DEFAULT -> case b_a2jS of { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf GHC.Types.False next_a3hx) }; 1# -> case b_a2jS of next_ok_a2k8 { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx) } } ==> Specialising `c` for the call pattern `[ds s next_ok next] |> [I# ds, emit next_ok next, s]` as `c'` join c' ds s next_ok next = case (body (I# ds) s) of (_, s') -> emit_a4hf next_ok next s' emit_a4hf next_ok next s = ... case ==# next_ovf_a3hz delta_ovf_a3h0 of { __DEFAULT -> case b_a2jS of { __DEFAULT -> jump c' ds_d42Y GHC.Types.False next_a3hx s }; 1# -> case b_a2jS of next_ok_a2k8 { __DEFAULT -> jump c' ds_d42Y next_ok_a2k8 next_a3hx s } } }}} This latter case is probably a lot easier to handle. Maybe this is worth some specialised pass? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:75 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Thanks for the explanation. Try this {{{ forM_2 :: (Foldable t, Monad m) => t a -> (a -> m b) -> m () forM_2 xs f = let c x k = f x >> k {-# INLINE c #-} in foldr c (return ()) xs }}} and use `forM_2` instead of `forM_` in the outer calls in `f` and `g`. I then get good results for both. How does this work? Well by marking `c` as INLINE, I prevent `f` from inlining into it -- remember, the promise of INLINE things is that what you write gets inlined. And this is what we want: `c` is small, just `f x >> k`, and inlining it is very very good. Without the INLINE pragmas on `c` we have something like {{{ let f = BIG in let c x k = f x >> k in BODY }}} Since `f` occurs just once, we inline `f` to give {{{ let c x k = BIG x >> k in BODY }}} and now `c` becomes too big to inline. This is a classic inlining dilemma: do we inline `f` into `c` or `c` into `BODY`? The latter is much better in this case. I think we could build this into the libraries just by changing the definition of `mapM_`. Do you agree? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:76 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Yes, that works and seems a lot simpler! I didn't know that `INLINE` works 'both ways', e.g. that we can prevent inlining the body into `c` this way. The only minor downside is that there might be more unidentified cases where such an `INLINE` annotation on a lambda could be beneficial. I'll prepare a patch and benchmark. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:77 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way.
Yes, it's super-important that INLINE works like that. Consider {{{ f x = <very big> g y = Just (f y) {-# INLINE g #-} }}} where that is the only occurrence of `f`, but there are zillions of occurrences of `g`. You would be jolly annoyed if GHC inlined `<very big>` into `g`, and then inlined the new `g` at each of its zillions of call sites!! No: INLINE says "when you see a saturated application of this function, inline ''the RHSI wrote'' at the call site". It might be useful to beef up the documentation of INLINE to explain this, if you felt able to do so. I agree that there may be other places such an INLINE would be desirable. A good start would be a careful Note on `mapM_` explaining from first principles why the local let and INLINE is important. Then others can refer to that Note. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:78 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Right, actually I was aware of INLINE unfoldings being captured as the unoptimized variant, but have never seen it to actively prevent (or rather postpone) inlining of another part of the code. Quite a neat trick! I've identified some other functions that should probably rewritten the same way (`traverse_`, `foldrM`, etc.). No regressions, no improvements according to NoFib. Phab:D5131 passes `./validate.sh`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:79 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Changes (by sgraf): * differential: => Phab:D5131 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:80 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Here's a longer, and to me more comprehensible, Note {{{ Note [List fusion and continuations in 'c'] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppose we define mapM_ f = foldr ((>>) . f) (return ()) (this is the way it used to be). Now suppose we want to optimise the call mapM_ <big> (build g) where g c n = ...(c x1 y1)...(c x2 y2)....n... GHC used to proceed like this: mapM_ <big> (build g) = { Defintion of mapM_ } foldr ((>>) . <big>) (return ()) (build g) = { foldr/build rule } g ((>>) . <big>) (return ()) = { Inline g ] let c = (>>) . <big> n = return () in ...(c x1 y1)...(c x2 y2)....n... The trouble is that `c`, being big, will not be inlined. And that can be absolutely terrible for performance, as we saw in Trac #8763. It's much better to define mapM_ f = foldr c (return ()) where c x k = f x >> k {-# INLINE c #-} Now we get mapM_ <big> (build g) = { inline mapM_ } foldr c (return ()) (build g) where c x k = f x >> k {-# INLINE c #-} f = <big> Notice that `f` does not inine into the RHS of `c`, because the ININE pragma stops it; see Note [How INLINE pragmas /prevent/ inlining]. Continuing: = { foldr/build rule } g c (return ()) where ... c x k = f x >> k {-# INLINE c #-} f = <big> = { inline g } ...(c x1 y1)...(c x2 y2)....n... where c x k = f x >> k {-# INLINE c #-} f = <big> n = return () Now, crucially, `c` does inline = { inline c } ...(f x1 >> y1)...(f x2 >> y2)....n... where f = <big> n = return () And all is well! The key thing is that the fragment `(f x1 >> y1)` is inlined into the body of the builder `g`. }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:81 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): OK, I revised the patch. I agree, your examples are worth a thousand words. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:82 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Well, I revised the patch but I am now at work, where they block some outgoing ports needed for pushing to staging. Long story short, this will have to wait until tonight or at least until I figured out a way to configure a proxy. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:83 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Let's commit this! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:84 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more)
-------------------------------------+-------------------------------------
Reporter: nh2 | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.8.1
Component: Compiler | Version: 7.6.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #7206 | Differential Rev(s): Phab:D5131
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Krzysztof Gogolewski

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: merge Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Changes (by monoidal): * status: new => merge Comment: Not sure if we'd like to merge this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:86 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8763: forM_ [1..N] does not get fused (allocates 50% more) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.6.3 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Phab:D5131 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: merge => closed * resolution: => fixed * milestone: 8.6.2 => 8.8.1 Comment: I'm going to leave this for 8.8. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:88 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC