[GHC] #16040: Unboxing-Related Performance Issue with Polymorphic Functions

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- My team has observed a 2x performance degradation in code that makes use of `StateT` that appears to be related to strictness and unboxing, even when built with `-O2`. Our code makes heavy use of the state monad, and when GHC fails to optimise this performance We've managed to minimise the behaviour to the following reproducer (that ignores state entirely), consisting of two files. It depends on `criterion` `Lib.hs`: {{{#!hs {-# LANGUAGE BangPatterns #-} module Lib where -- A type to take the place of state data X a = X { runX :: !a } test1 :: Int -> Int test1 = \(!i) -> go i where go = \(!i) -> if i > 0 then go $! i - 1 else i {-# NOINLINE test1 #-} test2 :: Int -> Int test2 = \(!i) -> runX (go i) where go = \(!i) -> if i > 0 then go $! i - 1 else X i {-# NOINLINE test2 #-} }}} `Main.hs`: {{{#!hs {-# LANGUAGE Strict #-} module Main where import Lib import Criterion import Criterion.Main main :: IO () main = defaultMain [ bgroup "main" [ bgroup "small" [ bench "without state" $ whnf test1 100000000 , bench "with state" $ whnf test2 100000000 ] ] ] }}} Run as above, the code takes twice as long to execute `test2` as it does `test1`. However, when the signature for `runX` is changed to `runX :: !Int`, the code in `test2` exhibits identical performance to `test1`. It has been reproduced across multiple linux64 machines, but not tested on any other architecture or operating system. Please find the full (stack - yes, I know) project as an attachment. You can simply `stack run` to observe the issue. If you have any further queries please let me know. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by _recursion): * Attachment "bug.tar.gz" added. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): I have not yet tried with criterion, but I tried with {{{ main = print (test1 10000000) }}} or `test2`, and ran with `+RTS -s`. I got the same amount of allocation either way. Do you see that too? If so, it's not a heap-allocation issue. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by _recursion): Good point! I forgot to check that. Allocations are very close to each other but not identical. Both of the outputs are attached. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by _recursion): * Attachment "test1.log" added. `test1` allocation report -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by _recursion): * Attachment "test2.log" added. `test2` allocation report -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I think this would be fixed by nested cpr, see #1600. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by _recursion): A cursory read looks like it would, yes! Nevertheless, this is pretty concerning. I now have no idea why most of our codebase isn't getting hit by this issue, and suddenly I've run into a place where it occurs. I'm a little afraid for the performance stability of our code across GHC upgrades now. Is it something that could only be fixed by nested CPR or could a band-aid fix be put in place, do you think? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Interesting. For `test1` (the fast one) we get {{{ Rec { $wgo_r2xK :: GHC.Prim.Int# -> GHC.Prim.Int# $wgo_r2xK = \ (ww_s2w4 :: GHC.Prim.Int#) -> case GHC.Prim.># ww_s2w4 0# of { __DEFAULT -> ww_s2w4; 1# -> $wgo_r2xK (GHC.Prim.-# ww_s2w4 1#) } end Rec } T16040.$wtest1 :: GHC.Prim.Int# -> GHC.Prim.Int# T16040.$wtest1 = \ (ww_s2wd :: GHC.Prim.Int#) -> $wgo_r2xK ww_s2wd test1 :: Int -> Int test1 = \ (w_s2wa :: Int) -> case w_s2wa of { GHC.Types.I# ww1_s2wd -> case T16040.$wtest1 ww1_s2wd of ww2_s2wh { __DEFAULT -> GHC.Types.I# ww2_s2wh } } }}} Notice that loop `$wgo` does not allocation at all. For `test2` we get {{{ Rec { $wgo1_r2xL :: GHC.Prim.Int# -> (# Int #) $wgo1_r2xL = \ (ww_s2wm :: GHC.Prim.Int#) -> case GHC.Prim.># ww_s2wm 0# of { __DEFAULT -> (# GHC.Types.I# ww_s2wm #); 1# -> $wgo1_r2xL (GHC.Prim.-# ww_s2wm 1#) } end Rec } T16040.$wtest2 :: GHC.Prim.Int# -> Int T16040.$wtest2 = \ (ww_s2wv :: GHC.Prim.Int#) -> case $wgo1_r2xL ww_s2wv of { (# ww2_s2wy #) -> ww2_s2wy } test2 :: Int -> Int test2 = \ (w_s2ws :: Int) -> case w_s2ws of { GHC.Types.I# ww1_s2wv -> T16040.$wtest2 ww1_s2wv } }}} Here the `$wgo1` loop does no allocation on its hot path, but does allocate an `I# ww` box as it returns. I think that `$wgo1` is doing a heap-check on ''every'' iteration, at the start of the function. It would be better to do the check only on the ''last'' iteration, in the `DEFAULT` branch. I bet these redundant heap checks are what is taking the time. We have long-standing tickets about this; see the summary in comment:2 of Trac #14791. I would love someone to work on this. To catch cases like this should not be very hard. By "like this" I mean * A primitive case * Which has no allocation "upstream" (i.e. before it) * And at least one alternative does no allocation. ---------- Matthew is right that Nested CPR would also fix this. The trouble is that the recursive `go` from `test2` returns an `X Int`, whose representation has two levels of box; eg `X (I# 3#)`. The current CPR transform can't optimise that. Nested-CPR can, and would make ''neither'' branch of `$wgo2` allocate Tantalizingly, we have a nested-cpr patch nearly ready to go. (See Trac #1600.) But it needs someone to pay it some sustained attention. ---------- I don't see an obvious band-aid. This is a long-standing issue, not a regression. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by _recursion): Thank you for the detailed explanation, Simon! I took a look at the core but missed seeing the allocation previously. Regarding fixing this or CPR, would nested CPR fix all cases 'like this', as you put it, or even if the nested CPR patch lands would there still be a need for functionality to catch cases like this? The reason I ask is that I'd be quite willing to take a look at either over the holiday in some free time, assuming I can find some guidance, and I'm wondering which would be best to attack. Regarding the nested-cpr patch, what kind of attention is needed? Do you think that, ''with'' said attention it could be landed for 8.8? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Regarding fixing this or CPR, would nested CPR fix all cases 'like
#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): this', as you put it, or even if the nested CPR patch lands would there still be a need for functionality to catch cases like this? I think we want ''both'' nested CPR ''and'' better placement of heap checks. They do different things -- it just so happens that in this case either would solve it. Great that you can work on it. I'd suggest the heap-check placement one, because it's a well-contained problem, I think. Nested CPR may have a longer on-ramp. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by _recursion): “You think” - sounds promising! But yes if both are going to be useful then I think I’ll try my hand at the heap-check placemement one when I next find some time! It sounds like a decently well-contained problem, and I’ve been meaning to contribute more than just filing tickets for a while. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Old description:
My team has observed a 2x performance degradation in code that makes use of `StateT` that appears to be related to strictness and unboxing, even when built with `-O2`. Our code makes heavy use of the state monad, and when GHC fails to optimise this performance
We've managed to minimise the behaviour to the following reproducer (that ignores state entirely), consisting of two files. It depends on `criterion`
`Lib.hs`:
{{{#!hs {-# LANGUAGE BangPatterns #-}
module Lib where
-- A type to take the place of state data X a = X { runX :: !a }
test1 :: Int -> Int test1 = \(!i) -> go i where go = \(!i) -> if i > 0 then go $! i - 1 else i {-# NOINLINE test1 #-}
test2 :: Int -> Int test2 = \(!i) -> runX (go i) where go = \(!i) -> if i > 0 then go $! i - 1 else X i {-# NOINLINE test2 #-} }}}
`Main.hs`:
{{{#!hs {-# LANGUAGE Strict #-} module Main where
import Lib import Criterion import Criterion.Main
main :: IO () main = defaultMain [ bgroup "main" [ bgroup "small" [ bench "without state" $ whnf test1 100000000 , bench "with state" $ whnf test2 100000000 ] ] ] }}}
Run as above, the code takes twice as long to execute `test2` as it does `test1`. However, when the signature for `runX` is changed to `runX :: !Int`, the code in `test2` exhibits identical performance to `test1`.
It has been reproduced across multiple linux64 machines, but not tested on any other architecture or operating system.
Please find the full (stack - yes, I know) project as an attachment. You can simply `stack run` to observe the issue. If you have any further queries please let me know.
New description: My team has observed a 2x performance degradation in code that makes use of `StateT` that appears to be related to strictness and unboxing, even when built with `-O2`. Our code makes heavy use of the state monad, and when GHC fails to optimise this usage, the performance becomes untenably slow. We've managed to minimise the behaviour to the following reproducer (that ignores state entirely), consisting of two files. It depends on `criterion` `Lib.hs`: {{{#!hs {-# LANGUAGE BangPatterns #-} module Lib where -- A type to take the place of state data X a = X { runX :: !a } test1 :: Int -> Int test1 = \(!i) -> go i where go = \(!i) -> if i > 0 then go $! i - 1 else i {-# NOINLINE test1 #-} test2 :: Int -> Int test2 = \(!i) -> runX (go i) where go = \(!i) -> if i > 0 then go $! i - 1 else X i {-# NOINLINE test2 #-} }}} `Main.hs`: {{{#!hs {-# LANGUAGE Strict #-} module Main where import Lib import Criterion import Criterion.Main main :: IO () main = defaultMain [ bgroup "main" [ bgroup "small" [ bench "without state" $ whnf test1 100000000 , bench "with state" $ whnf test2 100000000 ] ] ] }}} Run as above, the code takes twice as long to execute `test2` as it does `test1`. However, when the signature for `runX` is changed to `runX :: !Int`, the code in `test2` exhibits identical performance to `test1`. It has been reproduced across multiple linux64 machines, but not tested on any other architecture or operating system. Please find the full (stack - yes, I know) project as an attachment. You can simply `stack run` to observe the issue. If you have any further queries please let me know. -- Comment (by _recursion): Unfinished sentence in the report because I am sometimes unobservant. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: CPRAnalysis Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonpj): * keywords: => CPRAnalysis -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16040: Unboxing-Related Performance Issue with Polymorphic Functions -------------------------------------+------------------------------------- Reporter: _recursion | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.2 Resolution: | Keywords: CPRAnalysis Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sgraf): * cc: sgraf (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16040#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC