[GHC] #11795: Performance issues with replicateM_

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: | Version: 7.10.3 libraries/base | Keywords: | Operating System: MacOS X Architecture: x86_64 | Type of failure: Runtime (amd64) | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- When working on optimizing a program by minimizing allocations, I can into an issue with `replicateM_`. Consider the following code {{{#!hs import Control.Monad (replicateM_) import Foreign.C.String (withCString) import Foreign.Storable (peek) main :: IO () main = withCString "foo" $ replicateM_ 10000000 . peek }}} When I run this program, I get: 160,042,656 bytes allocated in the heap The result is the same whether I compile with `-O0`, `-O`, or `-O2`. And as expected, the total allocation increases or decreases based on the numbers of times I replicate the action. On the other hand, replacing `replicateM_` with a hand-written version makes the total allocations for the program only 42KB, and does not increase with the numbers of replications. {{{#!hs replicateM_ :: Monad m => Int -> m a -> m () replicateM_ cnt0 f = loop cnt0 where loop cnt | cnt <= 0 = return () | otherwise = f >> loop (cnt - 1) }}} By contrast, `Control.Monad.replicateM_` looks like: {{{#!hs replicateM_ :: (Monad m) => Int -> m a -> m () {-# INLINEABLE replicateM_ #-} {-# SPECIALISE replicateM_ :: Int -> IO a -> IO () #-} {-# SPECIALISE replicateM_ :: Int -> Maybe a -> Maybe () #-} replicateM_ n x = sequence_ (replicate n x) }}} I can't see an advantage to this implementation over the more direct implementation I've provided above. Unless there are objections, I'll send a patch to switch the implementation. (Since master already uses `Applicative`, I'll make the relevant updates to generalize the function signature too.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by snoyberg): * owner: => snoyberg -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Michael, this is (sadly) a well known issue. #1168 has a list of related tickets, and #9388 has ideas and preliminary work on how to limit the scope of the hack. I think it'd be fine to switch implementations of `replicateM_`, but can you put a prominent notice on the master ticket #1168 so that whoever looks into it doesn't think "oh the problem has gone away". Your fix will (helpfully) cure the symptom but not the disease. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by snoyberg): Thanks for the quick response Simon. I'm definitely not deeply familiar with the state hack, but from what I can see the issue I'm reporting here may be orthogonal. Specifically, if I pass in `-fno-state-hack`, I still see the large-allocation behavior. I believe the issue here is twofold: * The specialize pragma is getting the way of inlining firing * The behavior of generating and consuming a list is less efficient than the simpler code I've provided here I'll be happy to add comments to any other issues as you see fit, but I don't want to add confusion to an already complicated issue if this is in fact separate from the state hack. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by snoyberg): Actually, it would appear that Ben already modified the implementation of `replicateM_` to something different than I saw: {{{#!hs replicateM_ 0 _ = pure () replicateM_ n x = x *> replicateM_ (pred n) x }}} The implementation I've given here still seems to perform better, due to usage of the worker/wrapper transform. It also keeps the semantics of the current version in GHC 7.10.3 for negative values. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Changes (by snoyberg): * differential: => Phab:D2086 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): On Phab:D2086 Simon posed the very reasonable question of //why// these two differ so significantly in performance. In my original reading of the patch I had assumed that the reason was that the "fast" version would enable unboxing of the `Int` accumulator while the "slow" version would not. Looking at the Core, however, this does not appear to be the difference: this accumulator is unboxed in both cases. The actual difference is that the self-recursive nature of the slow version inhibits inlining, which means we never get to specialise to the action. Instead we end up with, {{{#!hs $w$sreplicateM_noWW :: forall a. Int# -> IO a -> State# RealWorld -> (# State# RealWorld, () #) $w$sreplicateM_noWW = \ (@ a) (ww :: Int#) (w :: IO a) (w1 :: State# RealWorld) -> case ww of ds { __DEFAULT -> case (w `cast` ...) w1 of _ { (# ipv, ipv1 #) -> $w$sreplicateM_noWW @ a (-# ds 1#) w ipv }; 0# -> (# w1, () #) } lvl2 :: Ptr CChar -> State# RealWorld -> (# State# RealWorld, () #) lvl2 = \ (x :: Ptr CChar) (eta :: State# RealWorld) -> $w$sreplicateM_noWW @ CChar 100000000# (($fStorableInt21 (x `cast` ...)) `cast` ...) eta }}} Which gives us O(n) slow calls. This stands in stark contrast to the fast variant, which compiles down to this beautiful allocation-free loop, {{{#!hs $wgo12 :: Int# -> State# RealWorld -> (# State# RealWorld, () #) $wgo12 = \ (ww :: Int#) (w :: State# RealWorld) -> case tagToEnum# @ Bool (<=# ww 0#) of _ { False -> case getForeignEncoding1 of _ { (getForeignEncoding5, setForeignEncoding1) -> case (getForeignEncoding5 `cast` ...) w of _ { (# ipv, ipv1 #) -> case charIsRepresentable3 @ () ipv1 lvl1 (lvl2 `cast` ...) ipv of _ { (# ipv2, ipv3 #) -> case seq# @ () @ RealWorld ipv3 ipv2 of _ { (# ipv4, ipv5 #) -> $wgo12 (-# ww 1#) ipv4 } } } }; True -> (# w, () #) } }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Comment (by snoyberg): Should I update my changes to refer to the above comment by Ben, which is a pretty thorough coverage of what's going on here? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: new Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Great. So it inlines bodily, and specialises to the particular call site. That is precisely what I speculated in my Phab comment. So yes, it'd good to make a Note in the code which says what my comment says, and points to this ticket for more detail. I've learned that it's extremely helpful leaving breadcrumbs like this in the code, so the next time someone is fiddling with that code, they do so in the light of earlier learning. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: merge Priority: normal | Milestone: Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: new => merge -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: merge Priority: normal | Milestone: 8.0.1 Component: libraries/base | Version: 7.10.3 Resolution: | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * milestone: => 8.0.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11795: Performance issues with replicateM_
-------------------------------------+-------------------------------------
Reporter: snoyberg | Owner: snoyberg
Type: bug | Status: merge
Priority: normal | Milestone: 8.0.1
Component: libraries/base | Version: 7.10.3
Resolution: | Keywords:
Operating System: MacOS X | Architecture: x86_64
Type of failure: Runtime | (amd64)
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D2086
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#11795: Performance issues with replicateM_ -------------------------------------+------------------------------------- Reporter: snoyberg | Owner: snoyberg Type: bug | Status: closed Priority: normal | Milestone: 8.0.1 Component: libraries/base | Version: 7.10.3 Resolution: fixed | Keywords: Operating System: MacOS X | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D2086 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: merge => closed * resolution: => fixed Comment: Merged to `ghc-8.0` as 7c6bc78fe1110be426de0bf95157f114d216b3aa. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11795#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC