
#14035: Weird performance results. -------------------------------------+------------------------------------- Reporter: danilo2 | Owner: (none) Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by danilo2: Old description:
Hi! I was recently testing performance of a critical code in a product we are shipping and I'm getting really weird results. The first "part" of it I've described here: https://ghc.haskell.org/trac/ghc/ticket/14013 (and I will be referring to this issue later), however there are some other problems that do not seem directly connected to 14013 ticket.
**The code is compiled with `-XStrict` enabled globally. The full source code for this ticket is attached.** The code is a pseudo-parser implementation. It consumes any char in a loop and fails on empty input in the end.
Everything was compiled with following options (and many variations): `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick- factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.
\\
== Helpers
Let's define 2 helpers: {{{#!hs (.) :: (b -> c) -> (a -> b) -> a -> c (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}
dotl :: (b -> c) -> (a -> b) -> a -> c dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-} }}}
So whenever we see `.` in code it is strict in all of its arguments.
\\
== Strict StateT performance improvement
Let's consider following code: {{{#!hs import qualified Control.Monad.State.Strict as S
newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving (Applicative, Functor, Monad, MonadTrans)
class MonadState s m | m -> s where get :: m s put :: s -> m ()
runStateT :: forall s m a. StateT s m a -> s -> m (a, s) evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a runStateT m s = S.runStateT (fromStateT m) s ; {-# INLINE runStateT #-} evalStateT m = fmap fst . runStateT m ; {-# INLINE evalStateT #-}
instance Monad m => MonadState s (StateT s m) where get = StateT S.get ; {-# INLINE get #-} put = StateT . S.put ; {-# INLINE put #-} }}}
There are few non-obvious things to note here: 1. This wrapper performs about **15 TIMES better** than `Control.Monad.State.Strict.StateT` (in the provided examples) and if we create a loop in pure code imitating a parser, this `StateT` gets completely optimized away, while the `mtl`'s version does not.
2. If we replace the following functions with lazy composition, we get the same, high performance: {{{#!hs runStateT = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT #-} evalStateT m = fmap fst `dotl` runStateT m ; {-# INLINE evalStateT #-} }}}
3. However, if we slightly change the `evalStateT`, we've got the bad performance, equals to the `mtl`'s `StateT` version (15 times slower): {{{#!hs evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-} }}}
It's a very strange result, especially that `evalStateT` is used only once in the code while running the tests.
\\
== Strict Either & EitherT
The code contains a very minimalistic implementation of `Either` and `EitherT` in order to make their definitions and utils strict. These definitions are copy-pasted and simplified (removed unused code) from: https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Tr...
\\
== Strict Bool and tuple
Moreover we define strict Bool `or` operation and 2-element tuple with strict arguments:
{{{#!hs data T a b = T !a !b deriving (Generic, Show, Functor)
data XBool = XTrue | XFalse deriving (Show, Generic)
(|||) :: XBool -> XBool -> XBool (|||) !a !b = case a of XTrue -> a XFalse -> b {-# INLINE (|||) #-} }}}
\\
== Parser implementation
All the above declarations were simple helpers compiled with `-XStrict`, because available libraries do not provide them for us. This code is a "real" use case and shows the weird performance results.
The parser implementation is simple: {{{#!hs newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) } deriving (Functor)
instance Monad m => Applicative (FailParser m) where pure = undefined (<*>) = undefined
instance Monad m => Monad (FailParser m) where return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-} FailParser ma >>= f = FailParser $ do T !b a <- ma T !b' a' <- fromFailParser $ f a return $! T (b ||| b') a' {-# INLINE (>>=) #-} _ >> _ = undefined ; {-# INLINE (>>) #-}
instance MonadTrans (FailParser) where lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-} }}}
We use `undefined` for non-important functions. The parser is `EitherT` wrapper: Left happens when we failed parsing input, while Right otherwise. The `XBool` denotes if we made any progress (so after consuming a letter it is set to `XTrue`). There are some additional util functions, like `returnProgress` which behaves just like return, but also sets the `XBool` value to `XTrue`:
{{{#!hs instance Monad m => MonadProgressParser (FailParser m) where returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE returnProgress #-} -- In correct code it should be XTrue as described below. }}}
In the provided code, there is `XFalse` used instead of `XTrue` because of some interesting observations:
1. The `XBool` value is used ONLY in the expression `return $! T (b ||| b') a'` above, so it does NOT affect the way the program logically executes it's body.
2. Both `b` and `b'` are strict and fully evaluated.
3. If both `b` and `b'` are `XFalse` (as in the provided code, they always are `XFalse`) we get a good performance. In order to test it, the above code contains `XFalse` instead of `XTrue`.
4. If we use the correct version of `returnProgress` as described just before point 1 above, we get 15 times slower performance (the same or very similar to the one when used altered `evalStateT` definition). We could try to explain it: maybe Haskell was able to optimize code if it discovered, there always were only `XFalse` values used and after the change there are both `XTrue` and `XFalse`, so it really has to run the `(|||)` operator. This way of thinking fails as fast as we check that changing `XFalse` to `XTrue` **everywhere** in the code give us again bad performance.
5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we get good performance, while replacing it with `return $! T b a'` give us bad performance. It does not make any sense, because both `b` and `b'` are strict and fully evaluated. Moreover, it is the only place in code where they are used.
6. However replacing `return $! T (b ||| b') a'` with `return $! T (b' ||| b) a'` does NOT change the performance (we are getting the good one).
== Final notes
We've been talking with some people - both in the company I'm working in as well as on IRC and we do not see any reason why this code behaves in this way and why it is so sensitive to the changes. In fact we started to be worried a lot about how we can use Haskell for high-performance parts at all if it is not obvious if a very simple changes do not affect performance so much, like changing `evalStateT m = fmap fst . runStateT m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives 15 times slowdown. It makes the code both very fragile to any changes as well as makes it unmaintainable. Tracking performance in a very small program (like the attached one) is possible, while tracking it in bigger one, taking in considerating the described problems, make it almost impossible. I'm writing this because I'm worried about where these problems originate from and I would really like to solve them / know why they appear and be sure we can continue to use Haskell for our high- performance demanding applications.
New description: Hi! I was recently testing performance of a critical code in a product we are shipping and I'm getting really weird results. **The code is compiled with `-XStrict` enabled globally. The full source code for this ticket is attached.** The code is a pseudo-parser implementation. It consumes any char in a loop and fails on empty input in the end. Everything was compiled with following options (and many variations): `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick- factor=1000 -flate-dmd-anal -fspecialise-aggressively"`. \\ == Helpers Let's define 2 helpers: {{{#!hs (.) :: (b -> c) -> (a -> b) -> a -> c (.) f g = \x -> f (g x) ; {-# INLINE (.) #-} dotl :: (b -> c) -> (a -> b) -> a -> c dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-} }}} So whenever we see `.` in code it is strict in all of its arguments. \\ == Strict StateT performance improvement Let's consider following code: {{{#!hs import qualified Control.Monad.State.Strict as S newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving (Applicative, Functor, Monad, MonadTrans) class MonadState s m | m -> s where get :: m s put :: s -> m () runStateT :: forall s m a. StateT s m a -> s -> m (a, s) evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a runStateT m s = S.runStateT (fromStateT m) s ; {-# INLINE runStateT #-} evalStateT m = fmap fst . runStateT m ; {-# INLINE evalStateT #-} instance Monad m => MonadState s (StateT s m) where get = StateT S.get ; {-# INLINE get #-} put = StateT . S.put ; {-# INLINE put #-} }}} There are few non-obvious things to note here: 1. This wrapper performs about **15 TIMES better** than `Control.Monad.State.Strict.StateT` (in the provided examples) and if we create a loop in pure code imitating a parser, this `StateT` gets completely optimized away, while the `mtl`'s version does not. 2. If we replace the following functions with lazy composition, we get the same, high performance: {{{#!hs runStateT = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT #-} evalStateT m = fmap fst `dotl` runStateT m ; {-# INLINE evalStateT #-} }}} 3. However, if we slightly change the `evalStateT`, we've got the bad performance, equals to the `mtl`'s `StateT` version (15 times slower): {{{#!hs evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-} }}} It's a very strange result, especially that `evalStateT` is used only once in the code while running the tests. \\ == Strict Either & EitherT The code contains a very minimalistic implementation of `Either` and `EitherT` in order to make their definitions and utils strict. These definitions are copy-pasted and simplified (removed unused code) from: https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Tr... \\ == Strict Bool and tuple Moreover we define strict Bool `or` operation and 2-element tuple with strict arguments: {{{#!hs data T a b = T !a !b deriving (Generic, Show, Functor) data XBool = XTrue | XFalse deriving (Show, Generic) (|||) :: XBool -> XBool -> XBool (|||) !a !b = case a of XTrue -> a XFalse -> b {-# INLINE (|||) #-} }}} \\ == Parser implementation All the above declarations were simple helpers compiled with `-XStrict`, because available libraries do not provide them for us. This code is a "real" use case and shows the weird performance results. The parser implementation is simple: {{{#!hs newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) } deriving (Functor) instance Monad m => Applicative (FailParser m) where pure = undefined (<*>) = undefined instance Monad m => Monad (FailParser m) where return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-} FailParser ma >>= f = FailParser $ do T !b a <- ma T !b' a' <- fromFailParser $ f a return $! T (b ||| b') a' {-# INLINE (>>=) #-} _ >> _ = undefined ; {-# INLINE (>>) #-} instance MonadTrans (FailParser) where lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-} }}} We use `undefined` for non-important functions. The parser is `EitherT` wrapper: Left happens when we failed parsing input, while Right otherwise. The `XBool` denotes if we made any progress (so after consuming a letter it is set to `XTrue`). There are some additional util functions, like `returnProgress` which behaves just like return, but also sets the `XBool` value to `XTrue`: {{{#!hs instance Monad m => MonadProgressParser (FailParser m) where returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE returnProgress #-} -- In correct code it should be XTrue as described below. }}} In the provided code, there is `XFalse` used instead of `XTrue` because of some interesting observations: 1. The `XBool` value is used ONLY in the expression `return $! T (b ||| b') a'` above, so it does NOT affect the way the program logically executes it's body. 2. Both `b` and `b'` are strict and fully evaluated. 3. If both `b` and `b'` are `XFalse` (as in the provided code, they always are `XFalse`) we get a good performance. In order to test it, the above code contains `XFalse` instead of `XTrue`. 4. If we use the correct version of `returnProgress` as described just before point 1 above, we get 15 times slower performance (the same or very similar to the one when used altered `evalStateT` definition). We could try to explain it: maybe Haskell was able to optimize code if it discovered, there always were only `XFalse` values used and after the change there are both `XTrue` and `XFalse`, so it really has to run the `(|||)` operator. This way of thinking fails as fast as we check that changing `XFalse` to `XTrue` **everywhere** in the code give us again bad performance. 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we get good performance, while replacing it with `return $! T b a'` give us bad performance. It does not make any sense, because both `b` and `b'` are strict and fully evaluated. Moreover, it is the only place in code where they are used. 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b' ||| b) a'` does NOT change the performance (we are getting the good one). == Final notes We've been talking with some people - both in the company I'm working in as well as on IRC and we do not see any reason why this code behaves in this way and why it is so sensitive to the changes. In fact we started to be worried a lot about how we can use Haskell for high-performance parts at all if it is not obvious if a very simple changes do not affect performance so much, like changing `evalStateT m = fmap fst . runStateT m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives 15 times slowdown. It makes the code both very fragile to any changes as well as makes it unmaintainable. Tracking performance in a very small program (like the attached one) is possible, while tracking it in bigger one, taking in considerating the described problems, make it almost impossible. I'm writing this because I'm worried about where these problems originate from and I would really like to solve them / know why they appear and be sure we can continue to use Haskell for our high-performance demanding applications. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14035#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler