[GHC] #14035: Weird performance results.

27 Jul 2017

      #14035: Weird performance results.
-------------------------------------+-------------------------------------
           Reporter:  danilo2        |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  high           |         Milestone:
          Component:  Compiler       |           Version:  8.0.1
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 Hi! I was recently testing performance of a critical code in a product we
 are shipping and I'm getting really weird results. The first "part" of it
 I've described here: https://ghc.haskell.org/trac/ghc/ticket/14013 (and I
 will be referring to this issue later), however there are some other
 problems that do not seem directly connected to 14013 ticket.

 **The code is compiled with `-XStrict` enabled globally. The full source
 code for this ticket is attached.**
 The code is a pseudo-parser implementation. It consumes any char in a loop
 and fails on empty input in the end.

 \\

 == Helpers

 Let's define 2 helpers:
 {{{#!hs
 (.) :: (b -> c) -> (a -> b) -> a -> c
 (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}

 dotl :: (b -> c) -> (a -> b) -> a -> c
 dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
 }}}

 So whenever we see `.` in code it is strict in all of its arguments.

 \\

 == Strict StateT performance improvement

 Let's consider following code:
 {{{#!hs
 import qualified Control.Monad.State.Strict as S

 newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
 (Applicative, Functor, Monad, MonadTrans)

 class MonadState s m | m -> s where
     get :: m s
     put :: s -> m ()

 runStateT  :: forall s m a.              StateT s m a -> s -> m (a, s)
 evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
 runStateT  m s = S.runStateT  (fromStateT m) s ; {-# INLINE runStateT  #-}
 evalStateT m   = fmap fst . runStateT m        ; {-# INLINE evalStateT #-}

 instance Monad m => MonadState s (StateT s m) where
     get = StateT S.get   ; {-# INLINE get #-}
     put = StateT . S.put ; {-# INLINE put #-}
 }}}

 There are few non-obvious things to note here:
   1. This wrapper performs about **15 TIMES better** than
 `Control.Monad.State.Strict.StateT` (in the provided examples) and if we
 create a loop in pure code imitating a parser, this `StateT` gets
 completely optimized away, while the `mtl`'s version does not.

   2. If we replace the following functions with lazy composition, we get
 the same, high performance:
 {{{#!hs
 runStateT    = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT  #-}
 evalStateT m = fmap fst `dotl` runStateT m   ; {-# INLINE evalStateT #-}
 }}}

   3. However, if we slightly change the `evalStateT`, we've got the bad
 performance, equals to the `mtl`'s `StateT` version (15 times slower):
 {{{#!hs
 evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
 }}}

 It's a very strange result, especially that `evalStateT` is used only once
 in the code while running the tests.

 \\

 == Strict Either & EitherT

 The code contains a very minimalistic implementation of `Either` and
 `EitherT` in order to make their definitions and utils strict. These
 definitions are copy-pasted and simplified (removed unused code) from:
 https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
 https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Tr...

 \\

 == Strict Bool and tuple

 Moreover we define strict Bool `or` operation and 2-element tuple with
 strict arguments:

 {{{#!hs
 data T a b = T !a !b deriving (Generic, Show, Functor)

 data XBool = XTrue | XFalse deriving (Show, Generic)

 (|||) :: XBool -> XBool -> XBool
 (|||) !a !b = case a of
     XTrue  -> a
     XFalse -> b
 {-# INLINE (|||) #-}
 }}}

 \\

 == Parser implementation

 All the above declarations were simple helpers compiled with `-XStrict`,
 because available libraries do not provide them for us. This code is a
 "real" use case and shows the weird performance results.

 The parser implementation is simple:
 {{{#!hs
 newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
 XBool a) } deriving (Functor)

 instance Monad m => Applicative (FailParser m) where
     pure  = undefined
     (<*>) = undefined

 instance Monad m => Monad (FailParser m) where
     return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
     FailParser ma >>= f = FailParser $ do
         T !b  a  <- ma
         T !b' a' <- fromFailParser $ f a
         return $! T (b ||| b') a'
     {-# INLINE (>>=) #-}
     _ >> _ = undefined ; {-# INLINE (>>) #-}

 instance MonadTrans (FailParser) where
     lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
 }}}

 We use `undefined` for non-important functions. The parser is `EitherT`
 wrapper: Left happens when we failed parsing input, while Right otherwise.
 The `XBool` denotes if we made any progress (so after consuming a letter
 it is set to `XTrue`). There are some additional util functions, like
 `returnProgress` which behaves just like return, but also sets the `XBool`
 value to `XTrue`:

 {{{#!hs
 instance Monad m => MonadProgressParser (FailParser m) where
     returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
 returnProgress #-} -- In correct code it should be XTrue as described
 below.
 }}}

 In the provided code, there is `XFalse` used instead of `XTrue` because of
 some interesting observations:

 1. The `XBool` value is used ONLY in the expression `return $! T (b |||
 b') a'` above, so it does NOT affect the way the program logically
 executes it's body.

 2. Both `b` and `b'` are strict and fully evaluated.

 3. If both `b` and `b'` are `XFalse` (as in the provided code, they always
 are `XFalse`) we get a good performance. In order to test it, the above
 code contains `XFalse` instead of `XTrue`.

 4. If we use the correct version of `returnProgress` as described just
 before point 1 above, we get 15 times slower performance (the same or very
 similar to the one when used altered `evalStateT` definition). We could
 try to explain it: maybe Haskell was able to optimize code if it
 discovered, there always were only `XFalse` values used and after the
 change there are both `XTrue` and `XFalse`, so it really has to run the
 `(|||)` operator. This way of thinking fails as fast as we check that
 changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
 performance.

 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
 get good performance, while replacing it with `return $! T b a'` give us
 bad performance. It does not make any sense, because both `b` and `b'` are
 strict and fully evaluated. Moreover, it is the only place in code where
 they are used.

 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b' |||
 b) a'` does NOT change the performance (we are getting the good one).

 == Final notes

 We've been talking with some people - both in the company I'm working in
 as well as on IRC and we do not see any reason why this code behaves in
 this way and why it is so sensitive to the changes. In fact we started to
 be worried a lot about how we can use Haskell for high-performance parts
 at all if it is not obvious if a very simple changes do not affect
 performance so much, like changing `evalStateT m = fmap fst . runStateT m`
 to `evalStateT m a = fmap fst $ runStateT m a`, which gives 15 times
 slowdown. It makes the code both very fragile to any changes as well as
 makes it unmaintainable. Tracking performance in a very small program
 (like the attached one) is possible, while tracking it in bigger one,
 taking in considerating the described problems, make it almost impossible.
 I'm writing this because I'm worried about where these problems originate
 from and I would really like to solve them / know why they appear and be
 sure we can continue to use Haskell for our high-performance demanding
 applications.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14035
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

GHC

tags

participants (1)