Re: [GHC] #14035: Weird performance results.

27 Jul 2017

      #14035: Weird performance results.
-------------------------------------+-------------------------------------
        Reporter:  danilo2           |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by danilo2:

Old description:
...
Hi! I was recently testing performance of a critical code in a product we
are shipping and I'm getting really weird results. The first "part" of it
I've described here: https://ghc.haskell.org/trac/ghc/ticket/14013 (and I
will be referring to this issue later), however there are some other
problems that do not seem directly connected to 14013 ticket.
**The code is compiled with `-XStrict` enabled globally. The full source
code for this ticket is attached.**
The code is a pseudo-parser implementation. It consumes any char in a
loop and fails on empty input in the end.
Everything was compiled with following options (and many variations):
`"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
-funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.
\\
== Helpers
Let's define 2 helpers:
{{{#!hs
(.) :: (b -> c) -> (a -> b) -> a -> c
(.) f g = \x -> f (g x) ; {-# INLINE (.) #-}
dotl :: (b -> c) -> (a -> b) -> a -> c
dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
}}}
So whenever we see `.` in code it is strict in all of its arguments.
\\
== Strict StateT performance improvement
Let's consider following code:
{{{#!hs
import qualified Control.Monad.State.Strict as S
newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
(Applicative, Functor, Monad, MonadTrans)
class MonadState s m | m -> s where
    get :: m s
    put :: s -> m ()
runStateT  :: forall s m a.              StateT s m a -> s -> m (a, s)
evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
runStateT  m s = S.runStateT  (fromStateT m) s ; {-# INLINE runStateT
#-}
evalStateT m   = fmap fst . runStateT m        ; {-# INLINE evalStateT
#-}
instance Monad m => MonadState s (StateT s m) where
    get = StateT S.get   ; {-# INLINE get #-}
    put = StateT . S.put ; {-# INLINE put #-}
}}}
There are few non-obvious things to note here:
  1. This wrapper performs about **15 TIMES better** than
`Control.Monad.State.Strict.StateT` (in the provided examples) and if we
create a loop in pure code imitating a parser, this `StateT` gets
completely optimized away, while the `mtl`'s version does not.
2. If we replace the following functions with lazy composition, we get
the same, high performance:
{{{#!hs
runStateT    = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT  #-}
evalStateT m = fmap fst `dotl` runStateT m   ; {-# INLINE evalStateT #-}
}}}
3. However, if we slightly change the `evalStateT`, we've got the bad
performance, equals to the `mtl`'s `StateT` version (15 times slower):
{{{#!hs
evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
}}}
It's a very strange result, especially that `evalStateT` is used only
once in the code while running the tests.
\\
== Strict Either & EitherT
The code contains a very minimalistic implementation of `Either` and
`EitherT` in order to make their definitions and utils strict. These
definitions are copy-pasted and simplified (removed unused code) from:
https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Tr...
\\
== Strict Bool and tuple
Moreover we define strict Bool `or` operation and 2-element tuple with
strict arguments:
{{{#!hs
data T a b = T !a !b deriving (Generic, Show, Functor)
data XBool = XTrue | XFalse deriving (Show, Generic)
(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
    XTrue  -> a
    XFalse -> b
{-# INLINE (|||) #-}
}}}
\\
== Parser implementation
All the above declarations were simple helpers compiled with `-XStrict`,
because available libraries do not provide them for us. This code is a
"real" use case and shows the weird performance results.
The parser implementation is simple:
{{{#!hs
newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
XBool a) } deriving (Functor)
instance Monad m => Applicative (FailParser m) where
    pure  = undefined
    (<*>) = undefined
instance Monad m => Monad (FailParser m) where
    return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
    FailParser ma >>= f = FailParser $ do
        T !b  a  <- ma
        T !b' a' <- fromFailParser $ f a
        return $! T (b ||| b') a'
    {-# INLINE (>>=) #-}
    _ >> _ = undefined ; {-# INLINE (>>) #-}
instance MonadTrans (FailParser) where
    lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
}}}
We use `undefined` for non-important functions. The parser is `EitherT`
wrapper: Left happens when we failed parsing input, while Right
otherwise. The `XBool` denotes if we made any progress (so after
consuming a letter it is set to `XTrue`). There are some additional util
functions, like `returnProgress` which behaves just like return, but also
sets the `XBool` value to `XTrue`:
{{{#!hs
instance Monad m => MonadProgressParser (FailParser m) where
    returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
returnProgress #-} -- In correct code it should be XTrue as described
below.
}}}
In the provided code, there is `XFalse` used instead of `XTrue` because
of some interesting observations:
1. The `XBool` value is used ONLY in the expression `return $! T (b |||
b') a'` above, so it does NOT affect the way the program logically
executes it's body.
2. Both `b` and `b'` are strict and fully evaluated.
3. If both `b` and `b'` are `XFalse` (as in the provided code, they
always are `XFalse`) we get a good performance. In order to test it, the
above code contains `XFalse` instead of `XTrue`.
4. If we use the correct version of `returnProgress` as described just
before point 1 above, we get 15 times slower performance (the same or
very similar to the one when used altered `evalStateT` definition). We
could try to explain it: maybe Haskell was able to optimize code if it
discovered, there always were only `XFalse` values used and after the
change there are both `XTrue` and `XFalse`, so it really has to run the
`(|||)` operator. This way of thinking fails as fast as we check that
changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
performance.
5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
get good performance, while replacing it with `return $! T b a'` give us
bad performance. It does not make any sense, because both `b` and `b'`
are strict and fully evaluated. Moreover, it is the only place in code
where they are used.
6. However replacing `return $! T (b ||| b') a'` with `return $! T (b'
||| b) a'` does NOT change the performance (we are getting the good one).

...
== Final notes
We've been talking with some people - both in the company I'm working in
as well as on IRC and we do not see any reason why this code behaves in
this way and why it is so sensitive to the changes. In fact we started to
be worried a lot about how we can use Haskell for high-performance parts
at all if it is not obvious if a very simple changes do not affect
performance so much, like changing `evalStateT m = fmap fst . runStateT
m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives 15 times
slowdown. It makes the code both very fragile to any changes as well as
makes it unmaintainable. Tracking performance in a very small program
(like the attached one) is possible, while tracking it in bigger one,
taking in considerating the described problems, make it almost
impossible.  I'm writing this because I'm worried about where these
problems originate from and I would really like to solve them / know why
they appear and be sure we can continue to use Haskell for our high-
performance demanding applications.
New description:

 Hi! I was recently testing performance of a critical code in a product we
 are shipping and I'm getting really weird results.

 **The code is compiled with `-XStrict` enabled globally. The full source
 code for this ticket is attached.**
 The code is a pseudo-parser implementation. It consumes any char in a loop
 and fails on empty input in the end.

 Everything was compiled with following options (and many variations):
 `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
 factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.

 \\

 == Helpers

 Let's define 2 helpers:
 {{{#!hs
 (.) :: (b -> c) -> (a -> b) -> a -> c
 (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}

 dotl :: (b -> c) -> (a -> b) -> a -> c
 dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
 }}}

 So whenever we see `.` in code it is strict in all of its arguments.

 \\

 == Strict StateT performance improvement

 Let's consider following code:
 {{{#!hs
 import qualified Control.Monad.State.Strict as S

 newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
 (Applicative, Functor, Monad, MonadTrans)

 class MonadState s m | m -> s where
     get :: m s
     put :: s -> m ()

 runStateT  :: forall s m a.              StateT s m a -> s -> m (a, s)
 evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
 runStateT  m s = S.runStateT  (fromStateT m) s ; {-# INLINE runStateT  #-}
 evalStateT m   = fmap fst . runStateT m        ; {-# INLINE evalStateT #-}

 instance Monad m => MonadState s (StateT s m) where
     get = StateT S.get   ; {-# INLINE get #-}
     put = StateT . S.put ; {-# INLINE put #-}
 }}}

 There are few non-obvious things to note here:
   1. This wrapper performs about **15 TIMES better** than
 `Control.Monad.State.Strict.StateT` (in the provided examples) and if we
 create a loop in pure code imitating a parser, this `StateT` gets
 completely optimized away, while the `mtl`'s version does not.

   2. If we replace the following functions with lazy composition, we get
 the same, high performance:
 {{{#!hs
 runStateT    = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT  #-}
 evalStateT m = fmap fst `dotl` runStateT m   ; {-# INLINE evalStateT #-}
 }}}

   3. However, if we slightly change the `evalStateT`, we've got the bad
 performance, equals to the `mtl`'s `StateT` version (15 times slower):
 {{{#!hs
 evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
 }}}

 It's a very strange result, especially that `evalStateT` is used only once
 in the code while running the tests.

 \\

 == Strict Either & EitherT

 The code contains a very minimalistic implementation of `Either` and
 `EitherT` in order to make their definitions and utils strict. These
 definitions are copy-pasted and simplified (removed unused code) from:
 https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
 https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Tr...

 \\

 == Strict Bool and tuple

 Moreover we define strict Bool `or` operation and 2-element tuple with
 strict arguments:

 {{{#!hs
 data T a b = T !a !b deriving (Generic, Show, Functor)

 data XBool = XTrue | XFalse deriving (Show, Generic)

 (|||) :: XBool -> XBool -> XBool
 (|||) !a !b = case a of
     XTrue  -> a
     XFalse -> b
 {-# INLINE (|||) #-}
 }}}

 \\

 == Parser implementation

 All the above declarations were simple helpers compiled with `-XStrict`,
 because available libraries do not provide them for us. This code is a
 "real" use case and shows the weird performance results.

 The parser implementation is simple:
 {{{#!hs
 newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
 XBool a) } deriving (Functor)

 instance Monad m => Applicative (FailParser m) where
     pure  = undefined
     (<*>) = undefined

 instance Monad m => Monad (FailParser m) where
     return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
     FailParser ma >>= f = FailParser $ do
         T !b  a  <- ma
         T !b' a' <- fromFailParser $ f a
         return $! T (b ||| b') a'
     {-# INLINE (>>=) #-}
     _ >> _ = undefined ; {-# INLINE (>>) #-}

 instance MonadTrans (FailParser) where
     lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
 }}}

 We use `undefined` for non-important functions. The parser is `EitherT`
 wrapper: Left happens when we failed parsing input, while Right otherwise.
 The `XBool` denotes if we made any progress (so after consuming a letter
 it is set to `XTrue`). There are some additional util functions, like
 `returnProgress` which behaves just like return, but also sets the `XBool`
 value to `XTrue`:

 {{{#!hs
 instance Monad m => MonadProgressParser (FailParser m) where
     returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
 returnProgress #-} -- In correct code it should be XTrue as described
 below.
 }}}

 In the provided code, there is `XFalse` used instead of `XTrue` because of
 some interesting observations:

 1. The `XBool` value is used ONLY in the expression `return $! T (b |||
 b') a'` above, so it does NOT affect the way the program logically
 executes it's body.

 2. Both `b` and `b'` are strict and fully evaluated.

 3. If both `b` and `b'` are `XFalse` (as in the provided code, they always
 are `XFalse`) we get a good performance. In order to test it, the above
 code contains `XFalse` instead of `XTrue`.

 4. If we use the correct version of `returnProgress` as described just
 before point 1 above, we get 15 times slower performance (the same or very
 similar to the one when used altered `evalStateT` definition). We could
 try to explain it: maybe Haskell was able to optimize code if it
 discovered, there always were only `XFalse` values used and after the
 change there are both `XTrue` and `XFalse`, so it really has to run the
 `(|||)` operator. This way of thinking fails as fast as we check that
 changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
 performance.

 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
 get good performance, while replacing it with `return $! T b a'` give us
 bad performance. It does not make any sense, because both `b` and `b'` are
 strict and fully evaluated. Moreover, it is the only place in code where
 they are used.

 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b' |||
 b) a'` does NOT change the performance (we are getting the good one).

 == Final notes

 We've been talking with some people - both in the company I'm working in
 as well as on IRC and we do not see any reason why this code behaves in
 this way and why it is so sensitive to the changes. In fact we started to
 be worried a lot about how we can use Haskell for high-performance parts
 at all if it is not obvious if a very simple changes do not affect
 performance so much, like changing `evalStateT m = fmap fst . runStateT m`
 to `evalStateT m a = fmap fst $ runStateT m a`, which gives 15 times
 slowdown. It makes the code both very fragile to any changes as well as
 makes it unmaintainable. Tracking performance in a very small program
 (like the attached one) is possible, while tracking it in bigger one,
 taking in considerating the described problems, make it almost impossible.
 I'm writing this because I'm worried about where these problems originate
 from and I would really like to solve them / know why they appear and be
 sure we can continue to use Haskell for our high-performance demanding
 applications.

--

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14035#comment:2
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler