
#14013: Bad monads performance -------------------------------------+------------------------------------- Reporter: danilo2 | Owner: (none) Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 8.2.1-rc3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by danilo2: Old description:
Hi! We've been struggling with a very strange GHC behavior on IRC today. Let's consider the following code (needs mtl and criterion to be compiled):
{{{ module Main where
import Prelude import Criterion.Main import qualified Control.Monad.State.Strict as Strict import qualified Control.Monad.State.Class as State import Control.DeepSeq (NFData, rnf, force) import GHC.IO (evaluate) import Data.Monoid
----------------------------- -- === Criterion utils === -- -----------------------------
eval :: NFData a => a -> IO a eval = evaluate . force ; {-# INLINE eval #-}
liftExp :: (Int -> a) -> (Int -> a) liftExp f = f . (10^) ; {-# INLINE liftExp #-}
expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a) expCodeGen f i = do putStrLn $ "generating input code (10e" <> show i <> " chars)" out <- eval $ liftExp f i putStrLn "code generated sucessfully" return out {-# INLINE expCodeGen #-}
expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int -> Benchmark expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) . nf p ; {-# INLINE expCodeGenBench #-}
------------------------------- -- === (a*) list parsing === -- -------------------------------
genList_a :: Int -> [Char] genList_a i = replicate i 'a' ; {-# INLINE genList_a #-}
pureListParser_a :: [Char] -> Bool pureListParser_a = \case 'a':s -> pureListParser_a s [] -> True _ -> False {-# INLINE pureListParser_a #-}
mtlStateListParser_a :: State.MonadState [Char] m => m Bool mtlStateListParser_a = State.get >>= \case 'a':s -> State.put s >> mtlStateListParser_a [] -> return True _ -> return False {-# INLINE mtlStateListParser_a #-}
mtlStateListParser_a_typed :: Strict.State [Char] Bool mtlStateListParser_a_typed = State.get >>= \case 'a':s -> State.put s >> mtlStateListParser_a_typed [] -> return True _ -> return False {-# INLINE mtlStateListParser_a_typed #-}
mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool mtlStateListParser_a_let = go where go = Strict.get >>= \case 'a':s -> Strict.put s >> go [] -> return True _ -> return False {-# INLINE mtlStateListParser_a_let #-}
{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-} {-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}
main = do defaultMain [ bgroup "a*" $ [ bgroup "pure" $ expCodeGenBench genList_a pureListParser_a <$> [6..6] , bgroup "mtl.State.Strict" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a) <$> [6..6] , bgroup "mtl.State.Strict typed" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_typed) <$> [6..6] , bgroup "mtl.State.Strict let" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_let) <$> [6..6] ] ]
}}}
The code was compiled with following options (and many other variations): `-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick- factor=1000 -flate-dmd-anal`
Everything in this code has `INLINE` pragma. The important part we should focus on are these two functions:
{{{
pureListParser_a :: [Char] -> Bool pureListParser_a = \case 'a':s -> pureListParser_a s [] -> True _ -> False {-# INLINE pureListParser_a #-}
mtlStateListParser_a :: State.MonadState [Char] m => m Bool mtlStateListParser_a = State.get >>= \case 'a':s -> State.put s >> mtlStateListParser_a [] -> return True _ -> return False {-# INLINE mtlStateListParser_a #-} }}}
Which are just "parsers" accepting strings containing only 'a' characters. The former is pure one, while the later uses `State` to keep the remaining input. The following list contains performance related observations:
0. For the rest of the points, let's call the performance of `pureListParser_a` a "good" one and everything worse a "bad" one.
1. The performance of `mtlStateListParser_a` is bad, it runs 10 times slower than `pureListParser_a`. Inspecting CORE we can observe that GHC jumps between `(# a,b #)` and `(a,b)` representations all the time.
2. If we add a specialize pragma `{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}`, the performance of `mtlStateListParser_a` is good (exactly the same as `pureListParser_a`).
3. If we do NOT use specialize pragma, but we use explicite, non- polymorphic type signature `mtlStateListParser_a_typed :: Strict.State [Char] Bool`, the performance is bad (!), identical to the polymorphic version without specialization.
4. If we use SPECIALIZE pragma together with explicite, non-polymorphic type, so we use BOTH `mtlStateListParser_a_typed :: Strict.State [Char] Bool` AND `{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}` we get the good performance.
5. If we transform `pureListParser_a` to
{{{ mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool mtlStateListParser_a_let = go where go = Strict.get >>= \case 'a':s -> Strict.put s >> go [] -> return True _ -> return False {-# INLINE mtlStateListParser_a_let #-} }}}
we again get the good performance without the need to use `SPECIALIZE` pragmas.
6. The performance of all the functions that are not optimized as good as `pureListParser_a` is a lot worse in GHC 8.2.1-rc3 than in 8.0.2.
The above points raise the following questions:
1. Why GHC does not optimize `mtlStateListParser_a` the same way as `pureListParser_a` and where the jumping between `(# a,b #)` and `(a,b)` comes from?
2. Is there any way to tell GHC to automatically insert `SPECIALIZE` pragmas, especially in performance critical code?
3. Why providing very-explicite type signature `mtlStateListParser_a_typed :: Strict.State [Char] Bool` does not solve the problem unless we use `SPECIALIZE` pragma that tells the same as the signature?
4. Why the trick to alias the body of recursive function to a local variable `go` affects the performance in any way, especially when it does NOT bring any variable to the local let scope?
We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3 and several people reported exactly the same observations.
New description: Hi! We've been struggling with a very strange GHC behavior on IRC today. Let's consider the following code (needs mtl and criterion to be compiled): {{{ module Main where import Prelude import Criterion.Main import qualified Control.Monad.State.Strict as Strict import qualified Control.Monad.State.Class as State import Control.DeepSeq (NFData, rnf, force) import GHC.IO (evaluate) import Data.Monoid ----------------------------- -- === Criterion utils === -- ----------------------------- eval :: NFData a => a -> IO a eval = evaluate . force ; {-# INLINE eval #-} liftExp :: (Int -> a) -> (Int -> a) liftExp f = f . (10^) ; {-# INLINE liftExp #-} expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a) expCodeGen f i = do putStrLn $ "generating input code (10e" <> show i <> " chars)" out <- eval $ liftExp f i putStrLn "code generated sucessfully" return out {-# INLINE expCodeGen #-} expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int -> Benchmark expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) . nf p ; {-# INLINE expCodeGenBench #-} ------------------------------- -- === (a*) list parsing === -- ------------------------------- genList_a :: Int -> [Char] genList_a i = replicate i 'a' ; {-# INLINE genList_a #-} pureListParser_a :: [Char] -> Bool pureListParser_a = \case 'a':s -> pureListParser_a s [] -> True _ -> False {-# INLINE pureListParser_a #-} mtlStateListParser_a :: State.MonadState [Char] m => m Bool mtlStateListParser_a = State.get >>= \case 'a':s -> State.put s >> mtlStateListParser_a [] -> return True _ -> return False {-# INLINE mtlStateListParser_a #-} mtlStateListParser_a_typed :: Strict.State [Char] Bool mtlStateListParser_a_typed = State.get >>= \case 'a':s -> State.put s >> mtlStateListParser_a_typed [] -> return True _ -> return False {-# INLINE mtlStateListParser_a_typed #-} mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool mtlStateListParser_a_let = go where go = Strict.get >>= \case 'a':s -> Strict.put s >> go [] -> return True _ -> return False {-# INLINE mtlStateListParser_a_let #-} {-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-} {-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-} main = do defaultMain [ bgroup "a*" $ [ bgroup "pure" $ expCodeGenBench genList_a pureListParser_a <$> [6..6] , bgroup "mtl.State.Strict" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a) <$> [6..6] , bgroup "mtl.State.Strict typed" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_typed) <$> [6..6] , bgroup "mtl.State.Strict let" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_let) <$> [6..6] ] ] }}} The code was compiled with following options (and many other variations): `-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick- factor=1000 -flate-dmd-anal` Everything in this code has `INLINE` pragma. The important part we should focus on are these two functions: {{{ pureListParser_a :: [Char] -> Bool pureListParser_a = \case 'a':s -> pureListParser_a s [] -> True _ -> False {-# INLINE pureListParser_a #-} mtlStateListParser_a :: State.MonadState [Char] m => m Bool mtlStateListParser_a = State.get >>= \case 'a':s -> State.put s >> mtlStateListParser_a [] -> return True _ -> return False {-# INLINE mtlStateListParser_a #-} }}} Which are just "parsers" accepting strings containing only 'a' characters. The former is pure one, while the later uses `State` to keep the remaining input. The following list contains performance related observations: 0. For the rest of the points, let's call the performance of `pureListParser_a` a "good" one and everything worse a "bad" one. 1. The performance of `mtlStateListParser_a` is bad, it runs 10 times slower than `pureListParser_a`. Inspecting CORE we can observe that GHC jumps between `(# a,b #)` and `(a,b)` representations all the time. 2. If we add a specialize pragma `{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}`, the performance of `mtlStateListParser_a` is good (exactly the same as `pureListParser_a`). 3. If we do NOT use specialize pragma, but we use explicite, non- polymorphic type signature `mtlStateListParser_a_typed :: Strict.State [Char] Bool`, the performance is bad (!), identical to the polymorphic version without specialization. 4. If we use SPECIALIZE pragma together with explicite, non-polymorphic type, so we use BOTH `mtlStateListParser_a_typed :: Strict.State [Char] Bool` AND `{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}` we get the good performance. 5. If we transform `pureListParser_a` to {{{ mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool mtlStateListParser_a_let = go where go = Strict.get >>= \case 'a':s -> Strict.put s >> go [] -> return True _ -> return False {-# INLINE mtlStateListParser_a_let #-} }}} we again get the good performance without the need to use `SPECIALIZE` pragmas. 6. The performance of all the functions that are not optimized as good as `pureListParser_a` is a lot worse in GHC 8.2.1-rc3 than in 8.0.2. 7. The not-yet documented flag `-fspecialise-aggressively` does NOT affect the results. The above points raise the following questions: 1. Why GHC does not optimize `mtlStateListParser_a` the same way as `pureListParser_a` and where the jumping between `(# a,b #)` and `(a,b)` comes from? 2. Is there any way to tell GHC to automatically insert `SPECIALIZE` pragmas, especially in performance critical code? 3. Why providing very-explicite type signature `mtlStateListParser_a_typed :: Strict.State [Char] Bool` does not solve the problem unless we use `SPECIALIZE` pragma that tells the same as the signature? 4. Why the trick to alias the body of recursive function to a local variable `go` affects the performance in any way, especially when it does NOT bring any variable to the local let scope? We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3 and several people reported exactly the same observations. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14013#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler