
#14941: Switching direct type family application to EqPred (~) prevents
inlining in code using vector (10x slowdown)
-------------------------------------+-------------------------------------
        Reporter:  nh2               |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Description changed by nh2:

Old description:
{{{#!hs
selectVectorDestructive2
  :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm)
  => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m ()            -- slow

selectVectorDestructive2
  :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
  => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()  -- fast
}}}
These two functions are identical except that one carries `VG.Mutable v ~ vm` as an equality constraint, while the other applies `VG.Mutable v` directly in the argument type, to the right of the `=>`.
The second function is 10x faster.
I would expect them to be equally fast.
The bodies of the two functions are identical; I change only the type signature.
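For completeness, here is a stripped-down sketch of just the two signature shapes. This is not the ticket's code: the module name, `copyFirst*`, their bodies, and the imports are only illustrative assumptions to show the pattern in isolation.

{{{#!hs
{-# LANGUAGE TypeFamilies #-}

module TwoShapes where

import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Generic.Mutable as VGM

-- Equality-constraint shape: the mutable vector type is named `vm`
-- and tied to `v` via `VG.Mutable v ~ vm` (the "slow" shape above).
copyFirst1
  :: (PrimMonad m, VG.Vector v e, VG.Mutable v ~ vm)
  => v e -> vm (PrimState m) e -> Int -> m ()
copyFirst1 _ mv i = VGM.unsafeRead mv i >>= VGM.unsafeWrite mv 0

-- Direct type family application in the argument type
-- (the "fast" shape above); the body is identical.
copyFirst2
  :: (PrimMonad m, VG.Vector v e)
  => v e -> VG.Mutable v (PrimState m) e -> Int -> m ()
copyFirst2 _ mv i = VGM.unsafeRead mv i >>= VGM.unsafeWrite mv 0
}}}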
The slowness happens because, with the first signature, primitives like `unsafeRead` are not inlined, and as a consequence the `Int#`s get boxed back into `Int`s when calling `unsafeRead`.
In particular, in `-ddump-simpl`, the slow version has
{{{#!hs
$wpartitionLoop2_rgEy
  :: forall (m :: * -> *) (vm :: * -> * -> *) e.
     (PrimMonad m, MVector vm e, Ord e)
  => vm (PrimState m) e
     -> GHC.Prim.Int#
     -> e
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> m Int
$wpartitionLoop2_rgEy =
  \... ->
    let {
      endIndex_a7Ay :: Int
      ...
      endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
    ...
    (VGM.basicUnsafeRead ... endIndex_a7Ay)
    ...
}}}
while the fast version has
{{{#!hs
$w$spartitionLoop2_rgUN
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)

...

$spartitionLoop2_rgUP
  :: VG.Mutable VU.Vector (PrimState IO) Int
     -> Int -> Int -> Int -> Int -> Int
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
}}}
So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline + specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently inline `basicUnsafeRead` on unboxed `Int#`.
But with `VG.Mutable v ~ vm`, GHC keeps the abstract `vm (PrimState m) e` all the way through and keeps passing dictionaries. It therefore cannot inline `basicUnsafeRead`, and it re-boxes already-unboxed values (e.g. `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`) before passing them to the non-inlined call of `basicUnsafeRead`, making a tight loop allocate where it normally wouldn't.
Why might rewriting the type signature in such a trivial way make this happen?
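(As a sanity check on this theory, and not something I have verified against the reproducer: forcing the specialisation by hand at the instantiation the benchmark actually uses, e.g. with a SPECIALIZE pragma in the defining module, should recover the fast code for the `~` version too. The `VU`/`VUM` names below assume the usual qualified imports of `Data.Vector.Unboxed` and `Data.Vector.Unboxed.Mutable`.)

{{{#!hs
-- Hypothetical, untested: force a specialisation of the slow (~) variant
-- at the concrete types the benchmark uses (IO, unboxed Int vectors).
{-# SPECIALIZE selectVectorDestructive2
      :: Int
      -> VU.Vector Int
      -> VUM.MVector (PrimState IO) Int
      -> Int -> Int
      -> IO () #-}
}}}

If that pragma brings the `~` version up to the speed of the other one, it would confirm that the regression comes entirely from the lost specialisation rather than from the function bodies themselves.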
----
I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee.
Reproducer:
* https://github.com/nh2/haskell-quickselect-median-of-medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506
* Running instructions are [https://github.com/nh2/haskell-quickselect-median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full-laziness && ./Main +RTS -sstderr`
* For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of-medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis
* I originally just wanted to write a fast median-of-medians implementation at ZuriHac 2017, but got totally derailed by this performance problem. The version of it that I link here is a total mess because of me mauling it to track down this performance issue.
Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness-factor` did not change the situation.
New description:

The new description is identical to the old one above, except that the sentence listing the tested GHC versions now reads:

  I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit
  cc4677c36ee (edit: in which case I commented out the QuickCheck- and
  criterion-related stuff, because their deps don't build there yet).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14941#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler