[GHC] #14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- {{{#!hs selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm) => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m () -- slow selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e) => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m () -- fast }}} These two functions are identical except one has `VG.Mutable v ~ vm` as a constraint, the other one has it in the type signature right of the `=>`. The second function is 10x faster. I would expect them to be equally fast. The code of the functions is identical, I change only the type declaration. The slowness happens because with the first function, inlining of primitives like `unsafeRead` does not happen, and thus also it boxes the `Int#`s back to `Int`s when calling `unsafeRead`. In particular, in `-ddump-simpl`, the slow version has {{{#!hs $wpartitionLoop2_rgEy :: forall (m :: * -> *) (vm :: * -> * -> *) e. (PrimMonad m, MVector vm e, Ord e) => vm (PrimState m) e -> GHC.Prim.Int# -> e -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> m Int $wpartitionLoop2_rgEy = \... -> let { endIndex_a7Ay :: Int ... endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in ... (VGM.basicUnsafeRead ... endIndex_a7Ay) ... }}} while the fast version has {{{#!hs $w$spartitionLoop2_rgUN :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #) ... $spartitionLoop2_rgUP :: VG.Mutable VU.Vector (PrimState IO) Int -> Int -> Int -> Int -> Int -> Int -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #) }}} So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline + specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently inline `basicUnsafeRead` on unboxed `Int#`. But with `VG.Mutable v ~ vm`, ghc keeps `vm (PrimState m) e` all the way, passes around dictionaries, thus eventually cannot inline `basicUnsafeRead` and re-packs already unboxed values, like `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`, before passing them into the non-inlined call of `basicUnsafeRead`, thus making a tight loop allocate that normally wouldn't allocate. Why might rewriting the type signature in such a trivial way make this happen? ---- I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee. Reproducer: * https://github.com/nh2/haskell-quickselect-median-of- medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506 * Running instructions are [https://github.com/nh2/haskell-quickselect- median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full- laziness && ./Main +RTS -sstderr` * For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of- medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis * I originally just wanted to write a fast median-of-medians implementation on ZuriHac 2017, but got totally derailed by this performance problem. The version of it that I link here is a total mess because of me mauling it to track down this performance issue. Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness- factor` did not change the situation. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I can reproduce this but I had to delete the criterion dependency in the cabal file. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by nh2: Old description:
{{{#!hs selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm) => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m () -- slow
selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e) => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m () -- fast }}}
These two functions are identical except one has `VG.Mutable v ~ vm` as a constraint, the other one has it in the type signature right of the `=>`.
The second function is 10x faster.
I would expect them to be equally fast.
The code of the functions is identical, I change only the type declaration.
The slowness happens because with the first function, inlining of primitives like `unsafeRead` does not happen, and thus also it boxes the `Int#`s back to `Int`s when calling `unsafeRead`.
In particular, in `-ddump-simpl`, the slow version has
{{{#!hs $wpartitionLoop2_rgEy :: forall (m :: * -> *) (vm :: * -> * -> *) e. (PrimMonad m, MVector vm e, Ord e) => vm (PrimState m) e -> GHC.Prim.Int# -> e -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> m Int $wpartitionLoop2_rgEy = \... -> let { endIndex_a7Ay :: Int ... endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in ... (VGM.basicUnsafeRead ... endIndex_a7Ay) ... }}}
while the fast version has
{{{#!hs $w$spartitionLoop2_rgUN :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
...
$spartitionLoop2_rgUP :: VG.Mutable VU.Vector (PrimState IO) Int -> Int -> Int -> Int -> Int -> Int -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #) }}}
So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline + specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently inline `basicUnsafeRead` on unboxed `Int#`.
But with `VG.Mutable v ~ vm`, ghc keeps `vm (PrimState m) e` all the way, passes around dictionaries, thus eventually cannot inline `basicUnsafeRead` and re-packs already unboxed values, like `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`, before passing them into the non-inlined call of `basicUnsafeRead`, thus making a tight loop allocate that normally wouldn't allocate.
Why might rewriting the type signature in such a trivial way make this happen?
----
I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee.
Reproducer:
* https://github.com/nh2/haskell-quickselect-median-of- medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506 * Running instructions are [https://github.com/nh2/haskell-quickselect- median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno- full-laziness && ./Main +RTS -sstderr` * For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of- medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis * I originally just wanted to write a fast median-of-medians implementation on ZuriHac 2017, but got totally derailed by this performance problem. The version of it that I link here is a total mess because of me mauling it to track down this performance issue.
Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness- factor` did not change the situation.
New description: {{{#!hs selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm) => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m () -- slow selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e) => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m () -- fast }}} These two functions are identical except one has `VG.Mutable v ~ vm` as a constraint, the other one has it in the type signature right of the `=>`. The second function is 10x faster. I would expect them to be equally fast. The code of the functions is identical, I change only the type declaration. The slowness happens because with the first function, inlining of primitives like `unsafeRead` does not happen, and thus also it boxes the `Int#`s back to `Int`s when calling `unsafeRead`. In particular, in `-ddump-simpl`, the slow version has {{{#!hs $wpartitionLoop2_rgEy :: forall (m :: * -> *) (vm :: * -> * -> *) e. (PrimMonad m, MVector vm e, Ord e) => vm (PrimState m) e -> GHC.Prim.Int# -> e -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> m Int $wpartitionLoop2_rgEy = \... -> let { endIndex_a7Ay :: Int ... endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in ... (VGM.basicUnsafeRead ... endIndex_a7Ay) ... }}} while the fast version has {{{#!hs $w$spartitionLoop2_rgUN :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #) ... $spartitionLoop2_rgUP :: VG.Mutable VU.Vector (PrimState IO) Int -> Int -> Int -> Int -> Int -> Int -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #) }}} So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline + specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently inline `basicUnsafeRead` on unboxed `Int#`. But with `VG.Mutable v ~ vm`, ghc keeps `vm (PrimState m) e` all the way, passes around dictionaries, thus eventually cannot inline `basicUnsafeRead` and re-packs already unboxed values, like `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`, before passing them into the non-inlined call of `basicUnsafeRead`, thus making a tight loop allocate that normally wouldn't allocate. Why might rewriting the type signature in such a trivial way make this happen? ---- I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee (edit: in which case I commented out the quickcheck and criterion related stuff because their deps don't build there yet). Reproducer: * https://github.com/nh2/haskell-quickselect-median-of- medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506 * Running instructions are [https://github.com/nh2/haskell-quickselect- median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full- laziness && ./Main +RTS -sstderr` * For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of- medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis * I originally just wanted to write a fast median-of-medians implementation on ZuriHac 2017, but got totally derailed by this performance problem. The version of it that I link here is a total mess because of me mauling it to track down this performance issue. Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness- factor` did not change the situation. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by mpickering): I copied the core of `selectVectorDestructive2` from the good and bad examples and diffed them. They looked basically the same apart from lots and lots of code like the following: In bad: {{{ 1# -> case $wlvl_rnHf @ VUM.MVector @ Int @ IO Data.Vector.Unboxed.Base.$fMVectorMVectorInt ((Data.Vector.Primitive.Mutable.MVector @ ghc- prim-0.5.2.0:GHC.Prim.RealWorld @ Int ww1_smP2 ww2_smP3 ww3_smP4) `cast` (Sym (Data.Vector.Unboxed.Base.N:R:MVectorsInt[0] <ghc- prim-0.5.2.0:GHC.Prim.RealWorld>_N) ; Sym (Data.Vector.Unboxed.Base.DR:MVectorsInt0[0] (Control.Monad.Primitive.D:R:PrimStateIO[0])) :: (Data.Vector.Primitive.Mutable.MVector ghc- prim-0.5.2.0:GHC.Prim.RealWorld Int :: *) ~R# (VUM.MVector (PrimState IO) Int :: *))) ww4_smP8 of wild_00 { } }}} In good: {{{ 1# -> case $wlvl_rnRh ww2_smWB ww3_smWC ww4_smWG of wild_00 { } }}} So it looks like on every call the `MVector` is being constructed rather than created one and shared? Anyone know immediately why this is such a problem? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): This seems to be unrelated to type families and only a problem with `~` in general, because: I now also found a case where using `(val ~ Int) =>` produces slower code than subsituting `val` with `Int` syntactically, or using `SPECIALISE`. {{{ myfun :: forall val . (Eq val, Show val, val ~ Int) => Mytype1 val -> Mytype2 val -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO () }}} is slow, but {{{ {-# SPECIALIZE myfun :: Mytype1 Int -> Mytype2 Int -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO () #-} myfun :: forall val . (Eq val, Show val, val ~ Int) => Mytype1 val -> Mytype2 val -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO () }}} is fast (in my case, a 30% performance difference). Diffing the output of `-ddump-simpl -ddump-to-file -dsuppress-idinfo -dsuppress-coercions -dsuppress-module-prefixes -dsuppress-uniques` shows how the call looks different from a module that calls `myfun`: `val ~ Int` version: {{{ $d~~ :: (Int :: *) ~~ (Int :: *) $d~~ = Eq# @ * @ * @ Int @ Int @~ ... ... $wmyfun @ Int $fEqInt ($d~~ `cast` ...) y wild mything1 ipv7 mything2 ww1 w9) ... }}} `SPECIALISE`d `val ~ Int` version: {{{ myfun y wild mything1 ipv7 mything2 ww1 }}} I would have expect that after the simplifier has run, GHC would have used all information it has to produce the best core (e.g. use the fact that it knows `val ~ Int` even without `SPECIALISE`), but it did not. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): I think I see. Can you offer a small example? I think something like this is happening. We have {{{ ...(\(g :: Int ~ val). ...f (d |> Num g)... ) ... where d :: Num Int }}} Now, we can't float that call up to the definition of `f` because it mentions `g`. But we could in principle specialise `f` for `Num Int`, and then use that specialised version at the call. I'm not certain this is exactly what is happening for you. Hence wanting a test case. (You don't need to exhibit worse perf; just that a specialisation is created and used without the equality, but not with.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: davide Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by davide): * owner: (none) => davide -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in
code using vector (10x slowdown)
-------------------------------------+-------------------------------------
Reporter: nh2 | Owner: davide
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 8.2.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by davide):
I've created a simple case where this happens:
{{{#!haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}
module F (f) where
f :: Int -> Int -- FAST
-- f :: forall a. (a ~ Int) => a -> a -- SLOW
f x = x + x
{-# NOINLINE f #-}
}}}
Compiling with:
{{{
$ ghc-8.4.3 -O -ddump-simpl -dsuppress-coercions F.hs
}}}
I get this core (note worker-wrapper transformation):
{{{
-- RHS size: {terms: 4, types: 1, coercions: 0, joins: 0/0}
F.$wf [InlPrag=NOINLINE] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=]
F.$wf = \ (ww_s1ay :: GHC.Prim.Int#) -> GHC.Prim.+# ww_s1ay ww_s1ay
-- RHS size: {terms: 10, types: 4, coercions: 0, joins: 0/0}
f [InlPrag=NOUSERINLINE[0]] :: Int -> Int
[GblId,
Arity=1,
Caf=NoCafRefs,
Str=m,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
Tmpl= \ (w_s1av [Occ=Once!] :: Int) ->
case w_s1av of { GHC.Types.I# ww1_s1ay [Occ=Once] ->
case F.$wf ww1_s1ay of ww2_s1aC { __DEFAULT ->
GHC.Types.I# ww2_s1aC
}
}}]
f = \ (w_s1av :: Int) ->
case w_s1av of { GHC.Types.I# ww1_s1ay ->
case F.$wf ww1_s1ay of ww2_s1aC { __DEFAULT ->
GHC.Types.I# ww2_s1aC
}
}
}}}
Swapping to `f :: forall a. (a ~ Int) => a -> a` gives:
{{{
-- RHS size: {terms: 10, types: 21, coercions: 12, joins: 0/0}
f [InlPrag=NOINLINE] :: forall a. ((a :: *) ~ (Int :: *)) => a -> a
[GblId, Arity=2, Caf=NoCafRefs, Str=m]
f = \ (@ a_a13U)
($d~_a13W :: (a_a13U :: *) ~ (Int :: *))
(eta_B1 :: a_a13U) ->
case GHC.Types.HEq_sc
@ * @ * @ a_a13U @ Int ($d~_a13W `cast` Co:5)
of co_a14c
{ __DEFAULT ->
(GHC.Num.$fNumInt_$c+
(eta_B1 `cast` Co:2) (eta_B1 `cast` Co:2))
`cast` Co:3
}
}}}
I ran a benchmark to confirm the performance difference:
{{{
benchmarking Int -> Int
time 12.47 ns (12.43 ns .. 12.55 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 12.53 ns (12.48 ns .. 12.59 ns)
std dev 173.4 ps (140.2 ps .. 239.2 ps)
variance introduced by outliers: 17% (moderately inflated)
benchmarking (a ~ Int) => a -> a
time 15.72 ns (15.69 ns .. 15.76 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 15.80 ns (15.74 ns .. 16.01 ns)
std dev 327.8 ps (135.0 ps .. 691.2 ps)
variance introduced by outliers: 32% (moderately inflated)
}}}
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:7
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: davide Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by davide): * Attachment "simple_case.zip" added. Simple case and benchmark. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: davide Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by davide): == Regarding simple example `f :: forall a. (a ~ Int) => a -> a`, the difference in performance is somewhat expected. This may be a different issue than the example given in the ticket description. In short, `a ~ Int` is a proof that type `a` is equal to type `Int`. In core, `a ~ Int` is a regular ''boxed'' GADT meaning that it could be bottom i.e. an invalid prove (this is the main mechanism behind [https://downloads.haskell.org/~ghc/8.0.2/docs/html/users_guide/glasgow_exts.... #deferring-type-errors-to-runtime -fdefer-type-errors]). Unboxing `a ~ b` at corresponds to checking the proof which is required to coerce the input binding from `a` to `Int`. Normally the `(a ~ Int)` would be optimized away (as described [http://dreixel.net/research/pdf/epdtecp.pdf here] in section 7.3), but that requires a worker wrapper transformation that never happens. Removing `NOINLINE` allows `f` to be optimized across modules, which closes the performance gap. == Regarding original example Unlike my simple example, all the code is in one module, so I expect the equality proof `VG.Mutable v ~ vm` to be optimized away (again see [http://dreixel.net/research/pdf/epdtecp.pdf here] section 7.3). With ghc 3.2.2, when compiling the slow version, I see `selectVectorDestructive2` is specialized to `$sselectVectorDestructive2 :: Int -> Vector Int -> MVector (PrimState IO) Int -> Int -> Int -> IO ()` (pass 2). This is good, but for some reason myread and partitionLoop2 are not specialized even though they are used by `$sselectVectorDestructive2`: {{{#!haskell $sselectVectorDestructive2 = ... let $dMVector = Data.Vector.Generic.Base.$p1Vector @Vector @Int Data.Vector.Unboxed.Base.$fVectorVectorInt in ... (Main.myread @IO @MVector @Int Control.Monad.Primitive.$fPrimMonadIO $dMVector GHC.Classes.$fOrdInt GHC.Show.$fShowInt v begin) ... (Main.partitionLoop2 @IO @MVector @Int Control.Monad.Primitive.$fPrimMonadIO $dMVector GHC.Classes.$fOrdInt GHC.Show.$fShowInt v begin pivot (GHC.Types.I# ...) }}} In the fast version, myread and partitionLoop2 are specialized in this pass. I noticed 2 other differences: * fast version floats `$dMVector` to a top level binding. * fast version specializes to `Mutable Vector (PrimState IO) Int` instead of `MVector (PrimState IO) Int`. Note `Mutable` is a type family and `Mutable Vector = MVector` -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: davide Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Thanks for the simpler example. I think I now understand what is going on. It's all to do with strictness analysis. Consider this function, and suppose it is strict in x: {{{ f1 :: Int -> blah f1 x = ...(case x of I# y -> ...)... }}} The strictness analysis sees that `f1` is strict in `x` and we get a w/w split {{{ f1 :: Int -> blah f1 x = case x of I# y -> $wf y $wf1 :: Int# -> blah $wf1 y = let x = I# y in ...original body of f... }}} Now suppose that we have {{{ type family F a type instance F Bool = Int f2 :: F Bool -> blah f2 x = ...same as before... }}} In fact the strictness analysis still sees that `f2` is strict in `x`, and worker wrapper works too -- all by using the family instances. We get this {{{ f2 :: F Bool -> blah f2 x = case (x |> g1) of I# y -> $wf2 y $wf2 :: Int# -> blah $wf2 y = let x = (I# y) |> sym g in ..as before... }}} Here `g` is a coercion `g :: F Bool ~ Int`, constructed from the family instances. This coersionn is generated by `topNormaliseType_maybe`, itself called from `deepSplitCprType_maybe` in `WwLib`, during worker/wrapper generation. But it's harder if the coercion is purely local. Let's start with this (yes I know you can't write `(~#)` in source Haskell but imagine this is Core: {{{ f3 :: forall a. (a ~# Int) => a -> blah f3 a (g :: (a ~# Int) (x :: a) = ...as before... }}} What we'd like is this: {{{ f3 :: forall a. (a ~# Int) => a -> blah f3 a (g :: (a ~# Int) (x :: a) = case (x |> g) of I# y -> $wf3 a g y $wf3 :: forall a. (a ~# Int) => Int# -> blah $wf3 a g y = let x = (I# y) |> sym g in ...as before... }}} but alas neither the demand analyser, nor the worker/wrapper generator are clever enough to do this. We need a kind of `normaliseType` that can take some local "givens" as well as the top-level family instance envs. This has the same ring as something Ryan wanted in the coverage checker. It's not obvious exactly what "should work". Eg what about `(Int ~# a)`, or even `([Int] ~# [a])`? Finally, there is the question of `(~)` vs `(~#)`. I think we are very aggressive about unboxing `(~)` constraints, so I'd like this: {{{ f4 :: forall a. (a ~ Int) => a -> blah f4 a (g :: (a ~ Int) (x :: a) = case g of MkTwiddle (g2 :: a ~# Int) -> case (x |> g2) of I# y -> $wf4 a g2 y $wf4 :: forall a. (a ~# Int) => Int# -> blah $wf4 a g2 y = let x = (I# y) |> sym g2 g = MkTwiddle g2 in ...as before... }}} Making all this happen will take some work though. How important is it. I'm tempted to say "don't write types like that"! (Although the type checker goes to some trouble to support them well.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: davide Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): Replying to [comment:9 simonpj]:
How important is it? I'm tempted to say "don't write types like that"!
From the user's perspective, I can put forward two use cases: 1. Type level let bindings `(..., VG.Mutable v ~ vm) => ... -> vm (PrimState m) e -> ...` From the issue description. This is essentially type-signature `let`-binding, allowing me to write the type signature more clearly/legibly, especially if `vm` is used multiple times. 2. Refactorings For the code in comment 4, `(val ~ Int) =>`, this arose from a refactoring where I replaced `Int` by `val` across a large code base, and then in the top-level-ish function set `val ~ Int` to check if the performance was unimpacted. I'd probably be OK without this being fixed, now that I know what's going on, but if it's not fixed, we somehow have to do a much better job at giving warnings or educating people that you can't just apply a substitution principle when "cleaning up" type signaturees with `~`. I incorrectly assumed this would be type-level only, and as a result had to start a multi-day investigation of where my performance went because I expected any mistake on my side but not this. Especially when writing code using `vector` whose type family heavy API, in my opinion, almost begs for `~` to be used to achieve readable signatures. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC