
#14941: Switching direct type family application to EqPred (~) prevents
inlining in code using vector (10x slowdown)
-------------------------------------+-------------------------------------
        Reporter:  nh2               |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Description changed by nh2:

Old description:
{{{#!hs
selectVectorDestructive2
  :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm)
  => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m ()            -- slow

selectVectorDestructive2
  :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
  => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()  -- fast
}}}
These two functions are identical except that one carries `VG.Mutable v ~ vm` as an equality constraint, while the other applies `VG.Mutable v` directly in the argument type, to the right of the `=>`.
The second function is 10x faster.
I would expect them to be equally fast.
The bodies of the two functions are identical; I change only the type signature.
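For completeness, here is a stripped-down sketch of just the two signature shapes. This is not the ticket's code: the module name, `copyFirst*`, their bodies, and the imports are only illustrative assumptions to show the pattern in isolation.

{{{#!hs
{-# LANGUAGE TypeFamilies #-}

module TwoShapes where

import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Generic.Mutable as VGM

-- Equality-constraint shape: the mutable vector type is named `vm`
-- and tied to `v` via `VG.Mutable v ~ vm` (the "slow" shape above).
copyFirst1
  :: (PrimMonad m, VG.Vector v e, VG.Mutable v ~ vm)
  => v e -> vm (PrimState m) e -> Int -> m ()
copyFirst1 _ mv i = VGM.unsafeRead mv i >>= VGM.unsafeWrite mv 0

-- Direct type family application in the argument type
-- (the "fast" shape above); the body is identical.
copyFirst2
  :: (PrimMonad m, VG.Vector v e)
  => v e -> VG.Mutable v (PrimState m) e -> Int -> m ()
copyFirst2 _ mv i = VGM.unsafeRead mv i >>= VGM.unsafeWrite mv 0
}}}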
The slowness happens because, with the first signature, primitives like `unsafeRead` are not inlined, and as a consequence the `Int#`s get boxed back into `Int`s when calling `unsafeRead`.
In particular, in `-ddump-simpl`, the slow version has
{{{#!hs
$wpartitionLoop2_rgEy
  :: forall (m :: * -> *) (vm :: * -> * -> *) e.
     (PrimMonad m, MVector vm e, Ord e)
  => vm (PrimState m) e
     -> GHC.Prim.Int#
     -> e
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> m Int
$wpartitionLoop2_rgEy =
  \... ->
    let {
      endIndex_a7Ay :: Int
      ...
      endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
    ...
    (VGM.basicUnsafeRead ... endIndex_a7Ay)
    ...
}}}
while the fast version has
{{{#!hs
$w$spartitionLoop2_rgUN
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)

...

$spartitionLoop2_rgUP
  :: VG.Mutable VU.Vector (PrimState IO) Int
     -> Int -> Int -> Int -> Int -> Int
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
}}}
So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline + specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently inline `basicUnsafeRead` on unboxed `Int#`.
But with `VG.Mutable v ~ vm`, GHC keeps the abstract `vm (PrimState m) e` all the way through and keeps passing dictionaries. It therefore cannot inline `basicUnsafeRead`, and it re-boxes already-unboxed values (e.g. `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`) before passing them to the non-inlined call of `basicUnsafeRead`, making a tight loop allocate where it normally wouldn't.
Why might rewriting the type signature in such a trivial way make this happen?
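(As a sanity check on this theory, and not something I have verified against the reproducer: forcing the specialisation by hand at the instantiation the benchmark actually uses, e.g. with a SPECIALIZE pragma in the defining module, should recover the fast code for the `~` version too. The `VU`/`VUM` names below assume the usual qualified imports of `Data.Vector.Unboxed` and `Data.Vector.Unboxed.Mutable`.)

{{{#!hs
-- Hypothetical, untested: force a specialisation of the slow (~) variant
-- at the concrete types the benchmark uses (IO, unboxed Int vectors).
{-# SPECIALIZE selectVectorDestructive2
      :: Int
      -> VU.Vector Int
      -> VUM.MVector (PrimState IO) Int
      -> Int -> Int
      -> IO () #-}
}}}

If that pragma brings the `~` version up to the speed of the other one, it would confirm that the regression comes entirely from the lost specialisation rather than from the function bodies themselves.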
----
I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee.
Reproducer:
* https://github.com/nh2/haskell-quickselect-median-of-medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506
* Running instructions are [https://github.com/nh2/haskell-quickselect-median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full-laziness && ./Main +RTS -sstderr`
* For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of-medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis
* I originally just wanted to write a fast median-of-medians implementation at ZuriHac 2017, but got totally derailed by this performance problem. The version of it that I link here is a total mess because of me mauling it to track down this performance issue.
Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness-factor` did not change the situation.
New description:

The new description is identical to the old one above, except that the sentence listing the tested GHC versions now reads:

  I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit
  cc4677c36ee (edit: in which case I commented out the QuickCheck- and
  criterion-related stuff, because their deps don't build there yet).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14941#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler