Re: [GHC] #14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)

18 Sep 2018

      #14941: Switching direct type family application to EqPred (~) prevents inlining in
code using vector (10x slowdown)
-------------------------------------+-------------------------------------
        Reporter:  nh2               |                Owner:  davide
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by davide):

 == Regarding the simple example

 For `f :: forall a. (a ~ Int) => a -> a`, the difference in performance is
 somewhat expected. This may be a different issue from the example given in
 the ticket description. In short, `a ~ Int` is a proof that type `a` is
 equal to type `Int`. In Core, `a ~ Int` is a regular ''boxed'' GADT,
 meaning that it could be bottom, i.e. an invalid proof (this is the main
 mechanism behind
 [https://downloads.haskell.org/~ghc/8.0.2/docs/html/users_guide/glasgow_exts....
 #deferring-type-errors-to-runtime -fdefer-type-errors]). Unboxing `a ~ b`
 corresponds to checking the proof, which is required to coerce the input
 binding from `a` to `Int`. Normally the `(a ~ Int)` would be optimized
 away (as described [http://dreixel.net/research/pdf/epdtecp.pdf here] in
 section 7.3), but that requires a worker/wrapper transformation that never
 happens here. Removing `NOINLINE` allows `f` to be optimized across modules,
 which closes the performance gap.
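
 A minimal sketch of the three shapes discussed above may help. Only the
 signature of the equality-constraint version comes from this comment; the
 bodies and the names `fDirect`, `fEq` and `fExplicit` are assumptions made
 for illustration. `fExplicit` is a source-level analogy for what happens
 in Core: the proof is an ordinary boxed value, and matching on `Refl` is
 the "unboxing" step that must happen before `x` can be used as an `Int`.
 {{{#!haskell
 {-# LANGUAGE GADTs #-}
 {-# LANGUAGE TypeOperators #-}
 {-# LANGUAGE ExplicitForAll #-}
 module Main where

 import Data.Type.Equality ((:~:) (Refl))

 -- Direct version: the argument type is fixed, so no proof is passed.
 fDirect :: Int -> Int
 fDirect x = x + 1
 {-# NOINLINE fDirect #-}

 -- Equality-constraint version, with the signature quoted in this comment.
 -- The body (x + 1) is an assumption made for illustration.
 fEq :: forall a. (a ~ Int) => a -> a
 fEq x = x + 1
 {-# NOINLINE fEq #-}

 -- Source-level analogy: the proof is an explicit boxed GADT argument.
 -- Matching on Refl forces it; only then may x be used at type Int.
 fExplicit :: a :~: Int -> a -> a
 fExplicit Refl x = x + 1
 {-# NOINLINE fExplicit #-}

 main :: IO ()
 main = do
   print (fDirect 41)
   print (fEq 41)
   print (fExplicit Refl 41)
 }}}
 All three calls compute the same result; the difference discussed here is
 only in whether a proof argument is passed and inspected at runtime.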

 == Regarding the original example

 Unlike my simple example, all the code is in one module, so I expect the
 equality proof `VG.Mutable v ~ vm` to be optimized away (again see
 [http://dreixel.net/research/pdf/epdtecp.pdf here] section 7.3). With GHC
 8.2.2, when compiling the slow version, I see `selectVectorDestructive2`
 is specialized to
 `$sselectVectorDestructive2 :: Int -> Vector Int -> MVector (PrimState IO)
 Int -> Int -> Int -> IO ()` (pass 2). This is good, but for some reason
 `myread` and `partitionLoop2` are not specialized even though they are used by
 `$sselectVectorDestructive2`:
 {{{#!haskell
 $sselectVectorDestructive2 =
 ...
     let

         $dMVector =
           Data.Vector.Generic.Base.$p1Vector
             @Vector
             @Int
             Data.Vector.Unboxed.Base.$fVectorVectorInt
     in
 ...
           (Main.myread
                 @IO
                 @MVector
                 @Int
                 Control.Monad.Primitive.$fPrimMonadIO
                 $dMVector
                 GHC.Classes.$fOrdInt
                 GHC.Show.$fShowInt
                 v
                 begin)
 ...
           (Main.partitionLoop2
             @IO
             @MVector
             @Int
             Control.Monad.Primitive.$fPrimMonadIO
             $dMVector
             GHC.Classes.$fOrdInt
             GHC.Show.$fShowInt
             v
             begin
             pivot
             (GHC.Types.I# ...)
 }}}
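
 For anyone wanting to look at this kind of output themselves, the
 following is a rough sketch of one way to dump the specializer and
 simplifier output for a module. The function `largest` is a hypothetical
 stand-in for `myread` / `partitionLoop2`, not code from the ticket, and
 the choice of flags is only one reasonable combination.
 {{{#!haskell
 {-# OPTIONS_GHC -O2 -ddump-spec -ddump-simpl -ddump-to-file #-}
 {-# OPTIONS_GHC -dsuppress-idinfo -dsuppress-coercions #-}
 module Main where

 -- Hypothetical overloaded helper, standing in for myread / partitionLoop2.
 -- It takes Ord and Show dictionaries, like the calls in the dump above.
 largest :: (Ord a, Show a) => [a] -> String
 largest xs = show (maximum xs)
 {-# INLINABLE largest #-}

 main :: IO ()
 main = putStrLn (largest [3, 1, 2 :: Int])
 }}}
 In the resulting dump files, a call that is still passed dictionaries
 such as `GHC.Classes.$fOrdInt` and `GHC.Show.$fShowInt` explicitly, as in
 the excerpt above, has not been specialized; a specialized copy would
 typically appear as a separate `$s`-prefixed binding.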

 In the fast version, `myread` and `partitionLoop2` are specialized in this
 pass. I noticed two other differences:
 * The fast version floats `$dMVector` to a top-level binding.
 * The fast version specializes to `Mutable Vector (PrimState IO) Int` instead
 of `MVector (PrimState IO) Int`. Note that `Mutable` is a type family and
 `Mutable Vector = MVector`.
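
 To make the two signature styles concrete, here is a small sketch,
 assuming the `vector` and `primitive` packages are available. The
 functions `readDirect` and `readEq` are hypothetical and much simpler
 than the code in the ticket; they only show a direct type family
 application next to the same type expressed through an equality
 constraint of the form `VG.Mutable v ~ vm`.
 {{{#!haskell
 {-# LANGUAGE TypeFamilies #-}
 {-# LANGUAGE TypeOperators #-}
 module Main where

 import Control.Monad.Primitive (PrimMonad, PrimState)
 import qualified Data.Vector.Generic as VG
 import qualified Data.Vector.Generic.Mutable as VGM
 import qualified Data.Vector.Unboxed as VU

 -- Direct type family application in the argument type.
 readDirect :: (PrimMonad m, VG.Vector v a)
            => v a -> VG.Mutable v (PrimState m) a -> Int -> m a
 readDirect _ mv i = VGM.read mv i

 -- Same function, with the family application moved into an equality
 -- constraint, so the mutable vector type is the plain variable vm.
 readEq :: (PrimMonad m, VG.Vector v a, VG.Mutable v ~ vm)
        => v a -> vm (PrimState m) a -> Int -> m a
 readEq _ mv i = VGM.read mv i

 main :: IO ()
 main = do
   let v = VU.fromList [10, 20, 30 :: Int]
   mv <- VU.thaw v            -- mv :: VU.MVector RealWorld Int
   a <- readDirect v mv 1
   b <- readEq v mv 1
   print (a, b)
 }}}
 Both functions accept the same arguments at the call site, because
 `Mutable Vector` reduces to `MVector` for unboxed vectors; whether GHC
 specializes and inlines them equally well is the question of this ticket.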

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14941#comment:8
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler