[GHC] #11735: Optimize coercionKind

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: None/Unknown Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- Currently, the `ForAllCo` case of `coercionKind` does an inefficient one- variable substitution. This could be improved by looking for nested `ForAllCo`s. Furthermore, perhaps we don't need an in-scope set and the full substitution machinery here. The subst is simply propagating the update of a tyvar's kind. No structural changes at all. No need for smart coercion constructors or other processing. So if we're going to optimize this, it might be worth making a specialized version of `subst_ty` and `subst_co` that operate over a `VarEnv Var` instead of a full substitution. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): This originated from commentary on Phab:D2024, though you don't have to read that to understand this ticket. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): see #11598, which is closely related. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * cc: tdammers (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): This could be a useful building block, if not the solution, for #14683. Any concrete ideas / plans on how to implement this? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Let's just do this. Pretty easy. Roughly like the `InstCo` case. Here is an untested patch for `coercionKind`. {{{ diff --git a/compiler/types/Coercion.hs b/compiler/types/Coercion.hs index 3f83b09..daebf35 100644 --- a/compiler/types/Coercion.hs +++ b/compiler/types/Coercion.hs @@ -1707,17 +1707,13 @@ coercionKind co = go co go (Refl _ ty) = Pair ty ty go (TyConAppCo _ tc cos)= mkTyConApp tc <$> (sequenceA $ map go cos) go (AppCo co1 co2) = mkAppTy <$> go co1 <*> go co2 - go (ForAllCo tv1 k_co co) - = let Pair _ k2 = go k_co - tv2 = setTyVarKind tv1 k2 - Pair ty1 ty2 = go co - subst = zipTvSubst [tv1] [TyVarTy tv2 `mk_cast_ty` mkSymCo k_co] - ty2' = substTyAddInScope subst ty2 in - -- We need free vars of ty2 in scope to satisfy the invariant - -- from Note [The substitution invariant] - -- This is doing repeated substitutions and probably doesn't - -- need to, see #11735 - mkInvForAllTy <$> Pair tv1 tv2 <*> Pair ty1 ty2' + + go co@(ForAllCo tv1 k_co co1) + | isReflCo k_co = mkInvForAllTy tv1 <$> go co1 + | otherwise = go_forall empty_subst co + where + empty_subst = mkEmptyTCvSubst (mkInScopeSet $ tyCoVarsOfCo co) + go (FunCo _ co1 co2) = mkFunTy <$> go co1 <*> go co2 go (CoVarCo cv) = coVarTypes cv go (HoleCo h) = coVarTypes (coHoleCoVar h) @@ -1769,10 +1765,16 @@ coercionKind co = go co go_app (InstCo co arg) args = go_app co (arg:args) go_app co args = piResultTys <$> go co <*> (sequenceA $ map go args) - -- The real mkCastTy is too slow, and we can easily have nested ForAllCos. - mk_cast_ty :: Type -> Coercion -> Type - mk_cast_ty ty (Refl {}) = ty - mk_cast_ty ty co = CastTy ty co + go_forall subst (ForAllCo tv1 k_co co) + = mkInvForAllTy <$> Pair tv1 tv2 <*> go_forall subst' co + where + Pair _ k2 = go k_co + tv2 = setTyVarKind tv1 (substTy subst k2) + subst' | isReflCo k_co = extendTCvInScope subst tv1 + | otherwise = extendTvSubst (extendTCvInScope subst tv2) tv1 $ + TyVarTy tv2 `mkCastTy` mkSymCo k_co + go_forall subst other_co + = substTy subst `pLiftSnd` go other_co -- | Apply 'coercionKind' to multiple 'Coercion's coercionKinds :: [Coercion] -> Pair [Type] }}} '''Richard: can you check'''. We should do the same thing to `coercionKindRole`. (Tiresomely.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
We should do the same thing to coercionKindRole. (Tiresomely.)
Actually, looking at it, I think it'd be better to define {{{ coercionRole :: Coercion -> Role }}} directly (a simple, fast recursive function), and then define {{{ coercionKindRole :: Coercion -> (Pair Type, Role) coercionKindRole co = (coercionKind co, coercionRole co) }}} Less duplication, and (I strongly suspect) faster. Could you try that and check perf? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Applying the `coercionKind` patch brings down compilation time for the `Grammar.hs` example from #14683 to about 30 seconds. Relevant profiling data: {{{ Wed Jan 24 14:33 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to- file -fforce-recomp total time = 32.48 secs (32475 ticks @ 1000 us, 1 processor) total alloc = 43,424,804,608 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 20.3 22.6 subst_ty TyCoRep compiler/types/TyCoRep.hs:2225:28-32 17.7 23.9 Stg2Stg HscMain compiler/main/HscMain.hs:1489:12-44 13.9 16.3 simplCast Simplify compiler/simplCore/Simplify.hs:871:62-87 12.8 10.7 simplCast-addCoerce Simplify compiler/simplCore/Simplify.hs:1225:53-71 11.6 9.2 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 6.4 4.6 addCoerce-pushCoTyArg Simplify compiler/simplCore/Simplify.hs:(1236,12)-(1237,72) 5.0 4.3 preprocessFile GhcMake compiler/main/GhcMake.hs:(2416,1)-(2443,37) 1.0 0.0 coercionKind Coercion compiler/types/Coercion.hs:1707:3-7 1.0 1.9 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:6 simonpj]:
We should do the same thing to coercionKindRole. (Tiresomely.)
Actually, looking at it, I think it'd be better to define {{{ coercionRole :: Coercion -> Role }}} directly (a simple, fast recursive function), and then define {{{ coercionKindRole :: Coercion -> (Pair Type, Role) coercionKindRole co = (coercionKind co, coercionRole co) }}} Less duplication, and (I strongly suspect) faster. Could you try that and check perf?
I'm not sure I understand what `coercionRole` would look like. Also, does that mean that the inline notes in the existing `coercionRoleKinds` and `coercionRole` functions don't apply anymore? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Oh wait, I think I get it. Never mind. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonpj): * cc: goldfire (added) Comment: Great. Next step: 1. We should not spend 20% our time in `CoreTidy`, I think. Drill down. 2. Something is clearly wrong with `Simplify.simplCast`. I think I know what it is. Given {{{ (fun |> co) @t1 @t2 ... @tn }}} we will call `pushCoTyArg` `n` times, and hence does `n` singleton substitutions, via the `n` calls to `piResultTy`. Solution: gather up those type arguments (easy) and define {{{ pushCoTyArgs :: Coercion -> [Type] -> Maybe ([Type], Coercion) }}} using an accumulating substitution. I spent a few minutes trying to write down `pushCoTyArgs` but my brain melted in a mess of kind casts, and I ran out of time. '''Richard''' can you help with (2), please? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonpj): * failure: None/Unknown => Compile-time performance bug -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Refactoring `coercionRoleKinds` / `coercionRole` gives us another performance boost: {{{ Wed Jan 24 17:17 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to- file -fforce-recomp total time = 24.18 secs (24176 ticks @ 1000 us, 1 processor) total alloc = 29,250,375,752 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 24.8 28.5 Stg2Stg HscMain compiler/main/HscMain.hs:1489:12-44 20.1 24.2 simplCast Simplify compiler/simplCore/Simplify.hs:871:62-87 18.6 15.9 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 8.1 6.9 addCoerce-pushCoTyArg Simplify compiler/simplCore/Simplify.hs:(1236,12)-(1237,72) 7.3 6.4 subst_ty TyCoRep compiler/types/TyCoRep.hs:2225:28-32 4.2 5.1 preprocessFile GhcMake compiler/main/GhcMake.hs:(2416,1)-(2443,37) 1.4 0.0 coercionKind Coercion compiler/types/Coercion.hs:1707:3-7 1.4 3.0 zonkTopDecls TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43) 1.3 1.1 simplExprF1-Lam Simplify compiler/simplCore/Simplify.hs:896:5-39 1.0 1.1 }}} Now on to the 25% spent on CoreTidy. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Refactoring coercionRoleKinds / coercionRole gives us another
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): performance boost That is totally bonkers (in a good way). From 43G down to 29G from that one change. Wow. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I'm dubious of the `coercionRole` refactor, as it undoes a refactoring I put in a few years ago, for performance reasons. The problem is that `coercionRole` depends on `coercionKind` in the `NthCo` case. So, if these functions are separated, then `coercionRole` recurs via both `coercionKind` and `coercionRole` sometimes, causing a lot of extra work. Note that a standalone `coercionKind` does exist, also for performance reasons. To be clear, I'm not doubting your numbers on your particular test case, but I'm not sure how far this would generalize. That said, the degree to which your case improved suggests there's some other inefficiency here. Maybe it's all the tuples? If we make them unboxed, does that fix the problem? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I can take a look at comment:10 on Friday, but not before. Also, tdammers, thanks thanks thanks for doing this! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I'm dubious of the coercionRole refactor, as it undoes a refactoring I
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): put in a few years ago, for performance reasons. I looked at what code would be needed for `coercionRole` and its a remarkably short and simple function. Tobias, would to like to post the code? Richard, did you have a concrete reason for that refactoring? Or just a general worry? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I'm dubious of the coercionRole refactor, as it undoes a refactoring I
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:16 simonpj]: put in a few years ago, for performance reasons.
I looked at what code would be needed for `coercionRole` and its a
remarkably short and simple function. Tobias, would to like to post the code? Of course. Here's the patch: {{{ diff --git a/compiler/types/Coercion.hs b/compiler/types/Coercion.hs index e1a5b7cde0..36874a4c4d 100644 --- a/compiler/types/Coercion.hs +++ b/compiler/types/Coercion.hs @@ -1783,77 +1783,53 @@ coercionKinds tys = sequenceA $ map coercionKind tys -- | Get a coercion's kind and role. -- Why both at once? See Note [Computing a coercion kind and role] coercionKindRole :: Coercion -> (Pair Type, Role) -coercionKindRole = go +coercionKindRole co = (coercionKind co, coercionRole co) + +-- | Retrieve the role from a coercion. +coercionRole :: Coercion -> Role +coercionRole = go where - go (Refl r ty) = (Pair ty ty, r) - go (TyConAppCo r tc cos) - = (mkTyConApp tc <$> (sequenceA $ map coercionKind cos), r) - go (AppCo co1 co2) - = let (tys1, r1) = go co1 in - (mkAppTy <$> tys1 <*> coercionKind co2, r1) - go (ForAllCo tv1 k_co co) - = let Pair _ k2 = coercionKind k_co - tv2 = setTyVarKind tv1 k2 - (Pair ty1 ty2, r) = go co - subst = zipTvSubst [tv1] [TyVarTy tv2 `mkCastTy` mkSymCo k_co] - ty2' = substTyAddInScope subst ty2 in - -- We need free vars of ty2 in scope to satisfy the invariant - -- from Note [The substitution invariant] - -- This is doing repeated substitutions and probably doesn't - -- need to, see #11735 - (mkInvForAllTy <$> Pair tv1 tv2 <*> Pair ty1 ty2', r) - go (FunCo r co1 co2) - = (mkFunTy <$> coercionKind co1 <*> coercionKind co2, r) + go (Refl r _) = r + go (TyConAppCo r _ _) = r + go (AppCo co1 _) = go co1 + go (ForAllCo _ _ co) = go co + go (FunCo r _ _) = r go (CoVarCo cv) = go_var cv go (HoleCo h) = go_var (coHoleCoVar h) - go co@(AxiomInstCo ax _ _) = (coercionKind co, coAxiomRole ax) - go (UnivCo _ r ty1 ty2) = (Pair ty1 ty2, r) - go (SymCo co) = first swap $ go co - go (TransCo co1 co2) - = let (tys1, r) = go co1 in - (Pair (pFst tys1) (pSnd $ coercionKind co2), r) + go (AxiomInstCo ax _ _) = coAxiomRole ax + go (UnivCo _ r _ _) = r + go (SymCo co) = go co + go (TransCo co1 co2) = go co1 go (NthCo d co) | Just (tv1, _) <- splitForAllTy_maybe ty1 = ASSERT( d == 0 ) - let (tv2, _) = splitForAllTy ty2 in - (tyVarKind <$> Pair tv1 tv2, Nominal) + Nominal | otherwise = let (tc1, args1) = splitTyConApp ty1 (_tc2, args2) = splitTyConApp ty2 in ASSERT2( tc1 == _tc2, ppr d $$ ppr tc1 $$ ppr _tc2 ) - ((`getNth` d) <$> Pair args1 args2, nthRole r tc1 d) + (nthRole r tc1 d) where - (Pair ty1 ty2, r) = go co - go co@(LRCo {}) = (coercionKind co, Nominal) + (Pair ty1 ty2, r) = coercionKindRole co + go (LRCo {}) = Nominal go (InstCo co arg) = go_app co [arg] - go (CoherenceCo co1 co2) - = let (Pair t1 t2, r) = go co1 in - (Pair (t1 `mkCastTy` co2) t2, r) - go co@(KindCo {}) = (coercionKind co, Nominal) - go (SubCo co) = (coercionKind co, Representational) - go co@(AxiomRuleCo ax _) = (coercionKind co, coaxrRole ax) + go (CoherenceCo co1 _) = go co1 + go (KindCo {}) = Nominal + go (SubCo _) = Representational + go (AxiomRuleCo ax _) = coaxrRole ax ------------- - go_var cv = (coVarTypes cv, coVarRole cv) + go_var = coVarRole ------------- - go_app :: Coercion -> [Coercion] -> (Pair Type, Role) + go_app :: Coercion -> [Coercion] -> Role -- Collect up all the arguments and apply all at once -- See Note [Nested InstCos] go_app (InstCo co arg) args = go_app co (arg:args) - go_app co args - = let (pair, r) = go co in - (piResultTys <$> pair <*> (sequenceA $ map coercionKind args), r) - --- | Retrieve the role from a coercion. -coercionRole :: Coercion -> Role -coercionRole = snd . coercionKindRole - -- There's not a better way to do this, because NthCo needs the *kind* - -- and role of its argument. Luckily, laziness should generally avoid - -- the need for computing kinds in other cases. + go_app co args = go co {- Note [Nested InstCos] }}} Or, for increased clarity, just the new version of `coercionRole`: {{{ -- | Retrieve the role from a coercion. coercionRole :: Coercion -> Role coercionRole = go where go (Refl r _) = r go (TyConAppCo r _ _) = r go (AppCo co1 _) = go co1 go (ForAllCo _ _ co) = go co go (FunCo r _ _) = r go (CoVarCo cv) = go_var cv go (HoleCo h) = go_var (coHoleCoVar h) go (AxiomInstCo ax _ _) = coAxiomRole ax go (UnivCo _ r _ _) = r go (SymCo co) = go co go (TransCo co1 co2) = go co1 go (NthCo d co) | Just (tv1, _) <- splitForAllTy_maybe ty1 = ASSERT( d == 0 ) Nominal | otherwise = let (tc1, args1) = splitTyConApp ty1 (_tc2, args2) = splitTyConApp ty2 in ASSERT2( tc1 == _tc2, ppr d $$ ppr tc1 $$ ppr _tc2 ) (nthRole r tc1 d) where (Pair ty1 ty2, r) = coercionKindRole co go (LRCo {}) = Nominal go (InstCo co arg) = go_app co [arg] go (CoherenceCo co1 _) = go co1 go (KindCo {}) = Nominal go (SubCo _) = Representational go (AxiomRuleCo ax _) = coaxrRole ax ------------- go_var = coVarRole ------------- go_app :: Coercion -> [Coercion] -> Role -- Collect up all the arguments and apply all at once -- See Note [Nested InstCos] go_app (InstCo co arg) args = go_app co (arg:args) go_app co args = go co }}} I didn't make an effort at completely understanding what this is supposed to do, really; what I did, in a nutshell, was: - Rename `coercionKindRole` to `coercionRole` and change its type - Remove the first tuple element from each returned value - Fix recursive calls and local functions to not take the now-redundant first parameter and not return the first tuple element - Write a new `coercionKindRole` function that simply calls `coercionKind` and `coercionRole` and tuples up the results So it's entirely possible that I did something wrong somewhere. Anyway, I don't know if this would qualify as "remarkably short", but it is fairly simple.
Richard, did you have a concrete reason for that refactoring? Or just a
general worry? I believe the explanation is in the note at the bottom: {{{ Note [Nested InstCos] ~~~~~~~~~~~~~~~~~~~~~ In Trac #5631 we found that 70% of the entire compilation time was being spent in coercionKind! The reason was that we had (g @ ty1 @ ty2 .. @ ty100) -- The "@s" are InstCos where g :: forall a1 a2 .. a100. phi If we deal with the InstCos one at a time, we'll do this: 1. Find the kind of (g @ ty1 .. @ ty99) : forall a100. phi' 2. Substitute phi'[ ty100/a100 ], a single tyvar->type subst But this is a *quadratic* algorithm, and the blew up Trac #5631. So it's very important to do the substitution simultaneously; cf Type.piResultTys (which in fact we call here). }}} I'm not entirely sure whether this still applies though; I would expect the separate `coercionRole` and `coercionKinds` functions to perform better individually than the combined one, except when both are actually needed in concert. And even then, I'm skeptical; the recursive calls would drag along both data structures (kinds and role). Is it possible that this particular bit of code looked crucially different back when #5631 was ongoing? Or did I un-refactor in a way that doesn't replicate a crucial problem from the original code? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Oh, and also this note further up: {{{ Note [Computing a coercion kind and role] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To compute a coercion's kind is straightforward: see coercionKind. But to compute a coercion's role, in the case for NthCo we need its kind as well. So if we have two separate functions (one for kinds and one for roles) we can get exponentially bad behaviour, since each NthCo node makes a separate call to coercionKind, which traverses the sub-tree again. This was part of the problem in Trac #9233. Solution: compute both together; hence coercionKindRole. We keep a separate coercionKind function because it's a bit more efficient if the kind is all you want. }}} Which leads me to believe that maybe this particular input I'm testing doesn't hit the `NthCo` branch enough to make a dent there, and thus fares better with the separated-out implementation, but other code might. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): * In your `coercionRole`, `get_app` never uses its second argument! So you can get rid of it entirely I think {{{ go (InstCo co _) = go co }}} * `Note [Nested InstCos]` applies to `coercionKind` but not, I think, to `coercionRole` * `Note [Computing a coercion kind and role]` claims that computing the result of `coercionRole (NthCo d co)` requires `coercionKind`. But it manifestly does not. '''Richard''' is this note simply wrong? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): Thanks for finding `Note [Computing a coercion kind and role]`. I believe this note is correct and is as relevant today as it was when I wrote it. And, I'm afraid to say that `coercionRole (NthCo ...)` manifestly '''does''' require `coercionKind`. See the call in the `where` clause in the suggested patch. However, perhaps the solution lies in including the role of an `NthCo` in the `NthCo`, essentially caching this result. I think this would be easy to do, especially as the caller of `mkNthCo` generally has to know what role its getting -- `mkNthCo` could take this role as a parameter. Lint could check that the supplied role is the same as what would have been calculated. With this change to `NthCo`, then `coercionRole` really would be independent of `coercionKind` and we could separate the functions. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:20 goldfire]:
Thanks for finding `Note [Computing a coercion kind and role]`. I believe this note is correct and is as relevant today as it was when I wrote it.
And, I'm afraid to say that `coercionRole (NthCo ...)` manifestly '''does''' require `coercionKind`. See the call in the `where` clause in the suggested patch.
Correct. The original `coercionKindRole` function recursed via the `go` worker directly, but now that I have rewritten it as `coercionRole`, the `NthCo` case still required the coercion kind to do its thing, so it calls `coercionKindRole`, which, post-unrefactoring, calls both `coercionKind` and `coercionRole`.
However, perhaps the solution lies in including the role of an `NthCo` in the `NthCo`, essentially caching this result. I think this would be easy to do, especially as the caller of `mkNthCo` generally has to know what role its getting -- `mkNthCo` could take this role as a parameter. Lint could check that the supplied role is the same as what would have been calculated. With this change to `NthCo`, then `coercionRole` really would be independent of `coercionKind` and we could separate the functions.
Sounds like a reasonable plan; however, before I venture into implementing this, I would love to have some hard evidence that we actually do have a problem. Do you have any single-module example code ready that I could use to prove that the "un-refactored" code performs significantly worse than current HEAD on some inputs? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:19 simonpj]:
* In your `coercionRole`, `get_app` never uses its second argument! So you can get rid of it entirely I think {{{ go (InstCo co _) = go co }}}
Good catch, hadn't noticed. Thanks!
* `Note [Computing a coercion kind and role]` claims that computing the result of `coercionRole (NthCo d co)` requires `coercionKind`. But it manifestly does not. '''Richard''' is this note simply wrong?
No, it's not wrong, see my earlier comment - `coercionRole` now recurses via `coercionKindRole` in the `where` clause. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
I'm afraid to say that coercionRole (NthCo ...) manifestly does require coercionKind
Aha. So it does. Very well, yes let's do this * Cache the role (of the result of `NthCo i co`) in the `NthCo` data constructor * Make `mkNth` compute that cached role by calling `coercionKind` and `coercionRole` (in the final equation where we build `NthCo`). * Possible gloss: in some calls to `mkNthCo` we know the role, and we even know that the various short-cuts in `mkNthCo` won't succeed. So we can just use `NthCo` directly -- or define `mkNthCoDirect = NthCo` and call `mkNthCoDirect`. Perhaps not worth it; but when we do know the role it seems a bit silly not to use it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I would love to have some hard evidence that we actually do have a
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): problem. Well, without caching we know that getting the role of `NthCo i1 (NthCo i2 (NthCo i3 ...))` will take time at least quadratic in the nesting depth. Caching cheaply avoids nasty corner case. And not much code is involved. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I would love to have some hard evidence that we actually do have a
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:24 simonpj]: problem.
Well, without caching we know that getting the role of `NthCo i1 (NthCo
i2 (NthCo i3 ...))` will take time at least quadratic in the nesting depth. Caching cheaply avoids nasty corner case. And not much code is involved. Fair enough, I'll get to it. Just wanted a test case for practical verification. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): The caching approach implemented directly (i.e., adding a `Role` field to the `NthCo` constructor) is a fairly pervasive one; it turns out that many modules depend on that constructor, either directly o via the isomorphic `IfaceCoercion` type (which, I believe, should inherit the additional field, correct me if I'm wrong), so adding the extra fields is going to touch a lot of source files. Not a fundamental problem per se, but it means that this will take a little while and will deserve some extra scrutinity before unleashing it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): OK, new profiling result: {{{ Thu Jan 25 13:11 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to- file -fforce-recomp total time = 20.99 secs (20989 ticks @ 1000 us, 1 processor) total alloc = 29,250,375,256 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 24.2 28.5 Stg2Stg HscMain compiler/main/HscMain.hs:1489:12-44 20.3 24.2 simplCast Simplify compiler/simplCore/Simplify.hs:871:62-87 18.7 15.9 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 9.0 6.9 addCoerce-pushCoTyArg Simplify compiler/simplCore/Simplify.hs:(1236,12)-(1237,72) 7.4 6.4 subst_ty TyCoRep compiler/types/TyCoRep.hs:2237:28-32 4.3 5.1 zonkTopDecls TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43) 1.4 1.1 coercionKind Coercion compiler/types/Coercion.hs:1725:3-7 1.3 3.0 simplExprF1-Lam Simplify compiler/simplCore/Simplify.hs:896:5-39 1.0 1.1 }}} This is the same `Grammar.hs`, compiled with the same GHC code as before, but with `NthCo` extended with an extra `Role` field, and the kind calculation from `coercionRole` moved out into `mkNthCo`. I ended up having to make changes in 16 modules, but most of them were straightforward, discarding or forwarding the extra field in a pattern match. I think this shouldn't have any negative impact, because forwarding the role can only make things better (avoiding future calls to `coercionRole`), and discarding it retains the old status. Conclusions: - We're shaving off another 4 seconds of execution time, and allocations remain the same. So this doesn't seem to make things worse for the `Grammar.hs` case. - We are not actually reducing allocations any further. - CoreTidy is worth looking into. - In order to verify that this change really makes an impact for the better, I would still love to test this against source code that would perform badly without it. Test cases very welcome. - 20 seconds is still awfully long. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): You don't need to propagate the change into `IfaceCoercion`: it's just a cache, so you can recompute it when turning `IfaceNthCo` into `NthCo`. On the whole I think it'd be better not to change `IfaceCoercion`: less redundancy in the `Iface` data types. Shaving off 4 seconds is fantastic. Take it! Also: * Yes `CoreTidy` is worth looking into * But also so is `simplCast`. See comment:10. '''Richard''' I need your help! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I still say that most uses of `mkNthCo` already know the right role -- though perhaps this role may not be as obvious to someone who isn't me. Just supplying the role would be much better than computing it, of course. I don't agree with Simon's suggestion about `mkNthCoDirect` -- this is a first step toward invariants that are not upheld. Instead, I would have `mkNthCo` require a role and make a new `mkNthCoNoRole` that computes it. The naming of the functions should discourage the use of the second. The Note explaining why I originally bundled `coercionKind` and `coercionRole` points to a test case. Does that give a concrete testing ground? You might also look in the git history around that note to see if that gives you any pointers. Agreed about not cluttering `IfaceCoercion`. And, yes, I will take a look at `simplCast` tomorrow. (Though it's deeply scary when Simon says "My brain melted" and then asks me to take a look!) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): By the way, if @tdammers can make the patch available (on GitHub, say), I could look into patching the patch. That might be easier. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
Instead, I would have mkNthCo require a role and make a new mkNthCoNoRole that computes it.
Yes, but then `mkNthCo` relies on the caller to guarantee the invariant. That's fine; Lint will check. That's all I meant `mkNthCoDirect` to do. But I'm not fussy about naming, so long as the comments are clear. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I still say that most uses of `mkNthCo` already know the right role --
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:29 goldfire]: though perhaps this role may not be as obvious to someone who isn't me. Just supplying the role would be much better than computing it, of course. It is indeed a bit less obvious to me; I have found a few places (2 or 3 I think) where we have a role ready that could be a candidate, but not being sure whether it would be the right thing, I opted for the conservative option.
I don't agree with Simon's suggestion about `mkNthCoDirect` -- this is a first step toward invariants that are not upheld. Instead, I would have `mkNthCo` require a role and make a new `mkNthCoNoRole` that computes it. The naming of the functions should discourage the use of the second.
The Note explaining why I originally bundled `coercionKind` and `coercionRole` points to a test case. Does that give a concrete testing ground? You might also look in the git history around that note to see if
I have absolutely no opinion on this one; happy to implement it either way. that gives you any pointers. OK, will take a look.
Agreed about not cluttering `IfaceCoercion`.
Yes, makes total sense. Undoing as we speak. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:30 goldfire]:
By the way, if @tdammers can make the patch available (on GitHub, say), I could look into patching the patch. That might be easier.
Will do ASAP. I made a bit of a mess with git, but once that's cleaned up I'll upload somewhere (phab?) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Here's what I have so far: [https://github.com/ghc/ghc/tree/wip/tdammers/T11735] -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I have a lingering concern: ''why'' did the old `coercionKindRole` perform so miserably? In a call to `coercionRole`, the kind calculations should never be forced. So what takes up all the memory? Is it really just the tuples? If so, then we've discovered a major way to speed up other areas of GHC: convert tuples to be unboxed. Even better, we've discovered a major missing optimization, which could probably automate the unboxing somehow. So I wonder if there are more opportunities here. None of this changes the current direction of travel (caching is a good idea, regardless of my question here), but perhaps suggests another future line of inquiry. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): The GHC branch is `wip/tdammers/T11735` -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I have a lingering concern: ''why'' did the old `coercionKindRole`
So I wonder if there are more opportunities here. None of this changes
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:35 goldfire]: perform so miserably? In a call to `coercionRole`, the kind calculations should never be forced. So what takes up all the memory? Is it really just the tuples? If so, then we've discovered a major way to speed up other areas of GHC: convert tuples to be unboxed. Even better, we've discovered a major missing optimization, which could probably automate the unboxing somehow. I think the hypothesis was something like this: Suppose we have a deeply nested `NthCo`, and we want to calculate its coercionKind. In order to do that, we need its coercionRole, which involves recursing through the whole thing (O(n)), but once we have that result, we only use it to decide whether we need to keep recursing, and then throw it away. And then in the next step, we calculate the coercionRole again. So the whole thing ends up as O(n²). Whereas if we cache the roles throughout, we only ever calculate each of them once, at construction time, so we never get the quadratic badness. So it's not that the role calculation forces the kind calculation, but the other way around - in order to calculate the correct kind for an NthCo, we need to know its role, but potentially also the roles of all of its children. So in a way, the caching serves as memoization. the current direction of travel (caching is a good idea, regardless of my question here), but perhaps suggests another future line of inquiry. Considering how the Grammar.hs example still takes about 20 seconds to compile, and there are a few rather whopping candidates popping up in the profile, yes, I think it is very likely that we can find other opportunities. I will definitely look into the tuple unboxing thing, and also try to get to the bottom of the CoreTidy and simplCast cost centres. Who knows, maybe they're somehow related, even. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I'm sorry -- I don't understand the first part of comment:37. Getting a kind should never require getting a role. That's why there is a version of `coercionKind` that's a standalone function. Let's assume you got these two swapped. Even then, I'm not sure what you're describing; it seems you're describing your 'un-refactored" version keeping roles and kinds separate. If they are together (as in HEAD), I don't see the quadratic behavior. And yet, something goes terribly wrong in HEAD, even without this quadratic behavior. But what?? Or maybe I'm completely missing something here. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:38 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:38 goldfire]:
I'm sorry -- I don't understand the first part of comment:37. Getting a kind should never require getting a role. That's why there is a version of `coercionKind` that's a standalone function. Let's assume you got these two swapped. Even then, I'm not sure what you're describing; it seems you're describing your 'un-refactored" version keeping roles and kinds separate.
I'm sorry, yes, I was being confused there; `coercionKind` and `coercionRole` are mutually recursive in the "un-refactored" version only. However, the un-refactoring *does* produce a performance improvement, so there must be *something* going on here - I assumed that the original `coercionKindRole` would ultimately amount to a similar recursion pattern, but it probably doesn't.
If they are together (as in HEAD), I don't see the quadratic behavior. And yet, something goes terribly wrong in HEAD, even without this quadratic behavior. But what?? Or maybe I'm completely missing something here.
I think I have found it. For clarity, this is the relevant code on HEAD: {{{ go (NthCo d co) | Just (tv1, _) <- splitForAllTy_maybe ty1 = ASSERT( d == 0 ) let (tv2, _) = splitForAllTy ty2 in (tyVarKind <$> Pair tv1 tv2, Nominal) | otherwise = let (tc1, args1) = splitTyConApp ty1 (_tc2, args2) = splitTyConApp ty2 in ASSERT2( tc1 == _tc2, ppr d $$ ppr tc1 $$ ppr _tc2 ) ((`getNth` d) <$> Pair args1 args2, nthRole r tc1 d) where (Pair ty1 ty2, r) = go co }}} So the rough shape of the recursion is simple - we hit the `otherwise` case repeatedly until we get to the `d == 0` case; O(n). But inside the `otherwise` branch, there's this pesky `getNth` call, which is linear in `d` (being essentially a linked-list lookup), and another one via `nthRole`. The problem goes away when we calculate the role at construction time, because we are either constructing an NthCo that doesn't wrap another NthCo, which makes the role calculation constant-time, or we are constructing one that *does* wrap another NthCo, but that one already has its role calculated, so it is also constant. Hope I'm making sense now. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): Sorry, still confused. :( How are `coercionKind` and `coercionRole` ''mutually'' recursive? I see that `coercionRole` calls `coercionKind` but not the other way around. But you're right that I'm trying to understand better why there's a performance improvement in this patch (even before any caching). In the nested `NthCo` case, I'm pretty sure your refactor would be worse. But in the test case at hand (which I assume doesn't have nested `NthCo`s -- haven't looked), your change is clearly an improvement. However, I don't think your analysis above is really the problem. I would expect that the running time of `coercionKind` or `coercionRole` on nested `NthCo`s to be linear in the sum of the `d`s -- that is, we'll have to add together all the indices. You've shown above that the old recursion pattern (from `coercionKindRole`) traverses down the linked list twice (once in `getNth` and once in `nthRole`), but this shouldn't change asymptotic complexity. And, usually, `d` is quite small, and so I wouldn't expect this to show up at all, really. I still don't think we've quite gotten to the bottom of why separating out `coercionKind` and `coercionRole` should effect a performance improvement. On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum. Please don't let my nit-picking slow you down or discourage you. It's just that I think you've hit something quite interesting, and, well, I'm interested. :) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:40 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): OK. I looked at `pushCoTyArg` and friends, and I have a very simple solution: just move the `isReflexiveCo` case in `addCoerce` (a local function within `Simplify.simplCast`) to the top. That should do it. Then `pushCoTyArg` is never called with a reflexive coercion, and so the `piResultTy` case won't happen. Now, `pushCoArgs` might still call `pushCoTyArg` with a reflexive coercion, but it can be taught not to as well: Have `pushCoArgs` return a `Maybe ([CoreArg], Maybe Coercion)` and `pushCoArg` return a `Maybe (CoreArg, Maybe Coercion)`. If the second return values are `Nothing`, that means that there is no cast (i.e., that the cast would have been reflexive). The only client of `pushCoArg(s)` is `exprIsConApp_maybe`, which simply omits a cast if `pushCoArgs` returns `Nothing`. Then, we never have to bother creating the reflexive coercions. This should be an easy win all around. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:41 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I've taken the liberty of pushing directly to the `wip/tdammers/T11735` branch. I've done a bit of testing, and things seem OK, but a full testsuite run is surely warranted. Also, I've done no performance study of any kind, so that will surely need to be done, too. I'm curious to see what you find! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:42 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): OK. I've fixed a few testsuite failures. I think this is probably correct now. (CircleCI reports a bunch of errors in `profiling`, but I can't repro locally.) What happens when you benchmark? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:43 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:43 goldfire]:
OK. I've fixed a few testsuite failures. I think this is probably correct now. (CircleCI reports a bunch of errors in `profiling`, but I can't repro locally.)
What happens when you benchmark?
Curiously, the latest version clocks in at about 25 seconds for Grammar.hs, so almost as fast, but not quite, as the fastest we've seen so far (~21 seconds): {{{ Mon Jan 29 09:43 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to- file -fforce-recomp total time = 25.95 secs (25950 ticks @ 1000 us, 1 processor) total alloc = 33,815,717,032 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc simplCast Simplify compiler/simplCore/Simplify.hs:871:62-87 26.0 21.8 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 19.9 24.7 Stg2Stg HscMain compiler/main/HscMain.hs:1489:12-44 17.6 20.9 addCoerce-pushCoTyArg Simplify compiler/simplCore/Simplify.hs:(1251,12)-(1254,36) 10.5 9.0 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 7.3 6.0 subst_ty TyCoRep compiler/types/TyCoRep.hs:2240:28-32 3.5 4.4 coercionKind Coercion compiler/types/Coercion.hs:1711:3-7 2.1 4.6 zonkTopDecls TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43) 1.1 0.9 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Sorry, still confused. :(
How are `coercionKind` and `coercionRole` ''mutually'' recursive? I see
But you're right that I'm trying to understand better why there's a
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:40 goldfire]: that `coercionRole` calls `coercionKind` but not the other way around. Sorry for the confusion, I shouldn't be trying to explain things I only half understand myself when tired. Looking at HEAD on `master`, we have the `coercionKindRole` function that is simply recursive for NthCo, and has the `nth` call embedded. So for nested NthCo's, its behavior should be quadratic. The un-refactored version has `coercionKind`, which is simply recursive (O(n)), and we have `coercionRole`, which recurses via `coercionKindRole`. The latter is really just a matter of calling `coercionKind` and `coercionRole` individually though; the `coercionRole` call makes `coercionRole` simply recursive (O(n)), but the `coercionKind` call introduces another O(n), making the entire thing also quadratic. Calling this "mutually recursive" is of course a brainfart on my side, since `coercionKind` never calls back into `coercionRole`. So in terms of big-O, HEAD and "un-refactored" should be the same. Why one perfoms better than the other is somewhat unclear to me though. performance improvement in this patch (even before any caching). In the nested `NthCo` case, I'm pretty sure your refactor would be worse. But in the test case at hand (which I assume doesn't have nested `NthCo`s -- haven't looked), your change is clearly an improvement.
However, I don't think your analysis above is really the problem. I
would expect that the running time of `coercionKind` or `coercionRole` on nested `NthCo`s to be linear in the sum of the `d`s -- that is, we'll have to add together all the indices. You've shown above that the old recursion pattern (from `coercionKindRole`) traverses down the linked list twice (once in `getNth` and once in `nthRole`), but this shouldn't change asymptotic complexity. And, usually, `d` is quite small, and so I wouldn't expect this to show up at all, really. I still don't think we've quite gotten to the bottom of why separating out `coercionKind` and `coercionRole` should effect a performance improvement. Indeed, it's not the problem, or rather, the un-refactoring alone doesn't fix anything - if it makes a difference at all, then it's most likely just a constant-factor improvement, nothing fundamental. But that's not why we did it anyway; the reason we did it was so that we could more easily implement the caching (storing precalculated roles in the NthCo itself), which breaks down one of the linear terms to constant.
On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum.
Yes, but we don't know if it's actually big-O-faster, or just happens have a more favorable constant factor.
Please don't let my nit-picking slow you down or discourage you. It's just that I think you've hit something quite interesting, and, well, I'm interested. :)
Absolutely not, your input has been super helpful, and I much prefer hard criticism over empty praise. Bring it on :) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:45 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): I ran a selection of commits against two test inputs: Grammar.hs, as above, and the `test_rules2.hs` file from #5631. Here's a list of execution times for both: {{{ --- ./cases/Grammar.hs --- 452dee3ff4: 245.51 4572849929: 19.62 4eb140f564: 12.25 8a6aa5030d: 12.26 d74b37d565: 16.14 --- ./cases/test_rules2.hs --- 452dee3ff4: 1.20 4572849929: 1.19 4eb140f564: 1.19 8a6aa5030d: 1.23 d74b37d565: 1.18 }}} The commits are: {{{ 452dee3ff4: GHC `master` before branching off on this issue (should be the same as GHC HEAD) 4572849929: Simon's patch from [comment:5] applied. 4eb140f564: After "un-refactoring" 8a6aa5030d: With coercion roles cached, and mkNthCo taking an extra Role argument d74b37d565: current HEAD of wip/tdammers/T11735 }}} (Note that the execution times are a bit faster overall in this run because I didn't pass any `-ddump` flags, so GHC spends no time pretty- printing the dumps). So, conclusions: - The [comment:5] patch makes a huge difference for `Grammar.hs` (down to less than 10% in execution time) - "un-refactoring", strangely enough, improves performance on `Grammar.hs` by roughly another 25% or so - caching coercion rules in NthCo doesn't seem to make a difference at all, and even makes things slightly worse (this one is truly baffling IMO) - improvements from [comment:41] make the `Grammar.hs` test case worse (this, too, is unexpected) - none of the changes here seem to affect performance for the `test_rules2.hs` case much, if at all -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:46 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Currently doing a full test run, and I don't have the full result yet, but it seems that many tests are failing. Not sure if this is due to these particular code changes though, I will look into it once the run finishes. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:47 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): There are a number of different things going on. 1. comment:5 gets rid of a non-linearity in `coercionKind` for the `ForallCo` case. This is a big win. We definitely want it. 2. The same non-linearity is present in `coercionKindRole` so I suggested getting rid of the duplication between the two. That seems possible by caching the role in `NthCo`. Doing so doesn't solve any known performance problems, but seems to be nicer code. 3. We have a clear problem in `simplCast`, which comes from transforming {{{ (f |> co) @t1 @t2 ... @tn ===> ((f @t1') |> co1) @t2 ... @tn ===> (etc) (f @t1' @t2' ... @tn') |> con }}} Doing these steps one at a time gives clearly non-linear behaviour. Hence comment:10. I think comment:41 is in this territory. I think it'd be helpful to separate these issues, possibly into separate tickets. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:48 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Looking at HEAD on `master`, we have the `coercionKindRole` function
#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): Replying to [comment:45 tdammers]: that is simply recursive for NthCo, and has the `nth` call embedded. So for nested NthCo's, its behavior should be quadratic. I disagree here. I think the old `coercionKindRole` was not quadratic in this way (ignoring the `ForAllCo` change). Every `coercionKindRole` recurrence calls `getNth` (twice), but that doesn't lead to quadratic behavior in the depth of `NthCo` nesting, which is what we're worried about. It does mean that running time will be proportional to the sum of the indices in the `NthCo`s, but that's to be expected. Let's put this another way: pretend all the indices in the `NthCo`s are 1, a constant. (This is fairly close to reality, anyway.) Then, we're linear, not quadratic.
The un-refactored version has `coercionKind`, which is simply recursive
(O(n)), and we have `coercionRole`, which recurses via `coercionKindRole`. The latter is really just a matter of calling `coercionKind` and `coercionRole` individually though; the `coercionRole` call makes `coercionRole` simply recursive (O(n)), but the `coercionKind` call introduces another O(n), making the entire thing also quadratic. Agreed here. But note that this would remain quadratic even if all the indices in the `NthCo`s were 1. So, if you call the original `coercionKindRole` quadratic, then this would be ''cubic'' (but I don't think that's a fair characterization -- there's nothing raised to the third power here).
So in terms of big-O, HEAD and "un-refactored" should be the same. Why
one perfoms better than the other is somewhat unclear to me though. Disagree here, as explained above. The old `coercionKindRole` was asymptotically better. But now that we cache roles, it should be good again.
On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum.
Yes, but we don't know if it's actually big-O-faster, or just happens have a more favorable constant factor.
Fair enough. I'm stymied as to why my patches make things worse. Maybe moving the `isReflexiveCo` check in `addCoerce` to the top was a bad idea? Try moving that back to where it was and then try again. The `simplCast` stuff should be vastly better than it was! Thanks! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:49 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Let's get (1) and (2) nailed before starting in on (3) and (4). Maybe move (3) and (4) to fresh tickets. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:50 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:49 goldfire]:
It does mean that running time will be proportional to the sum of the indices in the `NthCo`s, but that's to be expected.
Ah, you are right, this is where I went wrong - the `nth` calls follow the indices, not the nested NthCos themselves. Also explains why I'm not seeing the kind of differences in the profiles that I expected.
The un-refactored version has `coercionKind`, which is simply
recursive (O(n)), and we have `coercionRole`, which recurses via `coercionKindRole`. The latter is really just a matter of calling `coercionKind` and `coercionRole` individually though; the `coercionRole` call makes `coercionRole` simply recursive (O(n)), but the `coercionKind` call introduces another O(n), making the entire thing also quadratic.
Agreed here. But note that this would remain quadratic even if all the
indices in the `NthCo`s were 1. So, if you call the original `coercionKindRole` quadratic, then this would be ''cubic'' (but I don't think that's a fair characterization -- there's nothing raised to the third power here).
So in terms of big-O, HEAD and "un-refactored" should be the same. Why
one perfoms better than the other is somewhat unclear to me though.
Disagree here, as explained above. The old `coercionKindRole` was
asymptotically better. But now that we cache roles, it should be good again. Profiling results don't agree, but other than that it seems plausible. Maybe we're hitting some sort of edge case here?
On the other hand, the separated version really is quadratic... and
Yes, but we don't know if it's actually big-O-faster, or just happens
have a more favorable constant factor.
Fair enough.
I'm stymied as to why my patches make things worse. Maybe moving the `isReflexiveCo` check in `addCoerce` to the top was a bad idea? Try moving
yet it's faster (on this test case)! That's the conundrum. that back to where it was and then try again. The `simplCast` stuff should be vastly better than it was! This one has me baffled as well. There's a slight possibility that the profiling runs I did were contaminated with other activity on the same machine, so I might do another run on a separate machine with nothing else going on. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:51 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Split off the `simplCast` part into #14737, suggest we continue the discussion there. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:52 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): `CoreTidy` performance (part 4) moved to #14738. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:53 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Re-running my profiling job on a pristine machine without significant background noise gives me this: {{{ --- ./cases/Grammar.hs --- d74b37d565: 16.21 8ac966971e: 12.87 4eb140f564: 12.38 73a99750e1: 20.00 4572849929: 19.67 452dee3ff4: 256.07 --- ./cases/test_rules2.hs --- d74b37d565: 1.12 8ac966971e: 1.10 4eb140f564: 1.10 73a99750e1: 1.10 4572849929: 1.07 452dee3ff4: 1.03 }}} Again, with meanings: {{{ d74b37d565: simplCore improvements 8ac966971e: With NthCo Role caching 4eb140f564: Un-refactored 73a99750e1: Added some SCC's 4572849929: Simon's patch applied 452dee3ff4: baseline, GHC `master` before branching off }}} Which mostly confirms the earlier conclusions. The `test_rules2.hs` example seems to get worse as we proceed through the changes though. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:54 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Do you have allocation numbers for these runs? The mysterious thing is that `NthCo` caching makes things slightly worse. Of course the `NthCo` constructors have an extra field, but I'm still quite surprised that's a visible worsening. So I think it might be worth a little more digging into the worsening in `8ac966971e`, really just to discover if we have accidentally left money on the table. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:55 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): NB: the final one, `d74b37d565: simplCore improvements` is the domain of #14737 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:56 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:55 simonpj]:
Do you have allocation numbers for these runs?
The mysterious thing is that `NthCo` caching makes things slightly worse. Of course the `NthCo` constructors have an extra field, but I'm still quite surprised that's a visible worsening. So I think it might be worth a little more digging into the worsening in `8ac966971e`, really just to discover if we have accidentally left money on the table.
Yes, of course. This is `8ac966971e`: {{{ Tue Jan 30 11:25 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp total time = 12.87 secs (12865 ticks @ 1000 us, 1 processor) total alloc = 14,385,409,080 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc simplCast Simplify compiler/simplCore/Simplify.hs:871:62-87 33.7 32.4 addCoerce-pushCoTyArg Simplify compiler/simplCore/Simplify.hs:(1236,12)-(1237,72) 13.1 13.1 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 12.3 14.0 subst_ty TyCoRep compiler/types/TyCoRep.hs:2237:28-32 6.7 10.4 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 4.4 3.9 coercionKind Coercion compiler/types/Coercion.hs:1725:3-7 2.6 6.0 zonkTopDecls TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43) 2.0 2.2 load'.checkHowMuch GhcMake compiler/main/GhcMake.hs:(270,9)-(272,27) 2.0 0.0 simplCast-simplCoercion Simplify compiler/simplCore/Simplify.hs:1224:57-77 1.6 1.5 simplExprF1-Lam Simplify compiler/simplCore/Simplify.hs:896:5-39 1.6 2.2 deSugar HscMain compiler/main/HscMain.hs:511:7-44 1.5 1.3 simplCast-addCoerce Simplify compiler/simplCore/Simplify.hs:1225:53-71 1.4 1.3 tcRnImports TcRnDriver compiler/typecheck/TcRnDriver.hs:240:20-50 1.0 0.1 Parser HscMain compiler/main/HscMain.hs:(316,5)-(384,20) 0.9 1.6 }}} Which isn't actually significantly worse than `4eb140f564`: {{{ Tue Jan 30 11:45 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp total time = 12.38 secs (12380 ticks @ 1000 us, 1 processor) total alloc = 14,385,403,880 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc simplCast Simplify compiler/simplCore/Simplify.hs:871:62-87 34.3 32.4 addCoerce-pushCoTyArg Simplify compiler/simplCore/Simplify.hs:(1236,12)-(1237,72) 13.3 13.1 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 12.7 14.0 subst_ty TyCoRep compiler/types/TyCoRep.hs:2225:28-32 7.0 10.4 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 4.5 3.9 coercionKind Coercion compiler/types/Coercion.hs:1707:3-7 2.9 6.0 zonkTopDecls TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43) 2.0 2.2 simplExprF1-Lam Simplify compiler/simplCore/Simplify.hs:896:5-39 1.8 2.2 simplCast-simplCoercion Simplify compiler/simplCore/Simplify.hs:1224:57-77 1.7 1.5 deSugar HscMain compiler/main/HscMain.hs:511:7-44 1.6 1.3 simplCast-addCoerce Simplify compiler/simplCore/Simplify.hs:1225:53-71 1.4 1.3 simplIdF Simplify compiler/simplCore/Simplify.hs:868:61-79 1.0 0.5 Parser HscMain compiler/main/HscMain.hs:(316,5)-(384,20) 1.0 1.6 }}} This is only about 4% more execution time, and 0.000035% more allocations. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:57 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Split off tasks 1 (the original patch from [comment:5]) and 2 (further refactoring of coercionKind/Role) into separate branches, cherry-picking commits onto current GHC HEAD. Branches are `wip/tdammers/T11735-1` and `wip/tdammers/T11735-2` (the latter also contains commits from the former, because I figured that would make more sense). I also pushed the patch for task 1 to phabricator [https://phabricator.haskell.org/D4355 D4355]. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:58 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): So the allocations are really close. Where does the extra time go? Can you show mutator and GC times? Maybe residency, and hence GC time, is higher? Perhaps just an accident of the moments at which major GC happens. What happens if you use use just one generation? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:59 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I think I see what's going on here. The "cached role" results are from commit 8ac966971ec6c30cc3681a913fb9fb1c2342f6cc. This version eagerly computes the role for an `NthCo` in `mkNthCo`. This means that every (`seq`ed) `NthCo` requires a role calculation. Perhaps sometimes, we create an `NthCo` and never ever check its role. If that's the case, then 8ac96 will be worse. On the other hand, what if you try my commit 8a6aa5030d34592200fbe799bf38abf3701544db? (Do not be thrown off by the same first two characters of the hash! I was.) That commit supplies the role to `mkNthCo`, as it can often be deduced by the caller of `mkNthCo` without too much trouble. No redundant role computation. This one should really be a clear win. By the way, I changed slightly the way I cached the role: I allowed a more permissive role in the `NthCo`. For example, suppose you have `g :: [a] ~N [b]` and you want `h :: a ~R b`. Before these patches, you would use `h = SubCo (NthCo 0 g)`. With the patch, you can now say `h = NthCo Representational 0 g`, where Lint checks to make sure that the role is appropriate. Because of this new degree of freedom, it became necessary to serialize the role in iface files. (The serializer could instead be clever and convert an `NthCo` into a `SubCo` and an `NthCo`, but that seemed not to be worth it.) I forget if this realization about iface files made it into the original patch -- that change might be in a later commit. But the commit I referenced above should work fine outside of iface files. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:60 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): It's nice that `OptCoercion` for `NthCo` gets simpler when we have the role cached.
I allowed a more permissive role in the NthCo
OK, but * I didn't see a change to `mkSubCo` that would take advantage of this (by changing the role rather than wrapping in `SubCo`) * The comment on the constructor `NthCo` should state the invariant. Given `(NthCo r n co)` we require that `lteRole r (nthRole (coercionRole co) n))` or something like that. * Why does't the same flexibility apply for the other coercions that cache roles: `FunCo`, `Refl`, `TyConAppCo`, `UnivCo`? We should state their invariants too... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:61 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): All good points. Reconsidering this, I think it's probably best for me to undo this choice -- it adds a subtle complication to the theory for a very tiny upside. I probably won't get to do this until Friday, but I'll make this tweak. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:62 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): Note that this last point should have almost no effect at all on performance, and so any data collected now would still be valid. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:63 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Fine. In undoing, could you also add the invariants for all Role annotations to the data type itself? * The comment on the constructor NthCo should state the invariant. Given (NthCo r n co) we require that lteRole r (nthRole (coercionRole co) n)) or something like that. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:64 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind
-------------------------------------+-------------------------------------
Reporter: goldfire | Owner: (none)
Type: task | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:60 goldfire]:
I think I see what's going on here. The "cached role" results are from commit 8ac966971ec6c30cc3681a913fb9fb1c2342f6cc. This version eagerly computes the role for an `NthCo` in `mkNthCo`. This means that every (`seq`ed) `NthCo` requires a role calculation. Perhaps sometimes, we create an `NthCo` and never ever check its role. If that's the case, then 8ac96 will be worse.
On the other hand, what if you try my commit 8a6aa5030d34592200fbe799bf38abf3701544db? (Do not be thrown off by the same first two characters of the hash! I was.) That commit supplies the role to `mkNthCo`, as it can often be deduced by the caller of `mkNthCo` without too much trouble. No redundant role computation. This one should really be a clear win.
I've done another run using just these three commits, and it shows that the last one is slightly better than the original situation for the `test_rules2` case, and about the same for the `Grammar.hs` case: {{{ --- ./cases/Grammar.hs --- 4eb140f564: 12.31 8ac966971e: 12.70 8a6aa5030d: 12.31 --- ./cases/test_rules2.hs --- 4eb140f564: 1.11 8ac966971e: 1.14 8a6aa5030d: 1.09 }}} So the `8ac...` commit makes things slightly worse, but at `8a6...` we're back at equal-or-better. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:66 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind
-------------------------------------+-------------------------------------
Reporter: goldfire | Owner: (none)
Type: task | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Simon Peyton Jones

#11735: Optimize coercionKind
-------------------------------------+-------------------------------------
Reporter: goldfire | Owner: (none)
Type: task | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by tdammers):
Replying to [comment:67 Simon Peyton Jones
In [changeset:"db5a4b83b14bf4f1adb417b7664347fdaf637fd6/ghc" db5a4b83/ghc]: {{{ #!CommitTicketReference repository="ghc" revision="db5a4b83b14bf4f1adb417b7664347fdaf637fd6" Re-center improved perf for T3064
There's a 6% reduction in allocation on T3064. I think it's due to
commit e4ab65bd57b2c39f4af52879654514bb6d5b42a0 Author: Tobias Dammers
Date: Wed Jan 31 21:39:45 2018 -0500 Optimize coercionKind (Trac #11735)
I'm not certain -- but, hey, it's good news }}}
We already knew that this change gives us insane improvements in at least one case, so I guess the fact that we see this reflected somewhere in the test suite is good, I presume. Wonder what T3064 does and whether it's related. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:68 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:59 simonpj]:
So the allocations are really close. Where does the extra time go?
Can you show mutator and GC times? Maybe residency, and hence GC time, is higher? Perhaps just an accident of the moments at which major GC happens. What happens if you use use just one generation?
I've done a run with `-s`, which gives us: {{{ --- ./cases/Grammar.hs --- 4eb140f564: 12.31 sec, max res. 257,842,544 8ac966971e: 12.70 sec, max res. 425,496,520 8a6aa5030d: 12.31 sec, max res. 254,680,256 --- ./cases/test_rules2.hs --- 4eb140f564: 1.11 sec, max res. 84,030,624 8ac966971e: 1.14 sec, max res. 81,325,072 8a6aa5030d: 1.09 sec, max res. 56,322,200 }}} So it looks like your hunch was about right: in the Grammar.hs case, max residency almost doubles in `8ac...`, but `8a6` then gets us back to normal; for the test_rules2.hs case, we see only improvement. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:69 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Status: - Thing 1: Getting rid of non-linearity in coercionKind for the ForallCo case. Done. - Thing 2: Attacking non-linearity in coercionKindRole on NthCo's. I think this part is also more or less done, but I haven't seen anything on phab yet. - Thing 3: Fixing performance issues with `simplCast` / `SimplCore`. Moved to #14737, I believe @goldfire is still working on this. - Thing 4: Investigating `CoreTidy`. Moved to #14738, however it seems that the problem no longer manifests on GHC HEAD. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:70 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): I'm not still working on this. I pushed the changes I had in mind to the tdammers branch. If the performance is OK there, could you please Phab and then commit? As for `simplCast`, the changes I had in mind are all on your branch, including removing the inefficient one-variable substitution. If my changes aren't an improvement, try undoing the change I made to the ordering of cases in `simplCast`: I moved the `isReflexiveCo` case to the top, but perhaps that check is too expensive and is better at the bottom. Regardless, the changes I made in CoreOpt should be enough to improve performance. Thanks! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:71 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): https://phabricator.haskell.org/D4385 should have all the relevant changes for subtasks 2 and 3, and the SCC's I added to (unsuccessfully) hunt down subtask 4. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:72 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Good progress * Let's split the commits for Thing 2 and Thing 3; they are quite different. * We agreed in comment:62 not to do the subtle `SubCo` stuff for `NthCo`; that means some changes (simplifications). * I have not properly reviewed the `simplCast` changes, but they don't appear to follow the plan I suggest in comment:10. Richard can you comment on that? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:73 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): Yes, I know changing `NthCo` caching is outstanding. For the `simplCast` changes: you're right -- I ignored your suggestions opting for something simpler. In HEAD, `pushTyCoArg` (along with friends) goes to some trouble to create a type-correct `Refl` coercion. But this coercion soon ends up being used in a cast, where `Refl`s are simply discarded. So all that work in building the coercion is wasted. My change is to return `Maybe Coercion`, preferring `Nothing` over `Just (Refl ...)`. This avoids the inefficient one-variable substitution that we were worrying about. It should be a clear win; we're doing strictly less work. The small complication is that there are these `Maybe`s floating about, but I think it's worth it. I'm hoping the comments on the `push` functions are clear enough, but perhaps I didn't include any information on ''why'' `Maybe Coercion` is better than `Coercion` here. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:74 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): OK: once we have a separate Phab for the `simplCast` change, I'll review it. Ping me, Tobias! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:75 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): OK. I've pushed my refactoring of `NthCo` to the branch we've been working on. But I'm unsure the current state of affairs for the Phab stuff, so I didn't upload there. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:76 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:75 simonpj]:
OK: once we have a separate Phab for the `simplCast` change, I'll review it. Ping me, Tobias!
I split up D4385 into https://phabricator.haskell.org/D4394 (the NthCo stuff) and https://phabricator.haskell.org/D4395 (simplCast / Simplify). May I assume that the refactoring mentioned in comment:76 should go in D4394? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:77 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by goldfire): Yes, please. Many thanks! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:78 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Alright, all the changes from the wip/11735 branch should be on either D4394 or D4395 now. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:79 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Effect of D4394 on compilation of `Grammar.hs`: Compiling with GHC HEAD (`02b3dad195`), built with the `quick` profile: {{{ tobias@zoidberg:~/well-typed/devel/ghc-T14683/ > time ../ghc/HEAD/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2 [1 of 1] Compiling Grammar ( grammar-hs/Grammar.hs, grammar- hs/Grammar.o ) ../ghc/HEAD/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2 465.26s user 0.52s system 100% cpu 7:45.42 total }}} Compiling with D4394 applied (`6493976fdb`), built with the `quick` profile: {{{ tobias@zoidberg:~/well-typed/devel/ghc-T14683/ > time ../ghc/D4394/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2 [1 of 1] Compiling Grammar ( grammar-hs/Grammar.hs, grammar- hs/Grammar.o ) ../ghc/D4394/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2 25.95s user 0.36s system 94% cpu 27.977 total }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:80 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395 Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * differential: => D4394 D4395 * related: => #14683 #14975 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:81 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395 Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Profile after applying D4395 (coercionKind / Role refactoring): {{{ Mon Mar 26 15:24 2018 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -RTS -B/home/tobias/well- typed/devel/ghc/D4395-modified/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp total time = 19.66 secs (19655 ticks @ 1000 us, 1 processor) total alloc = 24,638,084,488 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc simplCast Simplify compiler/simplCore/Simplify.hs:(1213,5)-(1215,37) 73.8 76.0 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7) 8.3 8.4 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 2.8 2.2 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:770:39-74 2.7 1.9 coercionKind Coercion compiler/types/Coercion.hs:1698:3-7 1.8 3.4 zonkTopDecls TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43) 1.3 1.3 deSugar HscMain compiler/main/HscMain.hs:511:7-44 1.1 0.8 }}} This means that performance is on par, but `simplCast` still scores very high, so we should do more digging there. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:82 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395 Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Further digging into `simplCast` performance to be handled in #14737. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:83 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Yes!! `simplCast` should not take 73% of compiler time!! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:84 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind -------------------------------------+------------------------------------- Reporter: goldfire | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395 #14737 | Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * related: #14683 #14975 => #14683 #14975 #14737 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11735#comment:85 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11735: Optimize coercionKind
-------------------------------------+-------------------------------------
Reporter: goldfire | Owner: (none)
Type: task | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395
#14737 |
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#11735: Optimize coercionKind
-------------------------------------+-------------------------------------
Reporter: goldfire | Owner: (none)
Type: task | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #14683 #14975 | Differential Rev(s): D4394 D4395
#14737 |
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari
participants (1)
-
GHC