Re: [GHC] #13851: Change in specialisation(?) behaviour since 8.0.2 causes 6x slowdown

22 Jun 2017

      #13851: Change in specialisation(?) behaviour since 8.0.2 causes 6x slowdown
-------------------------------------+-------------------------------------
        Reporter:  mpickering        |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.2.1-rc2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by simonpj):

 Here is what is happening:

 * Before float-out we have
 {{{
   $stest1mtl = \eta. ...foldr (\x k z. blah) z e...
 }}}
   Since the first arg of the foldr has no free vars, we float it out to
 give
 {{{
   lvl = \x k z. blah
   $stest1mtl = \eta. ...foldr lvl z e...
 }}}

 * That makes `$stest1mtl` small, so it is inlined at its two call sites
 (the first two test cases in `main`).

 * So now there are two calls to `lvl`, and it is quite big, so it doesn't
 get inlined.

 * But actually it is much better ''not'' to inline `$stest1mtl`, and
 instead (after the foldr/build stuff has happened) to inline `lvl` back
 into it.
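
 As a source-level illustration of the float-out step described above, here
 is a minimal sketch.  The names `sumBefore` and `sumAfter` and the body of
 the lambda are invented (the ticket's `blah` is not shown), and the
 `NOINLINE` pragma only mimics the situation where `lvl` is too big to
 inline.

```haskell
-- Hypothetical stand-in for the ticket's code; the lambda body is made up.

-- Before float-out: the closed lambda sits inside the function body.
sumBefore :: [Int] -> Int -> Int
sumBefore e z0 = foldr (\x k z -> k (z + x)) id e z0

-- After float-out: the lambda has no free variables, so it can become a
-- top-level binding, which leaves the calling function very small.
lvl :: Int -> (Int -> Int) -> Int -> Int
lvl x k z = k (z + x)
{-# NOINLINE lvl #-}  -- mimics "lvl is quite big, so it doesn't get inlined"

sumAfter :: [Int] -> Int -> Int
sumAfter e z0 = foldr lvl id e z0
```

 Both versions compute the same result; the difference is only in which
 binding ends up small enough to be inlined at its call sites.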

 This kind of thing is not new; I trip over it quite often.  Generally, given
 {{{
   f = e
   g = ...f...
   h = ...g...g...f...
 }}}
 should we inline `f` into `g`, thereby making `g` big, so it doesn't
 inline into `h`? Or should we instead inline `g` into `h`?  Sometimes one
 is better, sometimes the other; I don't know any systematic way of doing
 The Right Thing all the time.  It turned out that the early-inline patch
 changed the choice, which resulted in the changed performance.
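
 To make the two options concrete, here is a small sketch in which pragmas
 force one of the choices by hand.  The bodies of `f`, `g` and `h` are
 invented for illustration, and the pragmas are only a way to experiment
 with the trade-off; they are not a proposal from this ticket.

```haskell
-- Hypothetical definitions standing in for f, g and h above.

f :: Int -> Int
f x = x * x + 1
{-# INLINE f #-}      -- choice A: inline f into its callers, making g bigger

g :: Int -> Int
g y = f y + f (y + 1)
{-# NOINLINE g #-}    -- ... and leave g as an out-of-line call inside h

h :: Int -> Int
h n = g n + g (n + 1) + f n
```

 Swapping the pragmas (`NOINLINE f`, `INLINE g`) gives the other choice.
 Comparing the Core from `-ddump-simpl` for the two variants is one way to
 see which works out better for a given program.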

 However, I did spot several things worth trying out:

 * In `CoreArity.rhsEtaExpandArity` we carefully do not eta-expand thunks.
 But I saw some thunks like
 {{{
         lvl_s621
           = case z_a4NJ of wild_a4OF { GHC.Types.I# x1_a4OH ->
             case x_a4NH of wild1_a4OJ { GHC.Types.I# y1_a4OL ->
             case GHC.Prim.<=# x1_a4OH y1_a4OL of {
               __DEFAULT -> (\ _ (eta_B1 :: Int) -> (wild_a4OF, eta_B1))
               1# ->        (\ _ (eta_B1 :: Int) -> (wild1_a4OJ, eta_B1))
 }}}
   Here it really would be good to eta-expand; then that particular `lvl`
 could be inlined at its call sites.  Here's a change to
 `CoreArity.rhsEtaExpandArity` that did the job:
 {{{
 -        | isOneShotInfo os || has_lam e -> 1 + length oss
 +        | isOneShotInfo os || not (is_app e) -> 1 + length oss

 -    has_lam (Tick _ e) = has_lam e
 -    has_lam (Lam b e)  = isId b || has_lam e
 -    has_lam _          = False
 +    is_app (Tick _ e) = is_app e
 +    is_app (App f _)  = is_app f
 +    is_app (Var _)    = True
 +    is_app _          = False
 }}}
   Worth trying.

 * Now the offending top-level `lvl` function is still not inlined; but it
 has a function argument that is applied, so the call sites look like
 {{{
       lvl ... (\ab. blah) ...
 }}}
   When considering inlining we do get a discount for the application of the
 argument inside `lvl`'s rhs, but it was only a discount of 60, which seems
 small considering how great it is to inline a function.  Boosting it to
 150 with `-funfolding-fun-discount=150` makes the function inline, and we
 get good code all round.  Maybe we should just up the default.

 * All the trouble is caused by the early float-out.  I think we could try
 just eliminating it.
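
 For the first item above (the `rhsEtaExpandArity` change), the difference
 between the two guards can be seen on a toy expression type.  This is only
 a sketch: `Expr` is a made-up, much simplified stand-in for GHC's Core,
 every binder is treated as a value binder (so `isId b` is always true), and
 only the second disjunct of the guard is modelled, not `isOneShotInfo`.

```haskell
-- Toy expression type; NOT GHC's CoreExpr.
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  | Tick String Expr
  | Case Expr [Expr]   -- scrutinee and the right-hand sides of the alternatives
  deriving Show

-- Old test: the right-hand side is (possibly under ticks) a lambda.
hasLam :: Expr -> Bool
hasLam (Tick _ e) = hasLam e
hasLam (Lam _ _)  = True   -- toy simplification: all binders are value binders
hasLam _          = False

-- New test: the right-hand side is (possibly under ticks) a variable
-- applied to zero or more arguments.
isApp :: Expr -> Bool
isApp (Tick _ e) = isApp e
isApp (App f _)  = isApp f
isApp (Var _)    = True
isApp _          = False

-- A thunk shaped like lvl_s621: a case whose branches are lambdas.
lvlLike :: Expr
lvlLike =
  Case (Var "z")
    [ Lam "u" (Lam "eta" (Var "pairWithZ"))
    , Lam "u" (Lam "eta" (Var "pairWithX"))
    ]

-- An ordinary call, which neither guard wants to eta-expand.
callLike :: Expr
callLike = App (App (Var "f") (Var "a")) (Var "b")

oldGuard, newGuard :: Expr -> Bool
oldGuard = hasLam
newGuard = not . isApp
```

 In this model `oldGuard lvlLike` is `False` while `newGuard lvlLike` is
 `True`, and both guards reject `callLike`.  That matches the intent of the
 patch: a case that returns lambdas becomes a candidate for eta-expansion.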
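
 For the second item above (the function-argument discount), the flag can
 be tried on a single module with an `OPTIONS_GHC` pragma.  The module below
 is a hypothetical example, not code from the ticket; a function this small
 is likely to be inlined regardless of the discount, so it shows only where
 the flag goes and what kind of call site it is about.

```haskell
{-# OPTIONS_GHC -funfolding-fun-discount=150 #-}
-- The pragma above applies the flag to this module only; the same flag
-- can also be passed on the ghc command line.

-- Hypothetical function whose argument k is applied in its body, which is
-- the situation that earns the function-argument discount.
applyTwice :: (Int -> Int) -> Int -> Int
applyTwice k n = k (k n)

-- The call site passes a lambda, like `lvl ... (\ab. blah) ...` above.
caller :: Int -> Int
caller n = applyTwice (\a -> a + 1) n
```

 Whether the higher discount changes the inlining decision for a real
 function depends on its size, the GHC version and the optimisation level.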

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13851#comment:5
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler