Re: [GHC] #8763: forM_ [1..N] does not get fused (10 times slower than go function)

29 Mar 2018

      #8763: forM_ [1..N] does not get fused (10 times slower than go function)
-------------------------------------+-------------------------------------
        Reporter:  nh2               |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.6.1
       Component:  Compiler          |              Version:  7.6.3
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #7206             |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 It seems that for `IO`, GHC decides that it's OK to inline `c` from the
 [https://hackage.haskell.org/package/base-4.11.0.0/docs/src/GHC.Enum.html#efd...
 fusion helper of enumFromThenTo], but not so for `ST s`.

 For our case, `c` is the `<huge>` computation (see the worker `$wc` in
 comment:44) performed for each outer list element. Inlining would
 duplicate it, because it's mentioned thrice in the definition of
 `efdtIntUpFB`. Consequently, `c` almost always has `Guidance=NEVER`,
 except in the `IO` case, where it miraculously gets
 `Guidance=IF_ARGS [20 420 0] 674 0` just when it is inlined. I'm not sure
 what this decision is based on.

 The inlining decision for `eftIntFB` is much easier: `c`
 [https://hackage.haskell.org/package/base-4.11.0.0/docs/src/GHC.Enum.html#eft...
 only occurs once there].
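
 To make the difference in call sites concrete, here is a simplified sketch. The names `eftFB` and `efdtUpFB` are hypothetical boxed stand-ins modelled on the shape of the `GHC.Enum` helpers, not the actual `base` source (which works on unboxed `Int#`), and `efdtUpFB` assumes an upward step (`x2 > x1`).

```haskell
-- Simplified, boxed sketch (NOT the real base code).
-- In the unit-step helper, `c` has a single call site inside the loop.
eftFB :: (Int -> r -> r) -> r -> Int -> Int -> r
eftFB c n x0 y
  | x0 > y = n
  | otherwise = go x0
  where
    go x = x `c` (if x == y then n else go (x + 1))

-- In the stepped helper, `c` is called from several places, so inlining
-- a large `c` would copy its body once per call site.
-- Assumption: x2 > x1 (upward enumeration with a non-zero step).
efdtUpFB :: (Int -> r -> r) -> r -> Int -> Int -> Int -> r
efdtUpFB c n x1 x2 y
  | y < x2 = if y < x1 then n else x1 `c` n
  | otherwise = x1 `c` goUp x2
  where
    delta = x2 - x1
    y' = y - delta -- last value from which another step still fits
    goUp x
      | x > y' = x `c` n
      | otherwise = x `c` goUp (x + delta)
```

 Instantiating `c` with `(:)` and `n` with `[]` recovers the ordinary lists, e.g. `efdtUpFB (:) [] 1 3 9` should agree with `[1, 3 .. 9]`.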

 I'm not sure if `IO` gets special treatment by the inliner, but I see a
 few ways out:

 * Do the same hacks for `ST`, if there are any which apply (ugly)
 * Reduce the number of calls to `c` in the implementation of
 `efdtIntUpFB`, probably for worse branch prediction
 * Figure out why the floated-out expression `\x -> (nop x *>)` occurring
 in `forM_ nop = flip mapM_ nop = foldr ((>>) . nop) (return ())` doesn't
 get eta-expanded in the `ST` case, whereas the corresponding `IO` code
 does. I hope that by fixing this, the `c` expression inlines again.

 Here's how it inlines for `IO`:

 {{{
   (>>) . nop
 = \x -> (nop x >>)
 = \x -> (nop x *>) -- notice how it's no different than ST up until here
 = \x -> (thenIO (nop x))
 }}}

 The inliner probably stops here, but because of eta-expansion modulo
 coercions to `\x k s -> thenIO (nop x) k s`, we can inline
 [https://hackage.haskell.org/package/base-4.11.0.0/docs/src/GHC.Base.html#the...
 thenIO]:

 {{{
   \x k s -> thenIO (nop x) k s
 = \x k s -> case nop x s of (# new_s, _ #) -> k new_s
 }}}
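
 The same step can be written out as ordinary Haskell. In this sketch `M`, `runM`, `thenM`, `stepPartial` and `stepExpanded` are hypothetical names: `M` is a boxed stand-in for the state-token newtypes behind `IO` and `ST`, not their real definitions.

```haskell
-- Hypothetical stand-in for IO / ST s: a state-passing function.
newtype M s a = M (s -> (s, a))

runM :: M s a -> s -> (s, a)
runM (M f) = f

-- Analogue of thenIO: run the first action, drop its result, run the second.
thenM :: M s a -> M s b -> M s b
thenM m k = M (\s -> case runM m s of (new_s, _) -> runM k new_s)

-- The partially applied form, like `\x -> (nop x *>)`: thenM is not
-- saturated here, so there is no state argument to simplify against.
stepPartial :: (x -> M s a) -> x -> M s b -> M s b
stepPartial nop = \x -> thenM (nop x)

-- The eta-expanded form with thenM inlined by hand: the continuation and
-- the state are explicit and the body is a plain case expression.
stepExpanded :: (x -> M s a) -> x -> M s b -> M s b
stepExpanded nop = \x k ->
  M (\s -> case runM (nop x) s of (new_s, _) -> runM k new_s)
```

 The two functions are extensionally equal; the difference is only in how much of the body is exposed to the simplifier.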

 which is much better and probably more keenly inlined than `\x -> (nop x
 *>)` in the `ST` case. What makes GHC eta-expand one, but not the other?

 This is just a wild guess and the only real difference I could make out in
 diffs. Maybe someone with actual insights into the simplifier can comment
 on this claim (that the inliner gives up on `c` due to the missed eta-
 expansion and inlining)?

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:45
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler