Re: [GHC] #8763: forM_ [1..N] does not get fused (10 times slower than go function)

29 Mar 2018

      #8763: forM_ [1..N] does not get fused (10 times slower than go function)
-------------------------------------+-------------------------------------
        Reporter:  nh2               |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.6.1
       Component:  Compiler          |              Version:  7.6.3
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #7206             |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 Here is a smaller example that highlights the problem without vectors. The
 only difference between the two functions is the use of `[2,3..n]` instead
 of `[2..n]`, which desugar to different functions (`enumFromThenTo` and
 `enumFromTo`, respectively). This results in a huge difference in
 allocation as well as runtime:

 {{{
 $ ./repro 2 +RTS -s # [2,3..n]
 ()
      960,056,856 bytes allocated in the heap
           21,536 bytes copied during GC
           44,576 bytes maximum residency (2 sample(s))
           29,152 bytes maximum slop
                2 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0       918 colls,     0 par    0.005s   0.003s     0.0000s    0.0000s
   Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0002s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.123s  (  0.125s elapsed)
   GC      time    0.005s  (  0.003s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.129s  (  0.129s elapsed)

   %GC     time       4.1%  (2.5% elapsed)

   Alloc rate    7,778,808,106 bytes per MUT second

   Productivity  95.8% of total user, 97.4% of total elapsed
 }}}

 {{{
 $ ./repro 1 +RTS -s # [2..n]
 ()
           56,872 bytes allocated in the heap
            3,480 bytes copied during GC
           44,576 bytes maximum residency (1 sample(s))
           25,056 bytes maximum slop
                2 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
   Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.048s  (  0.048s elapsed)
   GC      time    0.000s  (  0.000s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.048s  (  0.048s elapsed)

   %GC     time       0.2%  (0.2% elapsed)

   Alloc rate    1,188,432 bytes per MUT second

   Productivity  99.6% of total user, 99.6% of total elapsed
 }}}
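
 For readers without the attached program, the shape of such a repro can be
 sketched as follows. This is a reconstruction under assumptions, not the
 original source: the names `sumRange` and `sumStepped`, the `STRef`-based
 loop body, and the `Int` result are all hypothetical (the real program
 prints `()`). Only the `[2..n]` versus `[2,3..n]` contrast, the use of
 `ST`, the argument convention (`1` or `2`) and `n = 40000000` come from
 this ticket. Whether this particular body shows the same allocation gap may
 depend on the GHC version and flags.

```haskell
module Main (main) where

import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)
import System.Environment (getArgs)

-- Hypothetical stand-in for the fast variant: [2..n] desugars to enumFromTo.
sumRange :: Int -> ST s Int
sumRange n = do
  ref <- newSTRef 0
  forM_ [2..n] $ \i -> modifySTRef' ref (+ i)
  readSTRef ref
{-# NOINLINE sumRange #-}

-- Hypothetical stand-in for the slow variant: [2,3..n] desugars to
-- enumFromThenTo. Apart from the list syntax, the code is identical.
sumStepped :: Int -> ST s Int
sumStepped n = do
  ref <- newSTRef 0
  forM_ [2,3..n] $ \i -> modifySTRef' ref (+ i)
  readSTRef ref
{-# NOINLINE sumStepped #-}

main :: IO ()
main = do
  args <- getArgs
  case args of
    ["1"] -> print (runST (sumRange 40000000))
    ["2"] -> print (runST (sumStepped 40000000))
    _     -> putStrLn "usage: repro (1|2) +RTS -s"
```

 The `NOINLINE` pragmas keep both functions from being inlined into `main`,
 so each loop is compiled on its own. Building with `-O2` and running with
 `+RTS -s` gives statistics in the format shown above.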

 This happens in `ST`, but not in `IO`, so it is probably related to some
 `IO`-specific hack in GHC. Also, the difference vanishes when we allow the
 functions to inline.

 Here's some Core for `g` (the offending function):

 {{{
 -- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13}
 $wg
 $wg
   = \ @ s ww w ->
       let { $wc = <huge> } in
       case <# ww 3# of {
         __DEFAULT ->
           let {
             y'
             y' = -# ww 1# } in
           letrec {
             go_up
             go_up
               = \ x eta ->
                   case ># x y' of {
                      __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` Co:4) eta;
                     1# -> $wc x (lvl `cast` Co:4) eta
                   }; } in
           $wc 2# ((go_up 3#) `cast` Co:4) w;
         1# ->
           case <# ww 2# of {
             __DEFAULT -> $wc 2# (lvl `cast` Co:4) w;
             1# -> (# w, () #)
           }
       }
 }}}

 From my understanding of join points, `$wc` is only nearly a join point,
 because `go_up`, with its transitive tail call to `$wc`, appears in argument
 position. It would be great if we could get rid of this! The `IO` variant
 (`g 40000000 >>= print`) doesn't have this weakness; it's all join points
 there. Hence my suspicion about some `IO` hack that lets GHC eta-expand
 stuff more aggressively, but I'm not sure how that's helping.
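
 To compare the two monads side by side, an `IO` counterpart of the stepped
 loop could look like the sketch below. The name `sumSteppedIO`, the
 `IORef`-based body and the `Int` result are assumptions made for
 illustration; only the `[2,3..n]` enumeration, the `NOINLINE` setting and
 the `>>= print` call shape with `n = 40000000` are taken from this comment.
 Dumping its Core (for example with `-ddump-simpl -dsuppress-all`) is one way
 to check whether the loop body ends up as join points.

```haskell
module Main (main) where

import Control.Monad (forM_)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Hypothetical IO version of the stepped loop: same [2,3..n] enumeration
-- (enumFromThenTo), but running in IO instead of ST.
sumSteppedIO :: Int -> IO Int
sumSteppedIO n = do
  ref <- newIORef 0
  forM_ [2,3..n] $ \i -> modifyIORef' ref (+ i)
  readIORef ref
{-# NOINLINE sumSteppedIO #-}

main :: IO ()
main = sumSteppedIO 40000000 >>= print
```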

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:44
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler