Re: [GHC] #9476: Implement late lambda-lifting

30 Nov 2018

      #9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  closed
        Priority:  normal            |            Milestone:  8.8.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:  fixed             |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):  Phab:D5224
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 Thanks for pointing me to `-G1`, very interesting! The difference in bytes
 copied, and consequently in runtime, is even more pronounced:

 {{{
 $ ./default 19 +RTS -s -G1 > /dev/null
      359,455,256 bytes allocated in the heap
      334,966,000 bytes copied during GC
      188,250,032 bytes maximum residency (9 sample(s))
        2,125,824 bytes maximum slop
              179 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0         9 colls,     0 par    0.431s   0.431s     0.0478s    0.2337s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.123s  (  0.123s elapsed)
   GC      time    0.431s  (  0.431s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.554s  (  0.554s elapsed)

   %GC     time       0.0%  (0.0% elapsed)

   Alloc rate    2,928,555,183 bytes per MUT second

   Productivity  22.2% of total user, 22.2% of total elapsed

 $ ./allow-cg 19 +RTS -s -G1 > /dev/null
      401,712,312 bytes allocated in the heap
      185,583,392 bytes copied during GC
       97,712,192 bytes maximum residency (38 sample(s))
        1,275,840 bytes maximum slop
               93 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        38 colls,     0 par    0.221s   0.221s     0.0058s    0.1098s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.104s  (  0.104s elapsed)
   GC      time    0.221s  (  0.221s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.325s  (  0.325s elapsed)

   %GC     time       0.0%  (0.0% elapsed)

   Alloc rate    3,878,228,290 bytes per MUT second

   Productivity  31.9% of total user, 31.9% of total elapsed

 }}}

 The residency was roughly cut in half! Also note the difference in the number
 of collections, and that MUT is lower than in the baseline (which doesn't
 lift the `go` function above). Probably a caching side-effect of the smaller
 residency, as the situation is still the same with `-A400M`, where the
 baseline is faster.
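
 To put numbers on the comparison, here is a small sketch that computes the
 lifted-to-baseline ratios from the two `+RTS -s -G1` runs above. The `Run`
 record and the `ratio` helper are names made up for this illustration; only
 the figures come from the output.

```haskell
-- Figures copied from the two `+RTS -s -G1` runs quoted in this comment.
-- The record type and helper names are hypothetical, for illustration only.
data Run = Run
  { copiedBytes  :: Double  -- "bytes copied during GC"
  , maxResidency :: Double  -- "bytes maximum residency"
  , gcSeconds    :: Double  -- "GC time"
  , mutSeconds   :: Double  -- "MUT time"
  }

baseline, lifted :: Run
baseline = Run 334966000 188250032 0.431 0.123  -- ./default
lifted   = Run 185583392  97712192 0.221 0.104  -- ./allow-cg

-- Value of a metric in the lifted run relative to the baseline run.
ratio :: (Run -> Double) -> Double
ratio metric = metric lifted / metric baseline

main :: IO ()
main = do
  putStrLn ("bytes copied:  " ++ show (ratio copiedBytes))
  putStrLn ("max residency: " ++ show (ratio maxResidency))
  putStrLn ("GC time:       " ++ show (ratio gcSeconds))
  putStrLn ("MUT time:      " ++ show (ratio mutSeconds))
```

 The ratios come out at roughly 0.55 for bytes copied, 0.52 for maximum
 residency, 0.51 for GC time and 0.85 for MUT time.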

 I don't know how, but I suspect that lifting `go` causes the GC to be less
 conservative about liveness of some closure objects. If I had to guess,
 then something keeps the closure of `go` alive longer than the growing
 `sat2` thunk in `go`. I played around with heap/retainer profiling, but to
 no avail so far.
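
 For readers who don't have the transformation in mind, here is a minimal
 source-level sketch of what lifting a local `go` means. This is a made-up
 example, not the benchmark program from this ticket; the names
 `sumStepsLocal`, `sumStepsLifted` and `goLifted` are hypothetical, and
 whether GHC actually allocates a closure for the local version depends on
 the optimisations that run.

```haskell
-- Hypothetical example; NOT the program measured in this ticket.

-- Before lifting: `go` is a local function with the free variables `step`
-- and `limit`. If it is not optimised away, it is represented by a closure
-- that captures those variables.
sumStepsLocal :: Int -> Int -> Int
sumStepsLocal step limit = go 0 0
  where
    go acc i
      | i > limit = acc
      | otherwise = go (acc + i) (i + step)

-- After lifting: the former free variables become extra parameters, so the
-- function can live at top level and needs no closure of its own.
goLifted :: Int -> Int -> Int -> Int -> Int
goLifted step limit acc i
  | i > limit = acc
  | otherwise = goLifted step limit (acc + i) (i + step)

sumStepsLifted :: Int -> Int -> Int
sumStepsLifted step limit = goLifted step limit 0 0

-- Both versions compute the same result; only the representation differs.
main :: IO ()
main = print (sumStepsLocal 3 100 == sumStepsLifted 3 100)
```

 The trade-off is that every call to the lifted function passes the former
 free variables explicitly instead of reading them from a closure.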

 Here is a gist with the `-S -G1` output:
 https://gist.github.com/sgraf812/5fbcf6b81fdd7c8af1a6060832bbfa11

 There are two interesting things to point out:

 1. The lifted version collects much more often, but only after completing
 the compute-intensive work. Not sure why there are so many collections;
 they seem redundant.
 2. Compared to the baseline, the residency (and consequently the total
 heap size, it seems) grows more slowly, but the increase in total bytes
 allocated leads to an additional collection before everything drops to
 constant space.

 Not sure what to make of that data, but it doesn't contradict what I said
 about the closure of `go` being kept alive longer than `sat2`.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:60
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler