
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  closed
        Priority:  normal            |            Milestone:  8.8.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:  fixed             |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |   Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):  Phab:D5224
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

I'm currently writing the evaluation chapter of the paper and hit quite a
bummer. If I add an option to ''ignore'' closure growth completely,
instructions improve by another 0.3% in the mean, with no regressions (!)
relative to the baseline in which we try to avoid closure growth.

Although total allocations may increase, executed instructions ''go down''
in some cases. The most prominent case is `paraffins`: 18.6% more
allocations, but 11.7% fewer executed instructions! Example run:

{{{
$ ./default 19 +RTS -s > /dev/null
     359,457,968 bytes allocated in the heap
     536,016,504 bytes copied during GC
     163,983,272 bytes maximum residency (8 sample(s))
       1,699,968 bytes maximum slop
             156 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       339 colls,     0 par    0.137s   0.137s     0.0004s    0.0025s
  Gen  1         8 colls,     0 par    0.386s   0.386s     0.0482s    0.1934s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.058s  (  0.058s elapsed)
  GC      time    0.523s  (  0.523s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.581s  (  0.581s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,243,240,498 bytes per MUT second

  Productivity   9.9% of total user, 9.9% of total elapsed

$ ./allow-cg 19 +RTS -s > /dev/null
     426,433,296 bytes allocated in the heap
     488,364,024 bytes copied during GC
     139,063,776 bytes maximum residency (8 sample(s))
       2,223,648 bytes maximum slop
             132 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       403 colls,     0 par    0.136s   0.136s     0.0003s    0.0010s
  Gen  1         8 colls,     0 par    0.317s   0.317s     0.0397s    0.1517s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.080s  (  0.080s elapsed)
  GC      time    0.453s  (  0.453s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.533s  (  0.533s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    5,359,023,067 bytes per MUT second

  Productivity  14.9% of total user, 14.9% of total elapsed
}}}

Note how allocations and the number of collections (these are correlated)
went up, but bytes copied and GC time (also correlated) went down. The only
possible conclusion here is that bytes allocated is a flawed metric for
predicting runtime: what matters most for runtime is bytes copied during
GC. Of course there's an overhead for heap checks etc., but it seems that
GC pressure far outweighs that.
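To make the tradeoff concrete, here is a minimal, made-up sketch (the names
and the program are hypothetical, not from `paraffins` or from the pass
itself) of what lifting a local function does and where closure growth can
come from:

{{{
-- Hypothetical illustration of lambda lifting and closure growth.

-- Before lifting: the local 'go' closes over 'a' and 'b', so a function
-- closure for 'go' is allocated on each call to 'f'.  The thunk for 'ys'
-- only needs to capture 'go' and 'xs'.
f :: Int -> Int -> [Int] -> [Int]
f a b xs =
  let go y = a * y + b      -- free variables: a, b
      ys   = map go xs      -- thunk capturing go, xs
  in  ys

-- After lifting: 'goLifted' is a top-level function that takes its former
-- free variables as extra arguments, so no closure is allocated for it.
-- In exchange, every closure that used to mention 'go' must now capture
-- 'a' and 'b' directly -- here the thunk for 'ys' captures a, b and xs
-- instead of go and xs.  Whether that is a net win or a loss in allocation
-- is what the closure-growth estimate tries to predict, and what the
-- experiment above deliberately ignores.
goLifted :: Int -> Int -> Int -> Int
goLifted a b y = a * y + b

fLifted :: Int -> Int -> [Int] -> [Int]
fLifted a b xs =
  let ys = map (goLifted a b) xs
  in  ys
}}}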
Interestingly, if I ''disable'' the GC by allocating 600MB of nursery, the
inverse effect manifests:

{{{
$ ./default 19 +RTS -s -A600M > /dev/null
     359,104,696 bytes allocated in the heap
           3,384 bytes copied during GC
          44,480 bytes maximum residency (1 sample(s))
          25,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.009s  (  0.009s elapsed)
  MUT     time    0.127s  (  0.127s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.136s  (  0.136s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,821,937,180 bytes per MUT second

  Productivity  93.4% of total user, 93.5% of total elapsed

$ ./allow-cg 19 +RTS -s -A600M > /dev/null
     426,014,360 bytes allocated in the heap
           3,384 bytes copied during GC
          44,480 bytes maximum residency (1 sample(s))
          25,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.011s  (  0.011s elapsed)
  MUT     time    0.142s  (  0.142s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.153s  (  0.153s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,994,250,656 bytes per MUT second

  Productivity  92.9% of total user, 92.9% of total elapsed
}}}

This is probably due to cache effects, although there still seem to be
strictly fewer instructions executed in the `allow-cg` case, which is
weird. Also, I don't think that enlarging the nursery to fit all
allocations is a realistic benchmark scenario.

The takeaway here is that lambda lifting somehow has beneficial effects on
GC, even if overall allocations go up and more collections happen as a
consequence.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:55>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler