
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  nfrisby
            Type:  feature request   |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763             |  Differential Rev(s):
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

It took me quite some time, but
[https://github.com/sgraf812/ghc/tree/c1f16ac245ca8f8c8452a5b3c1f116237adcb57...
this commit] passes `./validate` (modulo 4 compiler perf tests). Fixing the
testsuite was rather simple, but investigating the various performance
regressions to see which knobs we could turn is really time-consuming, so I
figured I'd better post now rather than never.

I updated the wiki page with a summary of the changes I made. For
completeness:

- A hopefully faithful rebase, removing the previous LNE (= join point)
  detection logic
- Activate all LLF flags (see the above llf-nr10-r6 configuration) by
  default
- Actually use the `-fllf-nonrec-lam-limit` setting
- Don't stabilise Unfoldings mentioning `makeStatic`
- Respect RULEs and Unfoldings in the identifiers we abstract over
  (previously, when SpecConstr added a RULE mentioning an otherwise absent
  specialised join point, we would ignore it, which is not in line with how
  CoreFVs works)
- Stabilise Unfoldings only when we lifted something out of a function
  (not doing so led to a huge regression in veritas' Edlib.lhs)

I'll attach the nofib results in a following post.
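As a refresher, here is a source-level sketch of what lambda lifting does. This is illustration only: GHC's late lift operates on Core/STG rather than on Haskell source, and the names (`sumWith`, `go`, `goLifted`) are made up for the example.

```haskell
module Main where

-- Before lifting: 'go' closes over the free variables 'a' and 'b',
-- so evaluating 'sumWith' allocates a closure for 'go'.
sumWith :: Int -> Int -> [Int] -> Int
sumWith a b xs = go xs
  where
    go []     = a + b
    go (y:ys) = y + go ys

-- After lifting: the free variables become extra parameters and the
-- function floats to the top level, so no closure is allocated for it.
goLifted :: Int -> Int -> [Int] -> Int
goLifted a b []     = a + b
goLifted a b (y:ys) = y + goLifted a b ys

sumWithLifted :: Int -> Int -> [Int] -> Int
sumWithLifted a b xs = goLifted a b xs

main :: IO ()
main = print (sumWith 1 2 [1..10], sumWithLifted 1 2 [1..10])
-- prints (58,58)
```

Note the trade-off: the lifted version passes `a` and `b` anew at every recursive call, which is exactly the concern about large arities and stack-passed arguments raised further down in this comment.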
Here's the summary:

{{{
--------------------------------------------------------------------------------
        Program           Allocs    Allocs    Instrs    Instrs
                          no-llf       llf    no-llf       llf
--------------------------------------------------------------------------------
            Min           -20.3%    -20.3%     -7.8%    -16.5%
            Max            +2.0%     +1.6%    +18.4%    +18.4%
 Geometric Mean            -0.4%     -1.0%     +0.3%     -0.0%
}}}

`llf` is a plain benchmark run, whereas `no-llf` means libraries compiled
with `-fllf` but benchmarks compiled with `-fno-llf`. This is a useful
baseline, as it allows us to detect test cases where the regression actually
happens in the test case rather than somewhere in the boot libraries.

Hardly surprisingly, allocations go down. More surprisingly, they don't do so
in a consistent fashion. The most illustrative test case is `real/pic`:

{{{
            Allocs    Allocs    Instrs    Instrs
            no-llf       llf    no-llf       llf
 pic         -0.0%     +1.0%     +0.0%     -3.4%
}}}

The lifting of some functions results in functions of rather big resulting
arity (6 and 7), which can no longer be fast-called. Apparently, there's no
`stg_ap_pppnn` variant matching the call pattern. Also, counted instructions
went up in some cases, so there's no real win to be had.

If I completely refrain from lifting non-recursive join points, things look
better wrt. counted instructions:

{{{
--------------------------------------------------------------------------------
        Program           Allocs    Allocs    Instrs    Instrs
                          no-llf       llf    no-llf       llf
--------------------------------------------------------------------------------
            Min           -20.3%    -20.3%     -3.4%    -17.1%
            Max            +2.0%     +1.6%     +6.4%     +6.4%
 Geometric Mean            -0.3%     -1.0%     +0.1%     -0.4%
}}}

But I recently questioned using cachegrind results (such as the very
relevant counted memory reads/writes) as a reliable metric (#15333).

There are some open things that should be measured:

- Is it worthwhile at all to lift join points? (Related: don't we rather
  want 'custom calling conventions' that inherit register/closure
  configurations to top-level bindings?)
- Isn't a reduction in allocations a lie when all we did is spill more onto
  the stack? Imagine we lift a (non-tail-recursive) function to top level
  that would have arity > 5.
  Arguments would have to be passed on the stack for each recursive call;
  I'd expect that to be worse than the status quo. So maybe don't just count
  the number of free ids we abstract over, but rather bound the resulting
  arity?

Finally, the whole transformation feels more like it belongs in the STG
layer: we very brittly anticipate CorePrep and have to pull really
low-level stuff into the analysis, all while having to preserve unfoldings
whenever we change anything. It seems like a very local optimisation
(except for enabling intra-module inlining opportunities) that doesn't
enable many other core2core optimisations (nor should it; that's why we
lift late).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:13>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler