
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  nfrisby
            Type:  feature request   |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763             |  Differential Rev(s):
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

It took me quite some time, but
[https://github.com/sgraf812/ghc/tree/c1f16ac245ca8f8c8452a5b3c1f116237adcb57...
this commit] passes `./validate` (modulo 4 compiler perf tests). Fixing the
testsuite was rather simple, but investigating the various performance
regressions to see which knobs we could turn is really time-consuming, so I
figured I'd better post now rather than never.

I updated the wiki page with a summary of the changes I made. For
completeness:

- A hopefully faithful rebase, removing the previous LNE (= join point)
  detection logic
- Activate all LLF flags (see the above llf-nr10-r6 configuration) by
  default
- Actually use the `-fllf-nonrec-lam-limit` setting
- Don't stabilise Unfoldings mentioning `makeStatic`
- Respect RULEs and Unfoldings in the identifiers we abstract over
  (previously, when SpecConstr added a RULE mentioning an otherwise absent
  specialised join point, we would ignore it, which is not in line with how
  CoreFVs works)
- Stabilise Unfoldings only when we lifted something out of a function
  (not doing so led to a huge regression in veritas' Edlib.lhs)

I'll attach the nofib results in a following post.
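As a refresher, here is a source-level sketch of what lambda lifting does. This is illustration only: GHC's late lift operates on Core/STG rather than on Haskell source, and the names (`sumWith`, `go`, `goLifted`) are made up for the example.

```haskell
module Main where

-- Before lifting: 'go' closes over the free variables 'a' and 'b',
-- so evaluating 'sumWith' allocates a closure for 'go'.
sumWith :: Int -> Int -> [Int] -> Int
sumWith a b xs = go xs
  where
    go []     = a + b
    go (y:ys) = y + go ys

-- After lifting: the free variables become extra parameters and the
-- function floats to the top level, so no closure is allocated for it.
goLifted :: Int -> Int -> [Int] -> Int
goLifted a b []     = a + b
goLifted a b (y:ys) = y + goLifted a b ys

sumWithLifted :: Int -> Int -> [Int] -> Int
sumWithLifted a b xs = goLifted a b xs

main :: IO ()
main = print (sumWith 1 2 [1..10], sumWithLifted 1 2 [1..10])
-- prints (58,58)
```

Note the trade-off: the lifted version passes `a` and `b` anew at every recursive call, which is exactly the concern about large arities and stack-passed arguments raised further down in this comment.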
Here's the summary:

{{{
--------------------------------------------------------------------------------
        Program           Allocs    Allocs    Instrs    Instrs
                          no-llf       llf    no-llf       llf
--------------------------------------------------------------------------------
            Min           -20.3%    -20.3%     -7.8%    -16.5%
            Max            +2.0%     +1.6%    +18.4%    +18.4%
 Geometric Mean            -0.4%     -1.0%     +0.3%     -0.0%
}}}

`llf` is a plain benchmark run, whereas `no-llf` means libraries compiled
with `-fllf` but benchmarks compiled with `-fno-llf`. This is a useful
baseline, as it allows us to detect test cases where the regression actually
happens in the test case rather than somewhere in the boot libraries.

Hardly surprisingly, allocations go down. More surprisingly, they don't do so
in a consistent fashion. The most illustrative test case is `real/pic`:

{{{
            Allocs    Allocs    Instrs    Instrs
            no-llf       llf    no-llf       llf
 pic         -0.0%     +1.0%     +0.0%     -3.4%
}}}

The lifting of some functions results in functions of rather big resulting
arity (6 and 7), which can no longer be fast-called. Apparently, there's no
`stg_ap_pppnn` variant matching the call pattern. Also, counted instructions
went up in some cases, so there's no real win to be had.

If I completely refrain from lifting non-recursive join points, things look
better wrt. counted instructions:

{{{
--------------------------------------------------------------------------------
        Program           Allocs    Allocs    Instrs    Instrs
                          no-llf       llf    no-llf       llf
--------------------------------------------------------------------------------
            Min           -20.3%    -20.3%     -3.4%    -17.1%
            Max            +2.0%     +1.6%     +6.4%     +6.4%
 Geometric Mean            -0.3%     -1.0%     +0.1%     -0.4%
}}}

But I recently questioned using cachegrind results (such as the very
relevant counted memory reads/writes) as a reliable metric (#15333).

There are some open things that should be measured:

- Is it worthwhile at all to lift join points? (Related: don't we rather
  want 'custom calling conventions' that inherit register/closure
  configurations to top-level bindings?)
- Isn't a reduction in allocations a lie when all we did is spill more onto
  the stack? Imagine we lift a (non-tail-recursive) function to top level
  that would have arity > 5.
  Arguments would have to be passed on the stack for each recursive call;
  I'd expect that to be worse than the status quo. So maybe don't just count
  the number of free ids we abstract over, but rather bound the resulting
  arity?

Finally, the whole transformation feels more like it belongs in the STG
layer: we very brittly anticipate CorePrep and have to pull really
low-level stuff into the analysis, all while having to preserve unfoldings
whenever we change anything. It seems like a very local optimisation
(except for enabling intra-module inlining opportunities) that doesn't
enable many other core2core optimisations (nor should it; that's why we
lift late).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:13>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler