[GHC] #9476: Implement late lambda-lifting

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Keywords: | Operating System: Architecture: Unknown/Multiple | Unknown/Multiple Difficulty: Unknown | Type of failure: Blocked By: | None/Unknown Related Tickets: | Test Case: | Blocking: | Differential Revisions: -------------------------------------+------------------------------------- This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation. * Wiki page: [wiki:LateLamLift]. All the details are here. * Branch in GHC repo: `wip/llf` * Related tickets * #9279: local closure remains in final program * #9374: static argument transformation -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature | Status: new request | Milestone: Priority: normal | Version: 7.8.2 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Unknown Unknown/Multiple | Blocked By: Type of failure: | Related Tickets: None/Unknown | Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by simonpj): * owner: => nfrisby -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature | Status: new request | Milestone: Priority: normal | Version: 7.8.2 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Unknown Unknown/Multiple | Blocked By: Type of failure: | Related Tickets: None/Unknown | Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by simonpj): Thoughts about the wiki page (which is a fantastic start): * The "extended consequences" section is great. But could you give a small example of each point? Otherwise it is hard to grok. There are lots of other places where an example would be extremely useful. To take just one "If an LNE function f occurs in another LNE function g and we only lift g, then it will spoil f: and it will no longer be an LNE, because it will now be an argument to llf_g." * I recall that you implemented a more-or-less complex analysis to try to get the good consequences without the bad ones. Is it worth sketching what the analysis does? The complexity here is my principal worry about the whole thing. * Small point: the `$(theRHS 'a 'b)` notation is more distracting than helpful. I'd use something more informal `(..a...b...)`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by simonpj: Old description:
This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation.
* Wiki page: [wiki:LateLamLift]. All the details are here.
* Branch in GHC repo: `wip/llf`
* Related tickets * #9279: local closure remains in final program * #9374: static argument transformation
New description: This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation. * Wiki page: [wiki:LateLamLift]. All the details are here. * Branch in GHC repo: `wip/llf` * Related tickets * #9279: local closure remains in final program * #9374: static argument transformation * #5945 * #11284 -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * cc: bgamari (added) Old description:
This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation.
* Wiki page: [wiki:LateLamLift]. All the details are here.
* Branch in GHC repo: `wip/llf`
* Related tickets * #9279: local closure remains in final program * #9374: static argument transformation * #5945 * #11284
New description: This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation. * Wiki page: [wiki:LateLamLift]. All the details are here. * Branch in GHC repo: `wip/llf` * Related tickets * #9279: local closure remains in final program * #9374: static argument transformation -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation.
* Wiki page: [wiki:LateLamLift]. All the details are here.
* Branch in GHC repo: `wip/llf`
* Related tickets * #9279: local closure remains in final program * #9374: static argument transformation
New description: This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation. * Wiki page: [wiki:LateLamLift]. All the details are here. * Branch in GHC repo: `wip/llf` * Related tickets * #9279: local closure remains in final program * #9374: static argument transformation * #5945 * #11284 -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by bgamari): * wikipage: => LateLamLift -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Description changed by simonpj: Old description:
This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation.
* Wiki page: [wiki:LateLamLift]. All the details are here.
* Branch in GHC repo: `wip/llf`
* Related tickets * #9279: local closure remains in final program * #9374: static argument transformation * #5945 * #11284
New description: This ticket exists to coordinate work on Nick Frisby's '''late lambda- lifting''' optimisation. * Wiki page: [wiki:LateLamLift]. All the details are here. * Branch in GHC repo: `wip/llf` * Related tickets * #9279: local closure remains in final program * #9374: static argument transformation * #5945 (closed as dup, but with useful example) * #11284 (closed as dup, but with useful example) -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by simonpj): * cc: kavon (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * related: => #8763 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * cc: sgraf (added) Comment: I rebased Nicholas Frisby's work here: https://github.com/sgraf812/ghc/tree/llf. It doesn't currently pass the testsuite (even in default mode, not doing any llf), probably because I introduced some errors while merging some decision code and taking care of join point (then still running by let-no- escape) information. In particular, the devel2 version triggers an ASSERT in `add_id`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): OK, great. Do keep us posted. I think there are real wins to be had here. Nicholas's detailed wiki page identifies many of the issues very clearly. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): It took me quite some time, but [https://github.com/sgraf812/ghc/tree/c1f16ac245ca8f8c8452a5b3c1f116237adcb57... this commit] passes `./validate` (modulo 4 compiler perf tests). Fixing the testsuite was rather simple, but investigating various performance regressions to see which knobs we could turn is really time consuming, so I figured I better post now than never. I updated the wiki page with a summary of changes I made. For completeness: - A hopefully faithful rebase, removing previous LNE (= join point) detection logic - Activate all LLF flags (see the above llf-nr10-r6 configuration) by default - Actually use the `-fllf-nonrec-lam-limit` setting - Don't stabilise Unfoldings mentioning `makeStatic` - Respect RULEs and Unfoldings in the identifier we abstract over (previously, when SpecConstr added a RULE mentioning an otherwise absent specialised join point, we would ignore it, which is not in line with how CoreFVs works) - Stabilise Unfoldings only when we lifted something out of a function (Not doing so led to a huge regression in veritas' Edlib.lhs) I'll attach nofib results in a following post. Here's the summary: {{{ Program Allocs Allocs Instrs Instrs no-llf llf no-llf llf -------------------------------------------------------------------------------- Min -20.3% -20.3% -7.8% -16.5% Max +2.0% +1.6% +18.4% +18.4% Geometric Mean -0.4% -1.0% +0.3% -0.0% }}} `llf` is a plain benchmark run, whereas `no-llf` means libraries compiled with `-fllf`, but benchmarks compiled with `-fno-llf`. This is a useful baseline, as it allows to detect test cases where the regression actually happens in the test case rather than somewhere in the boot libraries. Hardly surprising, allocations go down. More surprisingly, not in a consistent fashion. The most illustrating test case is `real/pic`: {{{ no-llf llf pic -0.0% +1.0% +0.0% -3.4% }}} The lifting of some functions results in functions of rather big result arity (6 and 7), which no longer can be fast called. Appearently, there's no `stg_ap_pppnn` variant matching the call pattern. Also, counted instructions went up in some cases, so that there's no real win to be had. If I completely refrain from lifting non-recursive join points, things look better wrt. to counted instructions: {{{ Program Allocs Allocs Instrs Instrs no-llf llf no-llf llf -------------------------------------------------------------------------------- Min -20.3% -20.3% -3.4% -17.1% Max +2.0% +1.6% +6.4% +6.4% Geometric Mean -0.3% -1.0% +0.1% -0.4% }}} But I recently questioned using cachegrind results (such as the very relevant counted memory reads/writes) as a reliable metric (#15333). There are some open things that should be measured: - Is it worthwhile at all to lift join points? (Related: don't we rather want 'custom calling conventions' that inherits register/closure configurations to top-level bindings?) - Isn't a reduction in allocations a lie when all we did is spill more on to the stack? Imagine we lift a (non-tail-recursive) function to top-level that would have arity > 5. Arguments would have to be passed on the stack, for each recursive call. I'd expect that to be worse than the status quo. So maybe don't just count the number of free ids we abstract over, but rather bound the resulting arity? Finally, the whole transformation feels more like it belongs in the STG layer: We very brittly anticipate CorePrep and have to pull in really low- level stuff into the analysis, all while having to preserve unfoldings when we change anything. Seems like a very local optimization (except for enabling intra-module inlining opportunities) that doesn't enable many other core2core optimizations (nor should it, that's why we lift late). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): [http://pasted.co/f407cea9 plain base, no-llf, llf] [http://pasted.co/62927f9b same without lifting non-rec join points] -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I'm currently thinking about Pros and Cons involved with (re-)implementing this as an optimization on STG rather than Core. Pros: - Much less involved analysis that doesn't need to stay in sync with CorePrep - LLF only enables intra-module opportunities in the simplifier anyway - Runtime arguments are immediately visible - Lifting to top-level only is orthogonal enough to full laziness that it might be worthwhile to split anyway - Integrating more low-level information might be more feasible (like calling convention, e.g. number of argument registers, or `stg_ap_*` patterns) Cons: - The Simplifier might not be able to inline as much (which probably had limited effect anyway) - It's not enough to look at Core alone to gauge allocation (quite a bummer) - No Core Lint (but STG Lint, still) - Potentially we need to re-invent some utility functions we already have for Core - Perform the lifting by hand, rather than let `FloatOut` do the work I'm really interested to hear opinions/objections about this! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): Thoughts * There are a handful of spectacular reductions in allocation (queens, n-body). It'd be good to understand and explain them. Perhaps we can more closely target LLF on those cases. * I don't think we should float join points at all, recursive or non- recursive. Think of them like labels in a control-flow graph. * I think of LLF as a code-generation strategy, that we do once all other transformations are done. (Lambda-lifting ''can'' affect earlier optimisations. It can make a big function into a small one (by floating out its guts), and thereby let it be inlined. But that is subtle and difficult to get consistent gains for. Let's not complicate LLF by thinking about this.) * Given that it's a code-gen strategy, doing it on STG makes perfect sense to me. You've outlined the pros and cons well. Definitely worth a try. I'm not sure what you meant by "It's not enough to look at Core alone to gauge allocation" as a disadvantage. When you say "Much less involved analysis that doesn't need to stay in sync with CorePrep", I think it would be v helpful to lay out "the analysis". I have vague memories, but I don't know what this "stay in sync" stuff is about. If you do it in STG you don't need to explain "stay in sync", but explaining the analysis would be excellent. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Replying to [comment:16 simonpj]:
Thoughts
* There are a handful of spectacular reductions in allocation (queens, n-body). It'd be good to understand and explain them. Perhaps we can more closely target LLF on those cases.
It's hard to tell for `n-body`: The `-fno-llf` variant shows the same
reduction in allocations, meaning that the reduction must happen somewhere
in the base libraries rather than in `n-body` itself.
`queens` has its `go1` function within `gen` lifted (arity 1 with 3 free
vars). Non-lifted Core after CorePrep:
{{{
let {
n_s5S3 [Occ=OnceL*] :: [[GHC.Types.Int]]
[LclId]
n_s5S3 = go_s5RY ys_s5S2 } in
case GHC.Prim.># 1# ww1_s5RX of {
__DEFAULT ->
letrec {
go1_s5S5 [Occ=LoopBreaker]
:: GHC.Prim.Int# -> [[GHC.Types.Int]]
[LclId, Arity=1, Str=, Unf=OtherCon []]
go1_s5S5
= \ (x1_s5S6 :: GHC.Prim.Int#) ->
case Main.main_$ssafe y_s5S1 1# x1_s5S6 of {
GHC.Types.False ->
case GHC.Prim.==# x1_s5S6 ww1_s5RX of {
__DEFAULT ->
case GHC.Prim.+# x1_s5S6 1#
of sat_s5S9 [Occ=Once]
{ __DEFAULT ->
go1_s5S5 sat_s5S9
};
1# -> n_s5S3
};
GHC.Types.True ->
let {
sat_s5Se [Occ=Once] :: [[GHC.Types.Int]]
[LclId]
sat_s5Se
= case GHC.Prim.==# x1_s5S6 ww1_s5RX
of {
__DEFAULT ->
case GHC.Prim.+# x1_s5S6 1#
of sat_s5Sd [Occ=Once]
{ __DEFAULT ->
go1_s5S5 sat_s5Sd
};
1# -> n_s5S3
} } in
let {
sat_s5Sa [Occ=Once] :: GHC.Types.Int
[LclId]
sat_s5Sa = GHC.Types.I# x1_s5S6 } in
let {
sat_s5Sb [Occ=Once] :: [GHC.Types.Int]
[LclId]
sat_s5Sb
= GHC.Types.:
@ GHC.Types.Int sat_s5Sa y_s5S1 } in
GHC.Types.:
@ [GHC.Types.Int] sat_s5Sb sat_s5Se
}; } in
go1_s5S5 1#;
1# -> n_s5S3
}
}}}
And when `go1` is lifted to top-level, the CorePrep'd call site changes to
{{{
case GHC.Prim.># 1# ww1_s5Ts of {
__DEFAULT ->
let {
sat_s5Tz [Occ=Once, Dmd=
* I don't think we should float join points at all, recursive or non- recursive. Think of them like labels in a control-flow graph.
* I think of LLF as a code-generation strategy, that we do once all other transformations are done. (Lambda-lifting ''can'' affect earlier optimisations. It can make a big function into a small one (by floating out its guts), and thereby let it be inlined. But that is subtle and difficult to get consistent gains for. Let's not complicate LLF by
* Given that it's a code-gen strategy, doing it on STG makes perfect sense to me. You've outlined the pros and cons well. Definitely worth a
This was also what I thought, but there can be synergetic effects with the inliner if we manage to "outline" (that's what the transformation would be called in an imperative language) huge non-recursive join points, where the call overhead is neglibigle. At least that's what the wiki page mentions. But that brings me to the next point... thinking about this.) Yes, I wasn't sure if having it play nice with the simplifier was intended here. I take comfort in your confirmation seeing it as a code-gen strategy. try. Will do.
I'm not sure what you meant by "It's not enough to look at Core alone to gauge allocation" as a disadvantage.
Maybe it's just me only slowly understanding every layer of compilation in GHC, but I felt like I could have this mental model of roughly where and how much things allocate after CorePrep, but that's probably an illusion considering things like let-no-escapes (that story got better with join points, I suppose). Having another pass transforming heap allocation into stack allocation *after* CorePrep isn't exactly helping that intuition. Or were you thrown off by me using words I maybe mistranslated ("gauge" for "estimate", roughly)?
When you say "Much less involved analysis that doesn't need to stay in sync with CorePrep", I think it would be v helpful to lay out "the analysis". I have vague memories, but I don't know what this "stay in sync" stuff is about.
There is [https://github.com/sgraf812/ghc/blob/f5c160c98830fdba83faa9a0634eeab38dbe04d... this note] that explains why it's necessary to approximate CorePrep. What follows is [https://github.com/sgraf812/ghc/blob/f5c160c98830fdba83faa9a0634eeab38dbe04d... this new analysis] that interleaves a free variable analysis with computing use information for id's which are free in the floating bindings.
If you do it in STG you don't need to explain "stay in sync", but explaining the analysis would be excellent.
Yes! I'm not really sure I understand all of it yet. (Re-)implementing it on STG should help me figure things out. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
A late demand analysis run would probably detect the same.
-O2 does late demand analysis for this very reason. So presumably it doesn't catch it.
"It's not enough to look at Core alone to gauge allocation"
I agree that STG has a stronger operational model than Core. (Core is not bad, but STG is better.) But that's an ''advantage'' of doing LLF in STG, not a ''disadvantage''. Do have a go! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I reimplemented this as an STG pass and just rebased it onto a recent 8.6 alpha commit. You can track the progress on my [https://github.com/sgraf812/ghc/tree/22ce14be00f73b63c24e584dda1ad5e1252e57b... `lift-lambda` branch, of which this is the current HEAD]. It passes `./validate`, but due to nofib's `lambda` benchmark not made fit for the next step of the MonadFail proposal, we can't run benchmarks. I ran some benchmarks before the rebase, which was relative to [https://github.com/ghc/ghc/tree/bb539cfe335e this commit]. Here are the numbers, eliding some uninteresting ones: {{{ -------------------------------------------------------------------------------- Program Allocs Instrs -------------------------------------------------------------------------------- VSD -0.3% +0.0% VSM 0.0% -0.1% atom -0.8% -1.2% binary-trees -0.0% -1.5% circsim -1.0% -0.6% clausify -1.9% -0.2% constraints -0.9% -0.7% cryptarithm1 -2.8% -8.0% cryptarithm2 -4.0% -2.9% exact-reals -2.1% -0.1% fft2 -1.0% -0.4% fluid -0.9% -0.6% hidden -1.0% -0.6% k-nucleotide -0.0% +2.4% lcss -0.1% -1.3% mate -3.2% -0.6% mkhprog -1.3% +1.0% power -0.7% -1.0% puzzle -0.0% +0.0% queens -17.7% -0.9% reptile -0.3% -17.1% rewrite -1.7% -0.1% typecheck -2.7% -1.9% -------------------------------------------------------------------------------- Min -17.7% -17.1% Max 0.0% +4.2% Geometric Mean -0.5% -0.3% }}} So, no regressions in allocation (which I'd consider a bug) and some small gains in general. Note that because of ticket:15333#comment:2 I'm not really trusting many of the regressions in counted instructions. I'd test with the proposed `-fllvm -optlo -Os` configuration, but that will have to wait until nofib is fixed. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: nfrisby Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * related: #8763 => #8763 #13286 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * owner: nfrisby => sgraf -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by simonpj): * keywords: => LateLamLift -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Over the last week, I polished this some more and played around with heuristics regarding known and unknown calls. I'm pretty confident the transformation can handle every idea outlined in Wiki:LateLamLift. Here are is an excerpt from a benchmarks run: {{{ -------------------------------------------------------------------------------- Program Allocs Instrs -------------------------------------------------------------------------------- cryptarithm1 -2.8% -7.9% cryptarithm2 -4.0% -2.4% exact-reals -2.1% -0.0% k-nucleotide -0.0% +2.4% kahan -0.4% -2.0% lcss -0.1% -5.8% mate -8.4% -3.5% n-body -20.2% -0.0% queens -17.7% -0.8% typecheck -2.7% -1.8% -------------------------------------------------------------------------------- Min -20.2% -7.9% Max +0.0% +2.4% Geometric Mean -0.8% -0.3% }}} I had to fight with a surprising amount of benchmark wibbles (+-12% in instructions, c.f. #15333) related to GC parameters and code layout (probably), although I was using cachegrind for measurements. The above numbers are from running NoFib with `make EXTRA_RUNTEST_OPTS='-cachegrind +RTS -V0 -A128M -H1G -RTS' NoFibRuns=1`. I looked at the regression in `k-nucleotide` (+2.4% counted instructions), but couldn't reproduce so far. I'll probably refactor the one huge module into 2 or 3 smaller ones and then prepare a patch. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): This sounds great, thank you! There are some knobs you can turn in LLF, aren't there. E.g. if a function has 200 free vars, we probably don't want to lambda-lift it an turn it into a function with 200 parameters. Did you experiment with different settings of the threshold for number-of- free-vars? There may be more knobs too. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I think Nicolas played around with those parameters, concluding that this wiki:LateLamLift#llf-nr10-r6 was the best configuration. I played around with settings like the number of vars to abstract over when I was still testdriving the Core transformation. IIRC, the desired cutoff was 5 arguments in total, because any excess arguments are passed on the stack. But now that the transformation works reliably, I think I can play around with it some more again. There are some knobs to turn: There's `-fstg-lift-lams-args(-any)` (default: number of available registers for argument passing, currently hard-coded to 5) which takes an upper bound for the number of arguments the function may have ''after'' lifting. Also no distinction if the function is recursive or not (yet?). As well as `-f(no-)stg-lift-lams-known` (default: no) which controls if we are willing to turn known calls into unknown calls. This happens when we lift a function that abstracts over a known function: {{{ let f x = ... mapF xs = ... f x ... in mapF [1,2,3] ==> mapF_up f xs = ... f x ... let f x = ... in mapF_up f [1,2,3] }}} Lifting `mapF` turns the call to `f` into an unknown call. Wrt. the other heuristics I employ, peek here if you're interested: https://github.com/sgraf812/ghc/blob/1e45f1fe3133a263694d05d01b84a245d424409... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): OK, it sounds as if you are exploring the space really well -- thanks -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "unknown.txt" added. allow turning known into unknown calls -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "rec-5-6-8-10.txt" added. Varying max number of recursive parameters -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "nonrec-5-6-8-10.txt" added. Va -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "8-8.txt" added. Allowing max 8 (non-)recursive parameters -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: sgraf
Type: feature request | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.8.2
Resolution: | Keywords: LateLamLift
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #8763 #13286 | Differential Rev(s):
Wiki Page: LateLamLift |
-------------------------------------+-------------------------------------
Comment (by sgraf):
I investigated various variations of the configuration that intuitively
should give the best results. Specifically, I played with 3 parameters:
1. `-fstg-lift-lams-known`: Allow turning known into unknown calls.
Default: Don't allow.
Imagine we lift `f` in the following program:
{{{
let g x = ...
f y = ... g y ...
in f 2
==>
f g y = ... g y ...;
let g x = ...
in f g 2
}}}
This turns the known call to `g` within `f`s body into an unknown one
(let's call it ''anonymisation'' for the sake of this post), which needs
to go through one of the generic apply functions. My gut tells me this
probably isn't worth taking the risk: There's the (very small) chance,
that when there's no matching slow call pattern, this will turn into a
very slow call, which, from my understanding, would allocate.
2. `-fstg-lift-lams-rec-args <n>`: Allow lifting as long as the resulting
recursive function has at most n arguments. Default: 5, the number of
available registers.
Excess arguments are passed on the stack, which would mean the same
slow memory accesses we try to avoid. Still, allocating on the stack vs.
on the heap might be favorable, but I'm not sure how passing arguments on
the stack plays with (non-tail) recursion, e.g. would passing arguments on
the stack mean we had to pass the same, static args all over again for a
nested recursive activation?
Anyway, I measured.
3. `-fstg-lift-lams-nonrec-args <n>`: Allow lifting as long as the
resulting non-recursive function has at most n arguments. Default: 5.
Lifting non-recursive functions could have different effects than
lifting recursive ones, because a) there's no recursive calls, we pay call
overhead only once b) they are probably huge enough that call overhead is
neglible.
I'll abbreviate each configuration I tested by a triple
`{t,f}-<nrec>-<nnonrec>`, so the (current) default parameter would be
`f-5-5`.
- `t-5-5` -- or: allow turning known call into unknown calls. Total mean
changes: -0.0% allocs, -0.0% counted instructions
Numbers in attachment:unknown.txt. No regressions in allocations (that's
what we have the cost model for, after all), with two benchmarks standing
out:
* `rewrite`: -1.5% allocs, -0.1% counted instructions. Changes were
somewhere in `base`, so didn't bother to investigate further
* `mate`: -0.9% allocs, -0.1% counted instructions. This one lifted
recursive functions of the form
{{{
let {
$warg_scDq [InlPrag=NOUSERINLINE[2],
Occ=OnceL*!,
Dmd=, Unf=OtherCon []] =
sat-only [go_scKg $warg_scDq] \r [ds_scKh]
case ds_scKh of {
[] -> [] [];
: y_scKj [Occ=Once!] ys_scKk [Occ=Once] ->
case y_scKj of {
(,) ww3_scKm [Occ=Once] ww4_scKn [Occ=Once] ->
let {
sat_scKo [Occ=Once] :: [(Move.MoveInFull,
Board.Board)]
[LclId] =
[ys_scKk go_scKg] \u [] go_scKg ys_scKk;
} in $warg_scDq ww3_scKm ww4_scKn sat_scKo;
};
};
} in go_scKg ww_scCR;
}}}
Which is exactly the kind of lift that I tought we don't want to make:
lifting `go` to top-level will result in abstracting over `$warg`, which
will turn the known call into an unknown one. Perhaps this is only
beneficial because the unknown call isn't on the hot path.
- `f-<n>-5`: This varied max number of recursive args between 5 and 10
(attachment:rec-5-6-8-10.txt).
Allowing 6 parameters lifted some more functions, 8 parameters didn't do
anything more than 6 and 10 parameters did another influential lift (-7.7%
allocs in `mandel2`, but +0.3% in ci). The only real beneficial lift here
was in `fibheaps`, happening with n >= 6 (-0.5% on both allocs and ci).
The rest seems to be just noise.
So, what happened in `fibheaps`? It seems two recursive functions were
lifted, both taking 6 arguments. Ah, but one of them (the last, in
particular) is a 'void' parameter (so, slow call pattern pppppv), which is
completely erased from the resulting Cmm! ... the tip of my branch should
allow the lift here now.
- `f-5-<n>`: This varied max number of non-recursive args between 5 and 10
(attachment:nonrec-5-6-8-10.
Allowing up to 8 parameters had great effect on allocations in
`cichelli` (-10.4%), while also improving counted instructions negligibly
(-0.0%). Allowing 10 parameters also had a tiny effect on `simple` (-0.9%
allocs, -0.1%). Codegen for both benchmarks reveals that the changes hide
somewhere in `base`, so I'm not investigating further at the moment, seems
like it's not worth the time.
- `f-8-8`: To test the interaction of both tuning parameters. No
surprising results: attachment:8-8.txt (the baseline doesn't use the
`fibheaps` opportunity which is now optimised in `f-5-5`)
I didn't bother evaluating the combination of allowing anonymisation of
calls with the max bound on parameters, because I think they are largely
independent of another.
Should I evaluate more variants, like allowing to lift undersaturated
applications (are there even any in STG? Thought not, but then there's
`satCallsOnly`)? I don't think these would be worthwhile, except when the
resulting PAP is on a cold path...
-----
So, here's a **TLDR** some questions:
1. Should we allow anonymisation (see 1) above)? I don't see big wins
(configuration `t-5-5`), and don't really understand why there are any at
all, which probably is the bigger problem here.
2. I think the `f-5-5` configuration is a good default. Can you think of
any other parameters I could vary? Looking at the Wiki page, I think these
are the other ones Nicolas evaluated (which was on Core, which is a lot
less explicit about these things):
i. Abstracting over/lifting LNEs: That's a nono, as the last few years
showed
ii. Abstract Undersaturated applciations: As I wrote above, I don't
think these are worthwhile, because they only shift allocation to call
sites via PAPs (and also they aren't trivially to implement because of
STGs ANF guarantees)
iii. Abstract (over-)saturated applications: That's a no brainer in my
eyes. Oversats are just regular calls returning something unknown we then
call. If the known call is worthwhile to lift, just do it; the unknown
call won't change a bit.
iv. Create PAPs: That's a special case of ii. I think, or at least this
is the case that would entail a special case in the transformation because
of STG only allows atoms as arguments.
v. Use strictness: Seems to be related to anticipating CorePrep, which
fortunately we don't have to any longer.
vi. Use one-shot info: I'm not entirely sure what this means either.
Lifting would spoil one-shot info, but I don't think this has any
operational consequences, because one-shot info was already used to
identify single-entry thunks.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:27
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
-fstg-lift-lams-known: Allow turning known into unknown calls.
In your example, you say that we might float f but not g. But why don't we float g? Anyway, I agree; it is Very Bad to turn known into unknown calls. Let's not do that. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
Ah, but one of them (the last, in particular) is a 'void' parameter (so, slow call pattern pppppv), which is completely erased from the resulting Cmm!
OK, so the conclusion is: don't count void args in the count? I agree! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): The non-recursive case
f-5-<n>: This varied max number of non-recursive args between 5 and 10 (attachment:nonrec-5-6-8-10. Allowing up to 8 parameters had great effect on allocations in cichelli (-10.4%),
Hmm. Consider {{{ let f x = e in ...(f e1)...(f e2)... }}} where `e` has a lot of free vars `y1` .. `y20`. If we lift we get a top- level `f` with 21 args, and the calls look like: {{{ ...(f y1 .. y20 e1) ... (f y1 .. y20 e2)... }}} Now * In the original form we store `y1`..`y20` into the function closure for `f`, once. * In the lifted form we store (most of) `y1`..`y20` on the stack, once for each call. So if we had a lot of calls, there's be more memory traffic. Plus of course, any case-expression evals would need to save `y1`..`y20` on the stack, and the reload them to make the call... I was thinking that maybe for non-rec functions we could claim that lifting was a win regardless of the number of parameters, but I don't think that's true. So it's a bit surprising to me that even with a threshold of 10 we have such a small hit on instruction count. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
iii) Abstract (over-)saturated applications:
I agree. Let's do it.
vi> Use one-shot info:
I took a quick look at Nick's code. It seemed to be to do with NOT floating a binding that looked like {{{ f = \x[oneshot]. blah }}} I'm not sure why - and I think it'd irrelevant by the time we get to STG. So yes, ignore this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): * Would you like to publish figures for your best option (`-f-5-5`, don't count void args, float over-sat apps)? * What happens to binary sizes? * I think you should write a paper! It's very easy to forget all the things that are paged into your head at the moment. Get them written down before you forget them! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Replying to [comment:28 simonpj]:
-fstg-lift-lams-known: Allow turning known into unknown calls.
In your example, you say that we might float `f` but not `g`. But why don't we float `g`?
Because at the time we handled `g`, it (hypothetically) was deemed non- beneficial to lift. It might lead to more allocation or hurt some other heuristic. Of course one could argue that lifting ''both'' bindings could be a win. Our decision strategy is greedy in that regard; There's no backtracking involved. We transform expressions going from outer to inner, at each step taking into account the lifting decisions we already made (and which we don't question or undo). So, for that example in particular, the assumption was that we examined `g`, decided not to lift it and then look at `f` knowing that `g` won't be lifted.
lifting go to top-level will result in abstracting over `$warg`,
Why didn't we float `$warg`?
Same reasoning. Because we decided earlier that the lift might not be worthwhile. I could look at some debug output to see what was the reason(s) for `$warg` if you are really interested. Replying to [comment:29 simonpj]:
Ah, but one of them (the last, in particular) is a 'void' parameter (so, slow call pattern pppppv), which is completely erased from the resulting Cmm!
OK, so the conclusion is: don't count void args when counting args and comparing with the threshold? I agree!
Yes, exactly. See https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9de... Replying to [comment:30 simonpj]:
The non-recursive case
So it's a bit surprising to me that even with a threshold of 10 we have such a small hit on instruction count.
I was wondering the same thing. I ''think'' the other heuristics (https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9de..., for completeness) fire way too often to allow non-recursive calls with 20 arguments. Consider the impact on allocations: When we lift such a function, all 20ish free variables would have to be made available at all call sites, leading to a massive increase in closure allocation for closures around the actual call sites. Example {{{ let f = [x1...x20] \[a b c] -> ... g = [f y] \z -> f y y z in map g [1,2,3] }}} Assuming for the sake of the example that `g` wouldn't be lifted to top- level itself (it's just a PAP of `f`), lifting `f` would just shift the whole closure into `g`'s closure. In general, there could be arbitrary more closures between `f`s binding site and its call sites, each of which would have to carry `f`s former free variables, should we decide to lift it. I could try and measure with the allocation heuristic and the max args heuristic turned off, to see if we get more chaotic results. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "f-5-5.txt" added. Don't turn known into unknown calls, allow max 5 (non-)recursive non-void parameters after lifting, don't lift things with undersaturated applications or partial applications -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Replying to [comment:32 simonpj]:
* Would you like to publish figures for your best option (`-f-5-5`, don't count void args, float over-sat apps)?
I'm currently re-running benchmarks with the new void args thing, but I suspect that the results from comment:23 are pretty much the same, except for `fibheaps`. All results with at least 1% change: {{{ -------------------------------------------------------------------------------- Program Allocs Instrs -------------------------------------------------------------------------------- anna -1.1% -0.4% atom -0.8% -1.1% circsim -1.0% -0.7% clausify -1.9% -0.4% cryptarithm1 -2.8% -7.9% cryptarithm2 -4.0% -2.4% exact-reals -2.1% -0.0% expert -1.0% -0.0% fft2 -1.0% -0.3% fibheaps -1.4% -0.7% fluid -1.5% -0.6% hidden -1.0% -0.6% infer -0.6% -1.1% k-nucleotide -0.0% +2.4% kahan -0.4% -2.0% lcss -0.1% -5.8% mate -8.4% -3.5% mkhprog -1.3% -0.1% n-body -20.2% -0.0% nucleic2 -1.0% -0.1% queens -17.7% -0.8% typecheck -2.7% -1.8% -------------------------------------------------------------------------------- Min -20.2% -7.9% Max +0.0% +2.4% Geometric Mean -0.8% -0.3% }}}
* What happens to binary sizes?
Good question, given that we push work from binding sites to call sites. It seems they consistently went down by 0.1%. I wonder why? Maybe some heap checks somewhere in `base`? Or is it just a wibble? Maybe I should re-evaluate my baseline...
* I think you should write a paper! It's very easy to forget all the
things that are paged into your head at the moment. Get them written down before you forget them! Yes, I'd love to! In fact, I started something at https://github.com/sgraf812/late-lam-lift. It's a bit rough and only has a section about the decision heuristics at the moment, but I'll post when there's something you could have a look at. In fact, I'd very much like it if Nicolas and/or you would co-author that with me, as I haven't really written a paper before. I think this would be an excellent exercise. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
I could try and measure with the allocation heuristic ... turned off, to see if we get more chaotic results.
What is the "allocation heuristic"? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): The allocation heuristic (https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9de...) tries to estimate closure growth and only allows the lift if it can guarantee that there are no regressions. I think this is quite essential to identify beneficial lifts, as well as non-trivial to implement. The whole `Analysis` module linked above mostly revolves around computing this `costToLift` https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9de.... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
I think this is quite essential to identify beneficial lifts,
Oh, OK. If I was reading the paper I'd ask "which bits of this analysis really make a difference"? Eg. * If I switched it off altogether, what happens? * If I switch off pieces of it, what happens? We've discussed some of these pieces above, but not all, I think. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
Yes, I'd love to! In fact, I started something at ​https://github.com/sgraf812/late-lam-lift.
Re paper: yes I'd be happy to co-author. Please yell when you think it's ready for a serious look. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:38 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): Sebastian, how is this going? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): You can find the latest version of the paper at https://github.com/sgraf812/late-lam-lift/blob/master/paper.pdf. It's still pretty rough and littered with TODOs, but you can take a glance at the two chapters I got so far. What's missing is an evaluation + the usual prologue and epilogue. While writing the paper, I realised a few things here and there that I could change in the transformation. For example, currently we can't lift functions that occur in argument position, i.e. {{{ let f x = ... x ... a ... b ... in let g y z = ... in g f 1 }}} Lifting `f` would mean having to allocate a PAP at the call site of `g` for `f a b`. One of our heuristics forbids the lift here, because it's nonsense: We immediately re-introduce the allocation we spared at each call site of the lifted function. The only situation I can think of in which this could be beneficial is when this pushes allocation from a hot code path into multiple cold call sites. Making the transformation correct here is somewhat of a nuisance, but probably needed in order to back my claims in section 3 (C1, in particular) with benchmark figures. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:40 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

For example, currently we can't lift functions that occur in argument
#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): Good stuff on the paper. I'll try to look. position, i.e. That seems correct doesn't it? i.e. the implementation doesn't lift such functions, and it shouldn't. So why do you say "I realised a few things here and there that I could change in the transformation"? What about the implementation. Are you done? Ready to submit a Phab patch for review? With an up-to-date wiki page describing what it does? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:41 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf):
That seems correct doesn't it? i.e. the implementation doesn't lift such functions, and it shouldn't.
The implementation doesn't lift these functions because we ''think'' they won't lead to a beneficial lift anyway. But that's rather a heuristic and for illustrative purposes (e.g., look how things get worse when we allow this) we ''could'' implement the transformation in a way that it can cope with these cases. But after having played with it for a few hours yesterday, I came to realise that this has more serious consequences than I anticipated. In addition to the StgApp case, the Constructor application and PrimOp cases are particularly egregious, because they require to return floats along with the actual tranformed RHS. I'm no longer convinced that the illustrative purpose justifies complicating the implementation so much.
So why do you say "I realised a few things here and there that I could change in the transformation"?
What about the implementation. Are you done? Ready to submit a Phab
In the process of writing the paper, I repeatedly had to double-check what the transformation does and that it actually does the right thing. For example, thinking about how to formulate the function computing closure growth resulting from a lifting decision lead me to realise that we in fact could make better use of strictness in addition to usage (i.e. one- shot) information. Example: {{{ let f = [x y]. \z -> ... g = [x y f]. \u -> ... x ... y ... f u ... in g 1 }}} What's the closure growth resulting from lambda lifting `f`? Previously, although `g`s closure would shrink by one slot (so closure growth -1 from `g`s RHS), we can't be certain that this benefit would be visible on every program path, so we'd have to return 0 here. But that's too conservative, because in fact `g` is called strictly with one argument (strictness demand `C(S)`), which means closure growth for the expression is actually -1 (not yet counting in the saved two slots from not having to allocate a closure for `f`). patch for review? With an up-to-date wiki page describing what it does? I consider the implementation done (probably should've submitted a patch much earlier). I'll bring the wiki page up to date and then submit a patch within the next week or so. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:42 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
In the process of writing the paper, I repeatedly had to double-check what the transformation does and that it actually does the right thing. For example, thinking about how to formulate the function computing closure growth resulting from a lifting decision lead me to realise that we in fact could make better use of strictness in addition to usage (i.e. one-shot) information
Great. So did you update the impl to take advantage of these new insights? I'll look forwards to the patch. An up to date nofib comparison would be good. And please yell when you are ready for me to take a proper sweep through the paper. Thanks! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:43 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * status: new => patch * differential: => D5224 Comment: I submitted a patch of the implementation at Phab:D5224 and brought the wiki page up to date. This is ready for review, although still lacking a unit test suite. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by potato44): * differential: D5224 => Phab:D5224 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:45 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I realised that while the heuristics currently employed work quite well, we are too pessimistic wrt. unsaturated applications. Currently, we don't lift binders that have undersaturated calls [https://github.com/sgraf812/ghc/blob/c4b8c04f26b5045da8015dd7289cd176a4addff... at all]. But there's nothing wrong with lifting those, as long as allocations don't get worse. So I set out to get rid of that heuristic and replaced it with a check that disallows occurrences in argument position instead (previously, this would be subsumed by the undersaturated call check). The results are inconclusive, with some significant slow-downs which I could pin-point to a bug in the heuristic that detects known calls. Currently, we say that a call to some variable would be lowered as a known call when `idArity bndr > 0`. But there actually are thunks (e.g. arity 0) of function type, which count as known calls. Abstracting over these thunks makes for an increase in allocation and runtime. I'm investigating more robust ways of detecting known calls. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:46 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Nevermind, the regression was due to lifting binders occuring in a nullary applications. Example from `CS`: {{{ main_$s$wgo = \r [sc sc1] case sc1 of wild { __DEFAULT -> let { lvl = \u [] case -# [wild 1#] of sat { __DEFAULT -> main_$s$wgo sc sat; }; } in let { sat = \r [s1] case plusInteger s1 ds of s' { __DEFAULT -> lvl s'; }; } in sat; 0# -> sc (); }; ==> main_$s$wgo = \r [sc sc1] case sc1 of wild { __DEFAULT -> let { lvl = \u [] case -# [wild 1#] of sat { __DEFAULT -> main_$s$wgo sc sat; }; } in $lsat lvl; 0# -> sc (); }; $lsat = \r [lvl s1] case plusInteger s1 ds of s' { __DEFAULT -> lvl s'; }; }}} (`main_$s$wgo` is a ternary function) This is a beneficial lift from the perspective of closure growth, but actually allocates more because the nullary application `sat` turns into a partial application `$lsat lvl` which allocates. Fixing this by regarding nullary applications as argument occurrences of the binder (e.g. a no-go) has the same effect to nofib benchmarks as just disallowing ''any'' undersaturated calls to the binder to lift. The latter is much simpler to check for (just look at the `StgBinderInfo` of the RHS) and is the status quo in Phab:D5224. So, no news, basically. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:47 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * milestone: => 8.8.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:48 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Re: ticket:15770#comment:11 (StgBinderInfo is going to be removed in Phab:D5232) I can probably just reuse [https://github.com/sgraf812/ghc/compare/sgraf812:46933fe...sgraf812:96d1401 a branch I had lying around] that didn't have any consequences for benchmark performance. Still, I'll probably need to wait for the changes from Phab:D5232 to hit master. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:49 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
I realised that while the heuristics currently employed work quite well, we are too pessimistic wrt. unsaturated applications.
As I say on Phab, we really need a Note summarising the heuristics. (And of course the paper will help a lot.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:50 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by maoe): * cc: maoe (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:51 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by sgraf): * Attachment "nofib.txt" added. Runtime measurements, ignoring all with <200ms runtime, LLVM backend with -Os -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: patch Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I'm currently trying to find the right configuration for Runtime benchmarking. When using the NCG on the architecture I benchmark on, there are seemingly random outliers performance-wise, even when ignoring benchmarks with less than 200ms running time. Take `CSD` from `real/eff` for example. On the target architecture (i7-6700), things consistently are 4.5% slower, yet ''there isn't a single lifted function in that benchmark''. It's basically just a counting loop. To make matters worse, I can't reproduce this on my local PC, quite the contrary there. Altogether this makes for a very meager improvement of -0.2% in runtime. This leads me to believe that the (relatively minor) benefits are obscured by code size and layout concerns. If I only include benchmarks that ran at least 500ms, things look much better (-0.4%), but that's probably because I excluded the `eff` 'microbenchmarks'. I tried another configuration that probably does better justice to the optimisation: I re-ran the benchmarks with `-fllvm -optlo -Os` to have the LLVM optimise for size concerns which IME yields less code layout dependent results. Anyway, ignoring benchmarks with <200ms runtime yields an improvement of -1.0% (result: https://ghc.haskell.org/trac/ghc/attachment/ticket/9476/nofib.txt), while ignoring all benchmarks with <500ms runtime yields an -1.2% improvement. Ironically, runtime of `CSD` ''improved'' by -7.1%. Notable is also that while `n-body` allocates 20% less (heap space!), it got slower by a non-meaningful margin of 0.1%. Maybe watching out for allocations isn't the be all end all here. I really think we should flag benchmarks for being eligible for runtime measurements. I get hung up on what are architectural wibbles ''all the time''. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:52 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: sgraf
Type: feature request | Status: patch
Priority: normal | Milestone: 8.8.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords: LateLamLift
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224
Wiki Page: LateLamLift |
-------------------------------------+-------------------------------------
Comment (by Sebastian Graf

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Changes (by bgamari): * status: patch => closed * resolution: => fixed -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:54 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I'm currently writing the evaluation chapter of the paper and hit quite a bummer. If I add an option to ''ignore'' closure growth completely, instructions improve by another 0.3% in the mean, with no regressions (!) wrt. the baseline where we try to avoid closure growth. Although total allocations possibly increase, executed instructions ''go down'' in some cases. The most prominent case is `paraffins`: +18.6% more allocations, but -11.7% less executed instructions! Example run: {{{ $ ./default 19 +RTS -s > /dev/null 359,457,968 bytes allocated in the heap 536,016,504 bytes copied during GC 163,983,272 bytes maximum residency (8 sample(s)) 1,699,968 bytes maximum slop 156 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 339 colls, 0 par 0.137s 0.137s 0.0004s 0.0025s Gen 1 8 colls, 0 par 0.386s 0.386s 0.0482s 0.1934s INIT time 0.000s ( 0.000s elapsed) MUT time 0.058s ( 0.058s elapsed) GC time 0.523s ( 0.523s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.581s ( 0.581s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 6,243,240,498 bytes per MUT second Productivity 9.9% of total user, 9.9% of total elapsed $ ./allow-cg 19 +RTS -s > /dev/null 426,433,296 bytes allocated in the heap 488,364,024 bytes copied during GC 139,063,776 bytes maximum residency (8 sample(s)) 2,223,648 bytes maximum slop 132 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 403 colls, 0 par 0.136s 0.136s 0.0003s 0.0010s Gen 1 8 colls, 0 par 0.317s 0.317s 0.0397s 0.1517s INIT time 0.000s ( 0.000s elapsed) MUT time 0.080s ( 0.080s elapsed) GC time 0.453s ( 0.453s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.533s ( 0.533s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 5,359,023,067 bytes per MUT second Productivity 14.9% of total user, 14.9% of total elapsed }}} Note how allocations and number of collections (these are correlated) went up, but bytes copied and GC time (also correlated) went down. The only possible conclusion here is that viewing bytes allocated as a metric for predicting runtime is flawed: What matters most for runtime is bytes copied during GC. Of course there's an overhead for heap checks etc., but it seems that GC pressure far outweighs that. Interestingly, if I 'disable' the GC by allocating 600MB of nursery, the inverse effect manifests: {{{ $ ./default 19 +RTS -s -A600M > /dev/null 359,104,696 bytes allocated in the heap 3,384 bytes copied during GC 44,480 bytes maximum residency (1 sample(s)) 25,152 bytes maximum slop 0 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s INIT time 0.009s ( 0.009s elapsed) MUT time 0.127s ( 0.127s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.136s ( 0.136s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 2,821,937,180 bytes per MUT second Productivity 93.4% of total user, 93.5% of total elapsed $ ./allow-cg 19 +RTS -s -A600M > /dev/null 426,014,360 bytes allocated in the heap 3,384 bytes copied during GC 44,480 bytes maximum residency (1 sample(s)) 25,152 bytes maximum slop 0 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s INIT time 0.011s ( 0.011s elapsed) MUT time 0.142s ( 0.142s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.153s ( 0.153s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 2,994,250,656 bytes per MUT second Productivity 92.9% of total user, 92.9% of total elapsed }}} This is probably because of cache effects, although apparently there seem to be strictly less instructions executed in the `allow-cg` case, weird. Also I don't think that enlarging the nursery to fit all allocations is a realistic benchmark scenario. The takeaway here is that lambda lifting stuff somehow has beneficial effects on GC, even if overall allocations go up and more collections happen as a consequence. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:55 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): So, why does lambda lifting more bindings lead to better code here? Honestly, I'm at a loss. I identified that lifting `go` below is what seems to lead to a smaller working set, or at least lifting just this function leads to above speedups: {{{ foo = go z where go [] = z2 go (y:ys) = let sat2 = go ys2 sat1 = C x y z in sat1:sat2 x = ... y = ... z2 = ... }}} `go` has three free vars: `x`, `y` and `z2`. Lifting `go` implies an increase in allocations, because the closure for `sat2` grows under a multi-shot lambda: {{{ go x y z2 [] = z2 go x y z2 (y:ys) = let sat2 = go x y z2 ys2 sat1 = C x y z in sat1:sat2 foo = go x y z2 z where x = ... y = ... z2 = ... }}} Yet, the working set for the GC gets smaller, so it seems like the second version produces more, but shorter-lived garbage. Or at least the GC is smarter about the last version than about the first. I'm not sure what makes `go` so special! If I wouldn't have numbers, I'd totally say that it's not a worthwhile lift. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:56 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
So, why does lambda lifting more bindings lead to better code here?
Where is "here"? I'm a bit lost. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:57 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I was referring to the `paraffins` measurements in comment:55. Deactivating the closure growth checks (thus allowing more bindings to lift at the potential cost of more allocations) leads to an overall improvement, most significantly visible in `paraffins` with an increase of 18% in allocations but 11% less executed instructions. The measurements I posted are for runtime. There's a reproducible speedup (0.533s vs 0.581s), too, which is mostly due to trading time in the GC for time in the mutator. That's also apparent by resizing the nursery so that no GC needs to happen (the second pair of measurements with `-A600M`). I regard this as 'better code', because it's faster in the default case, where we don't tune the nursery to fit all allocations, i.e. more mutator time is OK as long as total time goes down. In comment:56 I go on to wonder why that is the case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:58 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): Ah, I see. Thanks. It is indeed mysterious to me how allocating more less to less GC time. Paraffins is a reasonably small example, with a small inner loop, so there is some hope of finding out. With `+RTS -Sstderr` you'll get info on GC calls, including the residency numbers. With `-G1` (I think) you'll get just one generation, so every GC does a full sweep and you'll get accurate residency (= live data) info. That's useful as a way to avoid accidents such as the precise moment when major GC strikes. Does the improvement continue with one generation ? (I.e. both before and after with one generation.) With a program this small it must be possible to truly understand where the better GC numbers are coming from. It's be interesting to know if ignoring closure growth improves all programs (or at least does not make them worse). Even we didn't understand why, if it's generally true we can switch it off -- or just simplify the code by removing the analysis. One idea is this. After lambda-lifting, it's possible that `go` is strict in one of its new arguments, such as x, y, or z2. Ahh... but this lambda lifting occurs after strictness analysis, so it can't be that. OK, I'm at a loss. More info needed. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:59 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Thanks for pointing me to `-G1`, very interesting! The difference in bytes copied and consequently runtime is even more grave: {{{ $ ./default 19 +RTS -s -G1 > /dev/null 359,455,256 bytes allocated in the heap 334,966,000 bytes copied during GC 188,250,032 bytes maximum residency (9 sample(s)) 2,125,824 bytes maximum slop 179 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 9 colls, 0 par 0.431s 0.431s 0.0478s 0.2337s INIT time 0.000s ( 0.000s elapsed) MUT time 0.123s ( 0.123s elapsed) GC time 0.431s ( 0.431s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.554s ( 0.554s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 2,928,555,183 bytes per MUT second Productivity 22.2% of total user, 22.2% of total elapsed $ ./allow-cg 19 +RTS -s -G1 > /dev/null 401,712,312 bytes allocated in the heap 185,583,392 bytes copied during GC 97,712,192 bytes maximum residency (38 sample(s)) 1,275,840 bytes maximum slop 93 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 38 colls, 0 par 0.221s 0.221s 0.0058s 0.1098s INIT time 0.000s ( 0.000s elapsed) MUT time 0.104s ( 0.104s elapsed) GC time 0.221s ( 0.221s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.325s ( 0.325s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 3,878,228,290 bytes per MUT second Productivity 31.9% of total user, 31.9% of total elapsed }}} The residency was cut in half! Also note the difference in number of collections and that MUT is lower than the baseline (that doesn't lift the `go` function above). Probably a caching side-effect of the smaller residency, as the situation is still the same with `-A400M`, where the baseline is faster. I don't know how, but I suspect that lifting `go` causes the GC to be less conservative about liveness of some closure objects. If I had to guess, then something keeps the closure of `go` longer alive than the growing `sat2` thunk in `go`. I played around with heap/retainer profiling, but to no avail yet. Here is a gist with the `-S -G1` output: https://gist.github.com/sgraf812/5fbcf6b81fdd7c8af1a6060832bbfa11 There are two interesting things to point out: 1. The lifted version collects much more often, but only after completing the computing intensive work. Not sure why there are so many of them, seems redundant 2. Compared to the baseline, the residency (and consequently the total heap size, it seems) grows slower, but the increase in total bytes allocated leads to an additional collection before everything drops to constant space. Not sure what to make of that data, but it doesn't contradict what I said about the closure of `go` being kept alive longer than `sat2`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:60 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): Interesting. The fact that this program has a tiny inner loop makes it feasibly to dig into. If I were doing this, I think the next thing I'd try would be to histogram the heap contents (info-pointer, number of closures with that info pointer) at each GC. This is roughly want heap profiling can do, but it may collapse too many things into one. More ambitiously, the heap profiler is supposed to be able to plot "lag"; that is, stuff that is kept alive, but is never subsequently referenced. If that is still working it'd be interesting to know whether or not the extra residency is hanging onto stuff that is never subsequently referenced. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:61 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I'd like to give bibliographical profiling a try, but enabling profiling seems to get rid of the effects: No difference in residency anymore. If completely disable closure growth checks (not just for the particular hot loop), residency worsens. Runtime also is consistently worse. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:62 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): ''Heap'' profiling should change nothing. (In contrast to ''time'' profiling with `-prof`.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:63 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): If I omit `-prof`, the RTS yells at me when doing anything other than `-hT` or `-h` (what's the difference between these two anyway?). The output of `-hT` is not particularly revealing: https://pp.ipd.kit.edu/~sgraf/hs/. It's climbing to 80MB for `:` and another 100MB for `C`. Interestingly, the maximum residency is higher than the baseline, so it seems it's not possible to observe the effects we are seeing without `-hT`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:64 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): It seems this whole fuzz was about nothing. Obvious in hind-sight, I wasn't aware that the maximum residency and bytes copied are just based on sampling when the GC runs, which in turn depends on heap size. Although I played around with `-A` before, I just ran a script which would find the maximum residency over multiple different heap sizes, to get these numbers: {{{ $ ./default 19 +RTS -s -G1 -A56M 359,289,696 bytes allocated in the heap 313,229,544 bytes copied during GC 192,670,160 bytes maximum residency (4 sample(s)) 2,757,856 bytes maximum slop 183 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 4 colls, 0 par 0.405s 0.405s 0.1013s 0.2541s INIT time 0.001s ( 0.001s elapsed) MUT time 0.119s ( 0.119s elapsed) GC time 0.405s ( 0.405s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.524s ( 0.525s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 3,031,050,062 bytes per MUT second Productivity 22.6% of total user, 22.6% of total elapsed $ ./allow-cg 19 +RTS -s -G1 -A76M 401,485,600 bytes allocated in the heap 331,564,512 bytes copied during GC 196,161,944 bytes maximum residency (4 sample(s)) 1,389,968 bytes maximum slop 187 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 4 colls, 0 par 0.436s 0.441s 0.1102s 0.2637s INIT time 0.001s ( 0.001s elapsed) MUT time 0.132s ( 0.132s elapsed) GC time 0.436s ( 0.441s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.569s ( 0.574s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 3,049,743,734 bytes per MUT second Productivity 23.1% of total user, 23.0% of total elapsed }}} Still, the impact of GC parameters is very annoying. Also, when I vary `-A$iM`, where `$i` is an integer between 1 and 200, in 123 of 200 cases, the baseline has higher maximum residency than when we also decide to lift `go`. It seems that the GC parameterisation for the baseline is just in a bad (bitter?) spot. The question is, how do I sell this in the paper? I guess I could increase nursery size even further (I'm currently benchmarking with `-A128M -H1G`), but that's not very realistic, either... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:65 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Ah, just remembered that counted instructions were the original reason to look into this and with the NCG, cachegrind reports less instructions for `./allow-cg 17 +RTS -s -A128M -H1G` than for `./default 17 +RTS -s -A128M -H1G`. Compiling with `-fllvm -optlo -Os` works around this, so I'll try a benchmark run with that. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:66 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): I think there is an option to force GC to happen more frequently, but I can't see it in the manual. Simon Marlow might know. In any case, getting an accurate residency profile, by using one generation and frequent gcs, is a Good Thing. when you do that, to the residency profiles of the two versions differ? I didn't understand the figures you show above. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:67 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Replying to [comment:67 simonpj]:
I think there is an option to force GC to happen more frequently, but I can't see it in the manual. Simon Marlow might know.
In any case, getting an accurate residency profile, by using one generation and frequent gcs, is a Good Thing. when you do that, to the residency profiles of the two versions differ? I didn't understand the
There's `-i`, but that only seems to work in conjunction with `-h`: {{{ -h Heap residency profile (output file <program>.hp) -i<sec> Time between heap profile samples (seconds, default: 0.1) }}} Also, even with `-i0.0001s -h` the sampled maximum residency doesn't come close to the 192/196MB above. figures you show above. Sorry for causing you trouble to understand what I wrote, I really have to figure out better ways to present my results. So, yeah, using one generation and 'just the right GC frequency' is basically what I did in comment:65. I defined 'just the right frequency' by choosing the `-A$iM` setting (which determines the points at which GC happens) for which maximum residency is at its maximum for each of the two candidates. For the `default`, which looks at closure growth, this setting was `-A56M` with a maximum residency of 192MB. For the version where we also lift said `go` function (which we normally would not), maximum residency was at 196MB for `-A76M`. That's worse than the baseline, as I initially expected. The fuzz above was all for nothing, because I once again merely measured the RTS parameterisation. In comment:66 I remembered that despite better runtime measurements with the baseline, counted instructions regressed when 'turning off' GC with `-A128M`. That's strange, but I'll probably refer to runtime measurements in the paper now and ignore counted instructions completely. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:68 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by osa1): I don't think we have a parameter to force GC to happen more frequently (there's -I but that only applies to idle GC). I think the best you can do is adding a bunch of `performMajorGC`s in your program. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:69 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj): I'm sorry I'm so slow, but I still don't understand the details of comment:68. But no matter! * I think we need a way to get accurate residency profiles, which in turn mean doing frequent, major garbage collections. Then we can present meaningful results, rather than fumble around in the dark. * To do this, either we need to add a RTS flag or... doesn't it happen anyway when we have a very small nursery? Each time the nursery runs out, we GC, and with -G1 that's a major GC. What am I missing? Then I believe you are saying "Ignoring closure growth has no effect on residency". Is that right? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:70 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Replying to [comment:70 simonpj]:
* To do this, either we need to add a RTS flag or... doesn't it happen anyway when we have a very small nursery? Each time the nursery runs out, we GC, and with -G1 that's a major GC. What am I missing?
Yes, doing `-G1` and varying the initial heap size with `-A$iM` was exactly what I have been doing to find out a close approximation of the maximum residency (the one of related to program semantics, not GC samples). (Note that when a GC happens with `-G1`, the heap is resized to 2*bytes copied according to `-S`, so the 'nursery' grows in `-G1`. Also I don't think that we can vary GC frequency through an RTS flag, because to my knowledge, the decision when to GC is entirely made by the mutator) Example: Let's say I start with `-A40M` (always `-G1` from now on) and let's assume that the picture shows the actual heap profile (e.g., with maximum sampling frequency), where the points in time at which GC runs are indicated in red: [[Image(https://i.imgur.com/RpeuCXc.jpg, 400px)]] The maximum residency observed by GC is smaller than the actual maximum residency, which I call the maximum working set for the remainder of this post for disambiguation. How can we make the sampled maximum residency approximate the maximum working set? We can vary the GC frequency by setting the initial heap size to a different value. It turns out that within the integer range of 1 to 200, `-A56M` seems to be the candidate with the closest approximation (closest meaning it sampled the highest maximum residency): [[Image(https://i.imgur.com/BfvbKSl.jpg, 400px)]] In particular, we don't really care how often major GC happens, only that one major GC happens ''immediately before'' the working set collapses. The `-A56M` is highly specific to the program (in this case, `default` above) and doesn't really matter, it's just the parameterisation where the sampled maximum residency most closely approximates the actual maximum working set. It's similar to modular arithmetic, if that's of any help: Imagine you want to find a number below 20 with the closest multiple to 73 (prime), but not larger than that. There's 8*9=72 or 6*12 or 4*18, etc. It doesn't really matter which factor you choose, point is that you get close to 73. The specific parameterisation for `allow-cg` to approximate its maximum working set was `-A76M -G1`, but it doesn't matter; what matters is that the sampled maximum residency for `default` was lower than `allow-cg` (192 vs. 196MB), so everything as expected.
Then I believe you are saying "Ignoring closure growth has no effect on residency". Is that right?
Well, it's hard to tell how close those samples are to the actual maximum working set, but under the assumption that they are close enough, I'd say that ignoring closure growth indeed leads to a bigger working set in this case (as to be expected), although it doesn't really manifest itself in the majority of GC parameterisations I sampled (cf. the second-to-last paragraph in comment:65). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:71 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by simonpj):
what matters is that the sampled maximum residency for default was lower than allow-cg (192 vs. 196MB)
Interesting. So the (fairly complicated) closure-growth analysis busy us only 2% reduction in residency. That raises the question of whether the gain is worth the pain. We should explore that in the paper -- and of course measure more programs. We don't want a tricky analysis that doesn't pay its way. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:72 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I already wrote something down in [https://github.com/sgraf812/late-lam- lift/blob/master/paper.pdf section 4 here]. Table 2 turns off closure growth, in particular. The results are fairly inconclusive, but I argue that we want to avoid arbitrary increases in allocations in order to stay predictable. For example `wheel-sieve1`, with a nearly 30% increase in allocation, had an increase in residency by 5.7%. I imagine that there are programs that could trigger some exponential worst case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:73 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): The `-F1 -A1M` trick (hat tip in direction of Simon Marlow) for measuring residency discussed on [https://mail.haskell.org/pipermail/ghc- devs/2018-December/016639.html ghc-dev] is much more accurate. It seems that the difference in residency isn't as big as I thought; In the `./paraffins 19` setup, it's 196MB in both cases, with just a little bit more residency (I measured 300kb difference, I'm not confident this is really accurate or measurable) when we ignore possible closure growth. For `wheel-sieve1`, the difference in residency also is about 300kb, of 12.6MB. (Still, 30% more allocations in total.) So it seems that we produce significantly more garbage in some cases, but residency stays about the same. On the other hand, producing more garbage means more frequent Gen 0 collects. That's also evident by the number of collections in `-F1 -A1M`: 42 vs. 55 when ignoring closure growth. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:74 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): Now that #15999 has a fix, I redid some of the benchmarks. Before I got distracted, we were discussing whether the allocation heuristic ("don't allow closure growth") is really effective. Here are the results against 8.6 as the baseline: https://gist.github.com/sgraf812/d3eb0b2251512a1845adcbf356248c25 `ll-$date` is disallowing closure growth (notice absence of any regression in allocations), whereas `ll-c2-$date` allows closure growth and yields marginally better counted instructions at that. I expected more regressions in `ll-c2`, but there's really just `paraffins` regressing by over 1% wrt. `ll`. Note that `wheel-sieve1` still regresses by 28.7% in allocations, while not providing a real benefit. Let's see what `paraffins` is up to and what runtime measurements suggest. Maybe we can get rid of the closure growth heuristic (which would be a shame, but that's science). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:75 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): OK, it turns out that `paraffins` regresses by 8% in counted instructions when allowing arbitrary closure growth because of the very same function I pinned down as the reason for the alleged speed-up in comment:56. It seems like we don't want to lift the inner loop of such a list comprehension. On the other hand, there's `gen_regexp` which improves by 2-5% in runtime when we lift the next-to-inner function of a list comprehension (`go_s7zy` below): {{{ let { go_s7zy = sat-only \r [x1_s7zz] case ># [x1_s7zz y_s7zx] of { __DEFAULT -> case chr# [x1_s7zz] of ds13_s7zB { __DEFAULT -> let { ds14_s7zC = CCCS C#! [ds13_s7zB]; } in let { z_s7zD = \u [] case +# [x1_s7zz 1#] of sat_s7zE { __DEFAULT -> go_s7zy sat_s7zE; }; } in let { go1_s7zF = sat-only \r [ds15_s7zG] case ds15_s7zG of { [] -> z_s7zD; : y1_s7zI ys_s7zJ -> let { sat_s7zL = \u [] go1_s7zF ys_s7zJ; } in let { sat_s7zK = CCCS :! [ds14_s7zC y1_s7zI]; } in : [sat_s7zK sat_s7zL]; }; } in go1_s7zF lvl13_s7zw; }; 1# -> [] []; }; } in case ord# [c1_s7za] of sat_s7zM { __DEFAULT -> go_s7zy sat_s7zM; }; }}} Lifting `go_s7zy` leads to a very slight increase in allocation (so would be disallowed from the perspective of closure growth), but a decrease in runtime. Maybe this is some low-level thing again, I don't know. The difference in Cmm is as expected and I wouldn't have expected a speed-up. So, to sum this up: Deactivating closure growth checks, thus allowing arbitrary closure growth resulting from lambda lifting - leads to an improvement of ~0.1% in runtime and counted instructions (though the two don't necessarily coincide). - also leads to unpredictable regressions in total allocations, like +28.6% in `wheel-sieve1`. From the regressions I investigated (`gen_regexp`, for example), the bindings the lifting decisions of which are responsible for the biggest regression in allocations (+10% due to the inner loop above, `go1_s7zF`) are not necessarily the same bindings which contribute the speed-ups (`go_s7zy` above). I'm not sure how to come up with a criterion that separates the two classes (beneficial to lift vs. not) of bindings, so I suggest we stick to the current closure growth heuristic. Which is a little more conservative than necessary, but avoids arbitrary regressions in allocations on STG level. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:76 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9476: Implement late lambda-lifting -------------------------------------+------------------------------------- Reporter: simonpj | Owner: sgraf Type: feature request | Status: closed Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: LateLamLift Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8763 #13286 | Differential Rev(s): Phab:D5224 Wiki Page: LateLamLift | -------------------------------------+------------------------------------- Comment (by sgraf): I think the [https://github.com/sgraf812/late-lam- lift/blob/master/paper.pdf paper] is now in a good shape for feedback. simonpj, would you give it a try? I marked some places with a TODO where I'd particularly appreciate another opinion. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:77 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC