
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  closed
        Priority:  normal            |            Milestone:  8.8.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:  fixed             |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):  Phab:D5224
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 Thanks for pointing me to `-G1`, very interesting! The difference in bytes
 copied, and consequently in runtime, is even more pronounced:

 {{{
 $ ./default 19 +RTS -s -G1 > /dev/null
      359,455,256 bytes allocated in the heap
      334,966,000 bytes copied during GC
      188,250,032 bytes maximum residency (9 sample(s))
        2,125,824 bytes maximum slop
              179 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0         9 colls,     0 par    0.431s   0.431s     0.0478s    0.2337s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.123s  (  0.123s elapsed)
   GC      time    0.431s  (  0.431s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.554s  (  0.554s elapsed)

   %GC     time       0.0%  (0.0% elapsed)

   Alloc rate    2,928,555,183 bytes per MUT second

   Productivity  22.2% of total user, 22.2% of total elapsed

 $ ./allow-cg 19 +RTS -s -G1 > /dev/null
      401,712,312 bytes allocated in the heap
      185,583,392 bytes copied during GC
       97,712,192 bytes maximum residency (38 sample(s))
        1,275,840 bytes maximum slop
               93 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
   Gen  0        38 colls,     0 par    0.221s   0.221s     0.0058s    0.1098s

   INIT    time    0.000s  (  0.000s elapsed)
   MUT     time    0.104s  (  0.104s elapsed)
   GC      time    0.221s  (  0.221s elapsed)
   EXIT    time    0.000s  (  0.000s elapsed)
   Total   time    0.325s  (  0.325s elapsed)

   %GC     time       0.0%  (0.0% elapsed)

   Alloc rate    3,878,228,290 bytes per MUT second

   Productivity  31.9% of total user, 31.9% of total elapsed
 }}}

 The residency was cut in half! Also note the difference in the number of
 collections, and that MUT is lower than in the baseline (which doesn't
 lift the `go` function above). That is probably a caching side-effect of
 the smaller residency, as the situation is still the same with `-A400M`,
 where the baseline is faster.

 I don't know how, but I suspect that lifting `go` causes the GC to be less
 conservative about the liveness of some closure objects. If I had to
 guess, something keeps the closure of `go` alive longer than the growing
 `sat2` thunk in `go`. I played around with heap/retainer profiling, but to
 no avail yet.

 Here is a gist with the `-S -G1` output:
 https://gist.github.com/sgraf812/5fbcf6b81fdd7c8af1a6060832bbfa11

 There are two interesting things to point out:

  1. The lifted version collects much more often, but only after completing
     the compute-intensive work. I'm not sure why there are so many
     collections; they seem redundant.
  2. Compared to the baseline, the residency (and consequently the total
     heap size, it seems) grows more slowly, but the increase in total
     bytes allocated leads to an additional collection before everything
     drops to constant space.

 I'm not sure what to make of that data, but it doesn't contradict what I
 said about the closure of `go` being kept alive longer than `sat2`.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:60>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
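
For readers unfamiliar with what "lifting `go`" in the comment above refers to, here is a minimal sketch of the transformation in source-level Haskell. It is not the benchmark from the ticket: `sumUnlifted`, `sumLifted`, `goLifted` and the loop body are hypothetical names and code chosen purely for illustration. The real pass works on STG rather than on source, and GHC may well compile a local loop like this one without allocating any closure at all.

```haskell
module Main (main) where

-- Hypothetical example; not the program measured in the ticket.

-- Unlifted shape: `go` is a local function whose free variables are
-- `step` and `n`. A local function with free variables may have to be
-- represented as a heap-allocated closure that captures them.
sumUnlifted :: Int -> Int -> Int
sumUnlifted step n = go 0 0
  where
    go :: Int -> Int -> Int
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i * step) (i + 1)

-- Lifted shape: the former free variables `step` and `n` are passed as
-- extra parameters, so `goLifted` is an ordinary top-level function
-- with nothing left to capture.
goLifted :: Int -> Int -> Int -> Int -> Int
goLifted step n acc i
  | i > n     = acc
  | otherwise = goLifted step n (acc + i * step) (i + 1)

sumLifted :: Int -> Int -> Int
sumLifted step n = goLifted step n 0 0

main :: IO ()
main = do
  -- Both variants compute the same result (2 * (0 + 1 + ... + 10) = 110).
  print (sumUnlifted 2 10)
  print (sumLifted 2 10)
```

The trade-off this sketch is meant to show is that lifting swaps a captured environment for extra arguments at every call. This is why the transformation can change which objects stay reachable without changing the result.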