
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  closed
        Priority:  normal            |            Milestone:  8.8.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:  fixed             |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |   Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):  Phab:D5224
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

I'm currently writing the evaluation chapter of the paper and hit quite a
bummer. If I add an option to ''ignore'' closure growth completely,
instructions improve by another 0.3% in the mean, with no regressions (!)
relative to the baseline in which we try to avoid closure growth.

Although total allocations may increase, executed instructions ''go down''
in some cases. The most prominent case is `paraffins`: 18.6% more
allocations, but 11.7% fewer executed instructions! Example run:

{{{
$ ./default 19 +RTS -s > /dev/null
     359,457,968 bytes allocated in the heap
     536,016,504 bytes copied during GC
     163,983,272 bytes maximum residency (8 sample(s))
       1,699,968 bytes maximum slop
             156 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       339 colls,     0 par    0.137s   0.137s     0.0004s    0.0025s
  Gen  1         8 colls,     0 par    0.386s   0.386s     0.0482s    0.1934s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.058s  (  0.058s elapsed)
  GC      time    0.523s  (  0.523s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.581s  (  0.581s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,243,240,498 bytes per MUT second

  Productivity   9.9% of total user, 9.9% of total elapsed

$ ./allow-cg 19 +RTS -s > /dev/null
     426,433,296 bytes allocated in the heap
     488,364,024 bytes copied during GC
     139,063,776 bytes maximum residency (8 sample(s))
       2,223,648 bytes maximum slop
             132 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       403 colls,     0 par    0.136s   0.136s     0.0003s    0.0010s
  Gen  1         8 colls,     0 par    0.317s   0.317s     0.0397s    0.1517s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.080s  (  0.080s elapsed)
  GC      time    0.453s  (  0.453s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.533s  (  0.533s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    5,359,023,067 bytes per MUT second

  Productivity  14.9% of total user, 14.9% of total elapsed
}}}

Note how allocations and the number of collections (these are correlated)
went up, but bytes copied and GC time (also correlated) went down. The only
possible conclusion here is that bytes allocated is a flawed metric for
predicting runtime: what matters most for runtime is bytes copied during
GC. Of course there's an overhead for heap checks etc., but it seems that
GC pressure far outweighs that.
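To make the tradeoff concrete, here is a minimal, made-up sketch (the names
and the program are hypothetical, not from `paraffins` or from the pass
itself) of what lifting a local function does and where closure growth can
come from:

{{{
-- Hypothetical illustration of lambda lifting and closure growth.

-- Before lifting: the local 'go' closes over 'a' and 'b', so a function
-- closure for 'go' is allocated on each call to 'f'.  The thunk for 'ys'
-- only needs to capture 'go' and 'xs'.
f :: Int -> Int -> [Int] -> [Int]
f a b xs =
  let go y = a * y + b      -- free variables: a, b
      ys   = map go xs      -- thunk capturing go, xs
  in  ys

-- After lifting: 'goLifted' is a top-level function that takes its former
-- free variables as extra arguments, so no closure is allocated for it.
-- In exchange, every closure that used to mention 'go' must now capture
-- 'a' and 'b' directly -- here the thunk for 'ys' captures a, b and xs
-- instead of go and xs.  Whether that is a net win or a loss in allocation
-- is what the closure-growth estimate tries to predict, and what the
-- experiment above deliberately ignores.
goLifted :: Int -> Int -> Int -> Int
goLifted a b y = a * y + b

fLifted :: Int -> Int -> [Int] -> [Int]
fLifted a b xs =
  let ys = map (goLifted a b) xs
  in  ys
}}}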
Interestingly, if I ''disable'' the GC by allocating 600MB of nursery, the
inverse effect manifests:

{{{
$ ./default 19 +RTS -s -A600M > /dev/null
     359,104,696 bytes allocated in the heap
           3,384 bytes copied during GC
          44,480 bytes maximum residency (1 sample(s))
          25,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.009s  (  0.009s elapsed)
  MUT     time    0.127s  (  0.127s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.136s  (  0.136s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,821,937,180 bytes per MUT second

  Productivity  93.4% of total user, 93.5% of total elapsed

$ ./allow-cg 19 +RTS -s -A600M > /dev/null
     426,014,360 bytes allocated in the heap
           3,384 bytes copied during GC
          44,480 bytes maximum residency (1 sample(s))
          25,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.011s  (  0.011s elapsed)
  MUT     time    0.142s  (  0.142s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.153s  (  0.153s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,994,250,656 bytes per MUT second

  Productivity  92.9% of total user, 92.9% of total elapsed
}}}

This is probably due to cache effects, although there still seem to be
strictly fewer instructions executed in the `allow-cg` case, which is
weird. Also, I don't think that enlarging the nursery to fit all
allocations is a realistic benchmark scenario.

The takeaway here is that lambda lifting somehow has beneficial effects on
GC, even if overall allocations go up and more collections happen as a
consequence.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:55>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler