
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  patch
        Priority:  normal            |            Milestone:  8.8.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):  Phab:D5224
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

I'm currently trying to find the right configuration for runtime benchmarking.

When using the NCG on the architecture I benchmark on, there are seemingly random performance outliers, even when ignoring benchmarks with less than 200ms running time. Take `CSD` from `real/eff` for example. On the target architecture (i7-6700), it is consistently 4.5% slower, yet ''there isn't a single lifted function in that benchmark''. It's basically just a counting loop. To make matters worse, I can't reproduce this on my local PC; I see quite the contrary there.

Altogether this makes for a very meager improvement of -0.2% in runtime. This leads me to believe that the (relatively minor) benefits are obscured by code size and layout concerns. If I only include benchmarks that ran for at least 500ms, things look much better (-0.4%), but that's probably because I excluded the `eff` 'microbenchmarks'.

I tried another configuration that probably does the optimisation better justice: I re-ran the benchmarks with `-fllvm -optlo -Os` to have LLVM optimise for size, which IME yields results that are less dependent on code layout. With that, ignoring benchmarks with <200ms runtime yields an improvement of -1.0% (result: https://ghc.haskell.org/trac/ghc/attachment/ticket/9476/nofib.txt), while ignoring all benchmarks with <500ms runtime yields a -1.2% improvement.
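
For readers who want to redo the threshold filtering on their own numbers, here is a rough Python sketch of the kind of summary described above. It is not the script used for these results and not how nofib-analyse is implemented. It assumes the summary figure is a geometric mean of per-benchmark runtime ratios and that the cut-off applies to the baseline runtime. The function name `geomean_change`, the `results` layout and all timings below are made up for illustration.

```python
import math


def geomean_change(results, min_runtime_s):
    """Relative runtime change (e.g. -0.01 for -1.0%), taken as the
    geometric mean of new/baseline ratios over all benchmarks whose
    baseline ran for at least `min_runtime_s` seconds.

    `results` maps a benchmark name to (baseline_seconds, new_seconds).
    """
    ratios = [
        new / base
        for base, new in results.values()
        if base >= min_runtime_s
    ]
    if not ratios:
        raise ValueError("no benchmark meets the runtime threshold")
    # Average in log space, then map back: this is the geometric mean.
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return math.exp(log_mean) - 1.0


# Hypothetical timings in seconds: (baseline, with the optimisation).
results = {
    "short": (0.10, 0.12),    # noisy microbenchmark, below both cut-offs
    "medium": (0.30, 0.297),
    "long": (2.00, 1.96),
}

for threshold in (0.2, 0.5):
    change = geomean_change(results, threshold)
    print(f">= {threshold * 1000:.0f}ms: {change * 100:+.1f}%")
```

Raising the threshold only changes which benchmarks enter the mean, which is why the summary figure can move noticeably when a handful of short-running programs drop out.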

Ironically, the runtime of `CSD` ''improved'' by -7.1%. It is also notable that while `n-body` allocates 20% less (heap space!), it got slower by a non-meaningful margin of 0.1%. Maybe watching out for allocations isn't the be-all and end-all here.

I really think we should flag benchmarks as eligible for runtime measurements. I get hung up on what are architectural wibbles ''all the time''.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9476#comment:52
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler