[GHC] #15124: Improve block layout for the NCG

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 8.2.2 (NCG) | Keywords: CodeGen | Operating System: Unknown/Multiple Architecture: | Type of failure: None/Unknown Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- The current codegen tries to place two blocks A and B next to each other if they are "close". = Drawbacks of the current algorithm Blocks are only considered close if one jumps to the other via a unconditional jump instructions. This is a fast and reasonable approximation. But we might be able to do better. == Calls are not considered even if they always return to the same block. For example it's clear that if foo == bar we jump to C here: {{{ A: movq $block_C_info,-16(%rbp) cmp foo,bar jne D B: jmp *(%rbx) #Returns to C ... .align 8 .infotable block_C_info: C: doThings }}} == Conditional jumps are never considered. The following block either jumps directly to c3mQ or returns there from a call. Yet that is completely ignored by the current algorithm. {{{ _c3mG: movq $block_c3mQ_info,-8(%rbp) ... jne _c3mQ jmp *(%rbx) #returns to c3mQ }}} == Likelyhood of branches is not considered. We track information about branches which are rarely taken at the cmm level. However this information is currently lost during the Cmm -> Asm transformation which happens before we do code layout. This means for Cmm code where one branch is never executed the branch that's never executed might be the only one considered relevant. This happens for example for this stack check: {{{ if ((Sp + -40) >= SpLim) (likely: True) goto c5PS; else goto c5Q3; }}} = Potential gains If nothing else this would allow some Cmm optimizations which are made useless by the current code layout algorithm. I have hit cases where the block layout algorithm turned optimizations into pessimizations of 2-5% by pulling obvious loops apart. Given that the scenario that leads to loops being pulled apart doesn't seem that unlikely I assume some loops are also pulled apart like this in HEAD. Cleaning this up ideally would also make it possible to move blocks which are not part of loops out of them. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Besides the usual (Aigner Guides, Intel Optimization Manuals) here are a few more references which should make a good starting point to explore this space: [1] Thomas Ball, James R. Larus. Branch Prediction for Free (https://do i.org/10.1145/173262.155119) [2] Hans Wennborg. The recent switch lowering improvements. (http://llv m.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf) See also: http s://www.youtube.com/watch?v=gMqSinyL8uk [3] James E. Smith. A study of branch prediction strategies (https://dl .acm.org/citation.cfm?id=801871) [4] http://llvm.org/doxygen/MachineBlockPlacement_8cpp_source.html [5] Karl Pettis, Robert C. Hansen. Profile guided code positioning. (ht tps://doi.org/10.1145/93542.93550) [6] Hashemi et al. Efficient procedure mapping using cache line coloring (https://doi.org/10.1145/258915.258931) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by jstolarek): * related: => #8082 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by jmct): * cc: jmct (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * owner: (none) => AndreasK -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726 Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * differential: => Phab:D4726 * related: #8082 => #8082 #14672 Comment: I've wrote up a patch implementing an experimental code layout algorithm, which hopefully is easy to adjust to take advantage of additional information about control flow. Code is at Phab:D4726. As it stands: * It works (only) on x64. * Performance is around the same. (Better but within the margin of error). * Compiler performance is slightly worse. There are a lot of knobs to tune this approach further but verifying any results is tedious so I stopped eventually. For now I hope to use this eventually as a substrate for the work on #14672. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Ater a bit of tweaking runtime difference is at -0.5%. The next step is to verify if this holds up on a different machines as well. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:5 AndreasK]:
I've wrote up a patch implementing an experimental code layout algorithm, which hopefully is easy to adjust to take advantage of additional information about control flow.
Code is at Phab:D4726. As it stands:
* It works (only) on x64. * Performance is around the same. (Better but within the margin of error). * Compiler performance is slightly worse.
There are a lot of knobs to tune this approach further but verifying any results is tedious so I stopped eventually.
For now I hope to use this eventually as a substrate for the work on #14672.
I've also created a wiki page here https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CodeLayout with more details.
I've came over a comment here: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/NCG#Redun...
The allocator goes to considerable computational expense to construct all the flow edges in the group of instructions it's allocating for, by using the insnFuture function in the Instr pseudo-abstract type.
Since my patch already constructs and maintains a CFG maybe we can reuse this. Although currently it's on the basic block not instruction level and I haven't checked if the comment is still relevant either. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Snapshot of the performance gains by the linked patch. (Speedup - so **higher is better**). Last I checked nofib performance was neither here no there in terms of runtime. (Slightly slower on sandy bridge, slightly faster on haswell). Compile time with the patch went down on both though. || Library ||Sandy Bridge (Linux) || Haswell (Linux) || Skylake (Win) || || aeson || +2.6%/+2.8% || +2.3%/-0% || +1.2%/+1.0% || containers || +1.4%/+1.2% || +1.1%/+1.7% || +1.7%/+1.0% || megaparsec || +3.2%/+3.6% || +13.6%/+13.7% 1) || +8.0%/+6.6% || perf-xml || +0.2%/+0.0% || || +1.1%/+0.8% || text || +3.0%/+3.0% || NA || NA || Vector || +2.5%/+3.6% || +2.5%/+2.9% || +1.3%/3.8% 1 ) Inflated by background noise. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726 Wiki Page: | -------------------------------------+------------------------------------- Comment (by kavon): Dropping a recent paper on this topic that uses dynamic profiling, though there might be some adaptable ideas / further reading in here: https://arxiv.org/abs/1810.00905 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Dropping a recent paper on this topic that uses dynamic profiling,
#15124: Improve block layout for the NCG -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler (NCG) | Version: 8.2.2 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:10 kavon]: though there might be some adaptable ideas / further reading in here: https://arxiv.org/abs/1810.00905 Cool idea! I don't know how much Haskell could would benefit comperatively here as I would assume info tables would get in the way of fall throughs between functions. But might still be worth doing eventually. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15124#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15124: Improve block layout for the NCG
-------------------------------------+-------------------------------------
Reporter: AndreasK | Owner: AndreasK
Type: task | Status: new
Priority: normal | Milestone: 8.8.1
Component: Compiler (NCG) | Version: 8.2.2
Resolution: | Keywords: CodeGen
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: #8082 #14672 | Differential Rev(s): Phab:D4726
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Andreas Klebinger
participants (1)
-
GHC