[GHC] #12232: Opportunity to do better in register allocations

#12232: Opportunity to do better in register allocations -------------------------------------+------------------------------------- Reporter: harendra | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (NCG) | Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: #12231 Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- [https://github.com/harendra-kumar/unicode-transforms/tree/GHC_PERF_ISSUE This github repo] demonstrates that introduction of one "if" condition which is never taken, degrades the performance of the fast path loop in this particular use case by around 70%, from 2.6ms to 4.4 ms. On investigation I found two main issues. See #12231(ticket) regarding rdeundant allocations. This one is about the other issue regarding register allocations. [https://github.com/harendra-kumar/unicode- transforms/commit/261dbb35e753376e4ee160dab556051738a0be50#diff- 4be4240cb00d1b13a038d1d45fbae235 See this assembly diff] of instructions executed in both cases. The repo has both versions to play with and simpl/stg/cmm/asm are checked in. The first file in the diff is with the if condition and the second one without. To summarize, the introduction of "if" condition adds the following register manipulations before and after the main logic: {{{ # reassignments # rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx => 0x408d7e: mov %rbx,0x90(%rsp) => 0x408d86: mov %rcx,%rbx => 0x408d89: mov %rdx,%rcx => 0x408d8c: mov %rsi,%rdx => 0x408d8f: mov %rdi,%rsi => 0x408d92: mov %r8,%rdi => 0x408d95: mov %r9,%r8 => 0x408d98: mov %r10,%r9 => 0x408d9b: mov 0x90(%rsp),%r10 . . . loop logic here which uses only %rax, %r10 and %r9 . . . . # _n4s8: # shuffle back to original assignments => 0x4090dc: mov %r14,%r11 => 0x4090df: mov %r9,%r10 => 0x4090e2: mov %r8,%r9 => 0x4090e5: mov %rdi,%r8 => 0x4090e8: mov %rsi,%rdi => 0x4090eb: mov %rdx,%rsi => 0x4090ee: mov %rcx,%rdx => 0x4090f1: mov %rbx,%rcx => 0x4090f4: mov %rax,%rbx => 0x4090f7: mov 0x88(%rsp),%rax => 0x4090ff: jmpq 0x408d2a }}} The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used. My suspicion is that this allocation is being done en bloc keeping several paths in mind, just like the heap allocations I reported in the other issue. Some questions to be answered: * Can we make this smarter? * Will `-fregs-graph` do a better job and if not what improvements can be done to graph coloring allocator to do a good job here? I have try that option yet. LLVM assembly trace is also available in the repo. == References == * [1] mail thread: https://mail.haskell.org/pipermail/ghc- devs/2016-June/012243.html * [2] github code repo: https://github.com/harendra-kumar/unicode- transforms/tree/GHC_PERF_ISSUE -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12232: Opportunity to do better in register allocations -------------------------------------+------------------------------------- Reporter: harendra | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #12231 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by harendra: @@ -1,4 +1,4 @@ - [https://github.com/harendra-kumar/unicode-transforms/tree/GHC_PERF_ISSUE - This github repo] demonstrates that introduction of one "if" condition - which is never taken, degrades the performance of the fast path loop in - this particular use case by around 70%, from 2.6ms to 4.4 ms. + [https://github.com/harendra-kumar/unicode-transforms/tree/ghc- + trac-12232-better This github repo] demonstrates that introduction of one + "if" condition which is never taken, degrades the performance of the fast + path loop in this particular use case by around 70%, from 2.6ms to 4.4 ms. @@ -15,1 +15,5 @@ - if condition and the second one without. + if condition and the second one without. Look for git tags + [https://github.com/harendra-kumar/unicode-transforms/tree/ghc- + trac-12232-better ghc-trac-12232-better] and [https://github.com/harendra- + kumar/unicode-transforms/tree/ghc-trac-12232-worse ghc-trac-12232-worse] + for the two versions. @@ -74,1 +78,1 @@ - transforms/tree/GHC_PERF_ISSUE + transforms/tree/ghc-trac-12232-better New description: [https://github.com/harendra-kumar/unicode-transforms/tree/ghc- trac-12232-better This github repo] demonstrates that introduction of one "if" condition which is never taken, degrades the performance of the fast path loop in this particular use case by around 70%, from 2.6ms to 4.4 ms. On investigation I found two main issues. See #12231(ticket) regarding rdeundant allocations. This one is about the other issue regarding register allocations. [https://github.com/harendra-kumar/unicode- transforms/commit/261dbb35e753376e4ee160dab556051738a0be50#diff- 4be4240cb00d1b13a038d1d45fbae235 See this assembly diff] of instructions executed in both cases. The repo has both versions to play with and simpl/stg/cmm/asm are checked in. The first file in the diff is with the if condition and the second one without. Look for git tags [https://github.com/harendra-kumar/unicode-transforms/tree/ghc- trac-12232-better ghc-trac-12232-better] and [https://github.com/harendra- kumar/unicode-transforms/tree/ghc-trac-12232-worse ghc-trac-12232-worse] for the two versions. To summarize, the introduction of "if" condition adds the following register manipulations before and after the main logic: {{{ # reassignments # rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx => 0x408d7e: mov %rbx,0x90(%rsp) => 0x408d86: mov %rcx,%rbx => 0x408d89: mov %rdx,%rcx => 0x408d8c: mov %rsi,%rdx => 0x408d8f: mov %rdi,%rsi => 0x408d92: mov %r8,%rdi => 0x408d95: mov %r9,%r8 => 0x408d98: mov %r10,%r9 => 0x408d9b: mov 0x90(%rsp),%r10 . . . loop logic here which uses only %rax, %r10 and %r9 . . . . # _n4s8: # shuffle back to original assignments => 0x4090dc: mov %r14,%r11 => 0x4090df: mov %r9,%r10 => 0x4090e2: mov %r8,%r9 => 0x4090e5: mov %rdi,%r8 => 0x4090e8: mov %rsi,%rdi => 0x4090eb: mov %rdx,%rsi => 0x4090ee: mov %rcx,%rdx => 0x4090f1: mov %rbx,%rcx => 0x4090f4: mov %rax,%rbx => 0x4090f7: mov 0x88(%rsp),%rax => 0x4090ff: jmpq 0x408d2a }}} The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used. My suspicion is that this allocation is being done en bloc keeping several paths in mind, just like the heap allocations I reported in the other issue. Some questions to be answered: * Can we make this smarter? * Will `-fregs-graph` do a better job and if not what improvements can be done to graph coloring allocator to do a good job here? I have try that option yet. LLVM assembly trace is also available in the repo. == References == * [1] mail thread: https://mail.haskell.org/pipermail/ghc- devs/2016-June/012243.html * [2] github code repo: https://github.com/harendra-kumar/unicode- transforms/tree/ghc-trac-12232-better -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12232: Opportunity to do better in register allocations -------------------------------------+------------------------------------- Reporter: harendra | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #12231 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tjakway): * cc: tjakway (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12232: Opportunity to do better in register allocations -------------------------------------+------------------------------------- Reporter: harendra | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #12231 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by jberryman): I just added a similar (I think) issue: #13094 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12232: Opportunity to do better in register allocations -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #12231 #14672 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * related: #12231 => #12231 #14672 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#12232: Opportunity to do better in register allocations -------------------------------------+------------------------------------- Reporter: harendra | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #12231 #14672 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * keywords: => CodeGen -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC