[GHC] #12232: Opportunity to do better in register allocations

26 Jun 2016

      #12232: Opportunity to do better in register allocations
-------------------------------------+-------------------------------------
           Reporter:  harendra       |             Owner:
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:
          Component:  Compiler       |           Version:  8.0.1
  (NCG)                              |
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  Runtime
  Unknown/Multiple                   |  performance bug
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:  #12231
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 [https://github.com/harendra-kumar/unicode-transforms/tree/GHC_PERF_ISSUE
 This github repo] demonstrates that introduction of one "if" condition
 which is never taken, degrades the performance of the fast path loop in
 this particular use case by around 70%, from 2.6ms to 4.4 ms.

 On investigation I found two main issues. See #12231(ticket) regarding
 rdeundant allocations. This one is about the other issue regarding
 register allocations.

 [https://github.com/harendra-kumar/unicode-
 transforms/commit/261dbb35e753376e4ee160dab556051738a0be50#diff-
 4be4240cb00d1b13a038d1d45fbae235 See this assembly diff] of instructions
 executed in both cases. The repo has both versions to play with and
 simpl/stg/cmm/asm are checked in. The first file in the diff is with the
 if condition and the second one without.

 To summarize, the introduction of "if" condition adds the following
 register manipulations before and after the main logic:

 {{{
 # reassignments
 # rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx
 => 0x408d7e: mov    %rbx,0x90(%rsp)
 => 0x408d86: mov    %rcx,%rbx
 => 0x408d89: mov    %rdx,%rcx
 => 0x408d8c: mov    %rsi,%rdx
 => 0x408d8f: mov    %rdi,%rsi
 => 0x408d92: mov    %r8,%rdi
 => 0x408d95: mov    %r9,%r8
 => 0x408d98: mov    %r10,%r9
 => 0x408d9b: mov    0x90(%rsp),%r10
 .
 .
 .
 loop logic here which uses only %rax, %r10 and %r9 .
 .
 .
 .
 # _n4s8:
 # shuffle back to original assignments
 => 0x4090dc: mov    %r14,%r11
 => 0x4090df: mov    %r9,%r10
 => 0x4090e2: mov    %r8,%r9
 => 0x4090e5: mov    %rdi,%r8
 => 0x4090e8: mov    %rsi,%rdi
 => 0x4090eb: mov    %rdx,%rsi
 => 0x4090ee: mov    %rcx,%rdx
 => 0x4090f1: mov    %rbx,%rcx
 => 0x4090f4: mov    %rax,%rbx
 => 0x4090f7: mov    0x88(%rsp),%rax

 => 0x4090ff: jmpq   0x408d2a
 }}}

 The registers seem to be getting reassigned here, data flowing from one to
 the next. In this particular path a lot of these register movements seem
 unnecessary and are only undone at the end without being used. My
 suspicion is that this allocation is being done en bloc keeping several
 paths in mind, just like the heap allocations I reported in the other
 issue.

 Some questions to be answered:
 * Can we make this smarter?
 * Will `-fregs-graph` do a better job and if not what improvements can be
 done to graph coloring allocator to do a good job here? I have try that
 option yet.

 LLVM assembly trace is also available in the repo.

 == References ==
 * [1] mail thread: https://mail.haskell.org/pipermail/ghc-
 devs/2016-June/012243.html
 * [2] github code repo: https://github.com/harendra-kumar/unicode-
 transforms/tree/GHC_PERF_ISSUE

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12232
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler