[GHC] #16243: Improve fregs-graph.

27 Jan 2019

      #16243: Improve fregs-graph.
-------------------------------------+-------------------------------------
           Reporter:  AndreasK       |             Owner:  (none)
               Type:  task           |            Status:  new
           Priority:  normal         |         Milestone:
          Component:  Compiler       |           Version:  8.6.3
  (CodeGen)                          |
           Keywords:  fregs-graph    |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 I'm creating this as a way to keep track of shortcomings of the current
 implementation.

 Some of these are directly taken from TODOS, other are my own opinions.

 = Things that could be looked at.

 == Compress spill slots.

 Currently we assign each vreg a fixed spill slot. This is terrible for
 cache performance.

 It's possible to end up with sequences like this:

 {{{#!asm
 block_ca01_info:
 _ca01:
         movq 16(%rbp),%r14
         movq 24(%rbp),%rax
         movq %rax,144(%rsp)
         movq 32(%rbp),%rsi
         movq 40(%rbp),%rdi
         movq 64(%rbp),%rbx
         movq 56(%rbp),%rax
         movq %rax,208(%rsp)
         movq 48(%rbp),%rax
         movq %rax,168(%rsp)
         movq 8(%rbp),%rax
         movq %rax,216(%rsp)
         addq $8,%rbp
 }}}

 Which splits spill slots over multiple cache lines increasing cache
 pressure.

 Things possibly worth trying:

 * Assign spill slots in the order they appear - hopefully reducing the
 average spread within code blocks.
 * Dynamically assign vregs to spill slots.

 For the later if we have large independent subgraphs we will sometimes get
 away with using far fewer slots.
 The reason simply being that non overlapping vregs can share spill slots.

 Doing this in an efficient manner however isn't easy.

 == We sometimes load + spill vregs when we could use memory addressing.

 One example I've seen a few times in nofib is this:
 {{{
         movq 208(%rsp),%rax
         incq %rax
         movq %rax,208(%rsp)
 }}}

 Which we should replace with a single `inc` operating on memory.

 == Lift the stack size limit.

 There is already a patch
 [https://gitlab.haskell.org/ghc/ghc/merge_requests/219 in flight].

 == Revisit spill costs.

 Currently the allocator always spills the longest live range. Which often
 works but is certainly not a great metric.

 Ideas:

 === Consider loop membership.

 At least for x86/64 this would be easy to achieve on basic block
 granularity using the CFG that is also used for block layout.

 === Revisit chaitins cost model.

 There have been major changes to cpus (amd64 becoming the norm) and the
 backend
 so at a minimum this cost model should be reevaluated.

 == Try splitting long live ranges instead of simply spilling the complete
 live range.

 After all long live ranges were the original reason to switch to a life
 range only based cost heuristic.

 == Use a different spill algorithm altogether.

 Maybe Chows live range splitting based approach would be an option.
 See "The priority based coloring approach to register allocation" for
 details.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16243
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler