[GHC] #8272: testing if SpLim=$rbp and Sp=$rsp changed performance at all

#8272: testing if SpLim=$rbp and Sp=$rsp changed performance at all
----------------------------------------------+----------------------------
 Reporter:     carter                         | Owner:            carter
 Type:         task                           | Status:           new
 Priority:     normal                         | Milestone:        7.10.1
 Component:    Compiler                       | Version:          7.7
 Keywords:     callingConvention              | Operating System: Unknown/Multiple
 Architecture: x86_64 (amd64)                 | Type of failure:  None/Unknown
 Difficulty:   Project (more than a week)     | Test Case:
 Blocked By:                                  | Blocking:
 Related Tickets:                             |
----------------------------------------------+----------------------------

Test whether setting SpLim=$rbp and Sp=$rsp changes performance at all.
This would need a stack check, but then PUSH could be used to spill to
the stack.

Idea via Nathan Howell. At the very least, the x86 PUSH instruction has a
more succinct encoding than MOV.

Worth hacking out to see whether this can measurably shift GHC
performance on nofib.

This would be part of a larger effort to explore ways to improve GHC's
calling convention for modern hardware.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
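
To put a rough number on the "more succinct encoding" remark, here is a minimal sketch comparing hand-assembled x86-64 byte sequences for spilling registers. It assumes the spilled value is in one of the legacy registers (`%rax` here; `%r8`-`%r15` add a REX prefix to PUSH), that offsets fit in an 8-bit displacement, and that the MOV-based sequence pays for a single stack-pointer adjustment. The function names are made up for illustration, and this says nothing about actual run time.

```python
# Hand-assembled x86-64 encodings (AT&T syntax in the comments).
PUSH_RAX = bytes([0x50])                             # push %rax
MOV_RAX_TO_FRAME = bytes([0x48, 0x89, 0x45, 0xF8])   # mov %rax, -8(%rbp)
SUB_8_FROM_RBP = bytes([0x48, 0x83, 0xED, 0x08])     # sub $8, %rbp


def spill_bytes_with_push(n_words):
    """Code bytes to spill n_words registers using PUSH (Sp = %rsp).

    Each PUSH both stores the value and moves the stack pointer.
    """
    return n_words * len(PUSH_RAX)


def spill_bytes_with_mov(n_words):
    """Code bytes to spill n_words registers using MOV (Sp = %rbp).

    Assumption: one MOV per word plus one explicit stack-pointer
    adjustment for the whole group.
    """
    return n_words * len(MOV_RAX_TO_FRAME) + len(SUB_8_FROM_RBP)


if __name__ == "__main__":
    for n in (1, 3, 6):
        print(n, spill_bytes_with_push(n), spill_bytes_with_mov(n))
```

Under these assumptions PUSH saves three bytes per spilled word plus the adjustment instruction; whether that shows up on nofib is exactly what the ticket proposes to measure.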

Comment (by ezyang):

There is a comment in the original STG paper about why PUSH/POP could not
be used for GHC's stack. The circumstances may have changed.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:1

Comment (by carter):

@ezyang, do you mean "Implementing lazy functional languages on stock
hardware: the Spineless Tagless G-machine", S. L. Peyton Jones, Journal of
Functional Programming 2(2), Apr 1992, pp. 127-202?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:2

Comment (by carter):

Hrm... that seems like it might not be the right paper.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:3

Comment (by ezyang):

Sorry, I misspoke. I am actually speaking of CALL/RET, and the problem is
described in "Faster Laziness Using Dynamic Pointer Tagging."

There is one more problem with setting Sp=$rsp, and that is that the
register is already taken: execution maintains both a Haskell stack and a
C stack. The C stack is used for register spills by the register
allocator.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:4

Comment (by carter):

Huh. That doesn't rule out experimenting with it, but it definitely makes
any such experimentation a bit more subtle.

We'd basically need to make sure that there's enough scratch space on the
current GHC stack segment for the register spills. I believe one sound way
to track that conservatively is to compute the maximum size used by live
variables in a function body, with some extra adjustments, roughly. Right?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:5
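
The conservative bound suggested in this comment can be sketched as follows. This is only an illustration of the idea, not how GHC's register allocator or stack checks work: the liveness data is assumed to be given as one set of variable names per program point, sizes are in machine words, and `max_spill_bytes` and `fits_in_segment` are hypothetical names. The "extra adjustments" mentioned above are not modelled.

```python
WORD_BYTES = 8  # x86_64 word size


def max_spill_bytes(live_sets, size_in_words):
    """Conservative scratch-space bound for one function body.

    live_sets:     iterable of sets of variable names, one per program point
    size_in_words: dict mapping each variable name to its size in words

    If every live variable had to be spilled at once, the worst case is the
    program point whose live variables have the largest total size.
    """
    peak_words = max(
        (sum(size_in_words[v] for v in live) for live in live_sets),
        default=0,
    )
    return peak_words * WORD_BYTES


def fits_in_segment(free_bytes, frame_bytes, live_sets, size_in_words):
    """Would the frame plus worst-case spills fit in the free stack space?"""
    needed = frame_bytes + max_spill_bytes(live_sets, size_in_words)
    return needed <= free_bytes


if __name__ == "__main__":
    sizes = {"x": 1, "y": 1, "z": 1, "w": 2}
    live = [{"x"}, {"x", "y"}, {"y", "z", "w"}]
    print(max_spill_bytes(live, sizes))           # peak is y + z + w
    print(fits_in_segment(64, 24, live, sizes))
```

The bound over-approximates, since a real allocator spills only some of the live variables, but it can be folded into the stack check performed on function entry.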

Changes (by carter):

 * difficulty:  Project (more than a week) => Rocket Science

Comment:

Because of how subtle this may be, and how it's moderately likely that the
work will ultimately not work out (though it's worth exploring), I'm
setting the difficulty to "rocket science" :)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:6

Comment (by schyler):

It's possible this could fix #8048 or at least allow better/smarter code
to be generated for spills.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:7

Changes (by schyler):

 * cc: schyler (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:8

#8272: testing if SpLim=$rbp and Sp=$rsp changed performance at all
-------------------------------------+-------------------------------------
 Reporter:  carter                   | Owner:       carter
 Type:      task                     | Status:      new
 Priority:  normal                   | Milestone:
 Component: Compiler                 | Version:     7.7
 Resolution:                         | Keywords:    callingConvention
 Operating System: Unknown/Multiple  | Architecture: x86_64 (amd64)
 Type of failure: Runtime            | Test Case:
   performance bug                   |
 Blocked By:                         | Blocking:
 Related Tickets:                    | Differential Rev(s):
 Wiki Page:                          |
-------------------------------------+-------------------------------------

Changes (by michalt):

 * cc: michalt (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:13

Changes (by michalt):

 * cc: simonmar, rwbarton (added)

Comment:

There's been an interesting discussion about this in:
https://github.com/ghc-proposals/ghc-proposals/pull/17

If people still think this would be a good idea, I'd like to try things
out. But I need some help to get started. :)

First question. It seems to me that there are two slightly different ideas
here:

 1) using `%rsp` for managing the Haskell stack (instead of `%rbp`)
 2) using `call`/`ret`

Is 1) really a strict requirement for 2)? IIUC we're already using the C
stack (`%rsp`) for spilling things during register allocation. The only
problem I can think of is the amount of space for those return addresses.
Am I missing something else, or is this enough to prevent us from using
`call`/`ret`? (Also, somewhat related: wouldn't using `call`/`ret` allow
us to get rid of proc-point splitting for LLVM?)

Second question. Every `CmmCall` that contains `cml_cont` would need to be
compiled into two instructions: `call` and `jmp <block from cml_cont>` (to
jump over the info table that precedes the block where we want to return
to). But that means that the return address no longer points to the info
table. Does anything else depend on this? Is there some easy way to check
that?

Third question. How could this work in the LLVM backend? I don't think
LLVM even allows direct manipulation of the stack pointer. Also, even if
we could manipulate it, I wouldn't be surprised if LLVM wanted to move
things around the stack, which again sounds pretty problematic (for, e.g.,
GC).

PS. CCing simonmar and rwbarton since they were the main contributors to
the linked GHC proposal. (Hope you don't mind!)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:14
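
The concern in the second question can be made concrete with a toy address calculation. This is a sketch under stated assumptions, not a description of GHC's code generator: it assumes the tables-next-to-code layout, where the info table sits immediately before the continuation's entry code, assumes `call` and `jmp` both use the 5-byte rel32 encodings, and picks a made-up 16-byte info table. All function names are hypothetical.

```python
CALL_REL32_BYTES = 5  # e8 + 32-bit displacement
JMP_REL32_BYTES = 5   # e9 + 32-bit displacement


def layout_call_then_jmp(call_addr, info_table_bytes):
    """Addresses for: call f; jmp cont; <info table>; cont: ..."""
    return_addr = call_addr + CALL_REL32_BYTES      # address pushed by `call`
    info_table_addr = return_addr + JMP_REL32_BYTES  # data right after the jmp
    cont_entry = info_table_addr + info_table_bytes  # continuation entry code
    return {
        "return_addr": return_addr,
        "info_table_addr": info_table_addr,
        "cont_entry": cont_entry,
    }


def info_table_before(code_addr, info_table_bytes):
    """Lookup that assumes the info table sits just before code_addr."""
    return code_addr - info_table_bytes


if __name__ == "__main__":
    lay = layout_call_then_jmp(0x1000, 16)
    # Looking backwards from the continuation entry finds the table...
    print(hex(info_table_before(lay["cont_entry"], 16)))
    # ...but looking backwards from the hardware return address does not.
    print(hex(info_table_before(lay["return_addr"], 16)))
```

In this model, any routine that takes the return address found on the stack and looks for the info table at a fixed negative offset would land on unrelated bytes, so the stack-walking code would need a different way to get from a return address to its info table.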

Comment (by bgamari):

In my view (1) is really the more interesting change, as it enables use of
native profiling tools. Currently we are essentially unable to use
systems' native profiling tools to collect callstacks (e.g. `perf record
--call-graph`) since they generally assume that the callstack is tracked
by `%rsp`.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:15

Comment (by simonmar):

@bgamari: doesn't the DWARF stack unwinding support that you've been
working on help with `perf`?

@michalt: I think we have to do (1) in order to do call/ret, because
otherwise the stack would be split over two places, and the RTS would have
a terrible time walking it. Or perhaps I've misunderstood what you mean?

I'm actually not all that enthusiastic about the proposal, having just
re-read https://github.com/ghc-proposals/ghc-proposals/pull/17. The
benefits are small or are actually regressions (code size and the overhead
for jmp after call), and it's a huge upheaval.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:16

Comment (by carter):

@simonmar the stack pointer experiment sans call and ret might still be
worth measuring, yes?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:17

Comment (by michalt):

@simonmar: When you say "terrible time walking the stack", you mean GC,
right? I did some more googling and also found [1], which describes some
of the differences between the Haskell and C stacks (e.g., the Haskell one
is heap allocated and easy to grow, the C one is per capability, etc.). I
agree: all of this makes it sound like, if we want call/ret, we do need
(1).

Also, I still don't really see how we could use `%rsp` in the LLVM backend
without pretty large changes (e.g., using its stackmaps/safepoints).

I have to admit that I'm also less enthusiastic about the whole idea after
looking into it a bit.

[1] https://www.reddit.com/r/haskell/comments/1wm9n4/question_about_stacks_in_ha...

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:18

Comment (by simonmar):

> @simonmar the stack pointer experiment sans call and ret might still be
> worth measuring, yes?

Having actual data would be a good starting point, but I'm not hopeful
that it will be a win.

> @simonmar: When you say "terrible time walking the stack", you mean GC,
> right?

Yes, and exceptions and various other RTS routines that need to understand
the stack.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:19

Changes (by bollu):

 * cc: bollu (added)

Comment:

CCing myself in light of some [https://github.com/bollu/simplexhc-cpp
simplexhc] experiments.

On the Ackermann function,
[https://gist.github.com/bgamari/bd424e82d96ddb7b9e67c5e51cdcc5ec GHC and
clang O3 don't scale linearly].

After teaching simplexhc to generate C-like code (using `call/ret` instead
of custom stack frame management), I began getting C-like performance on
the same examples. Previously, I was around GHC, or somewhat slower.
[https://gist.github.com/bollu/a59c3c8d2c193e2d63409064b6d855c3 Link to
numbers from that experiment here].

I sent out an e-mail to LLVM-dev, asking if it is possible to fake
GHC-like "custom call stack on the heap" within LLVM. If possible, I'd
like to implement that and check for slowdowns in C code. That should
provide some data on the benefits of choosing to use the native call
stack.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8272#comment:20