[GHC] #8578: Improvements to SpinLock implementation

#8578: Improvements to SpinLock implementation ------------------------------------+------------------------------------- Reporter: parcs | Owner: parcs Type: task | Status: new Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Keywords: | Operating System: Unknown/Multiple Architecture: Unknown/Multiple | Type of failure: None/Unknown Difficulty: Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | ------------------------------------+------------------------------------- The SpinLock implementation has a number of deficiencies. Here is a pair of patches that improves its implementation. Let me know what you think. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Changes (by parcs): * status: new => patch -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by parcs): According to a C microbenchmark, the new spinlock code is much more scalable than the old code. However, this scalability improvement remains to be seen in a GHC program. Is there a program one can test that is particularly sensitive to the scalability of the RTS's spinlocks? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar): The patch looks like a nice improvement, thanks! Before committing I want to run some benchmarks. The usual ones I use are `nofib/parallel` running on various numbers of cores, at least 8. I can try these sometimes next week. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by thoughtpolice): Very good stuff Patrick looks nice to me - and I can see how it'd help scalability. I did a quick glance and one question I have is why are we using `busy_wait_nop` in the wait loop, which just becomes `rep; nop` on x86? For spinlock implementations, Intel at least has a `PAUSE` instruction which specifically informs the CPU that this is a spinlock wait-loop, which allows it to optimize cache and memory accesses if possible to avoid memory ordering violations, requiring the CPUs to synchronize. This can be quite dramatic on older processors I believe (I wouldn't be surprised if `rep; nop` devolved in `pause` on some microarchs, but probably not fully guaranteed.) `PAUSE` might also actually *delay* the CPU for this optimization to happen, where as `rep nop` will merely run as fast as possible. So I'd suggest replacing `busy_wait_nop` on x86 with something like `_pause_mm` from GCC: {{{ #define _mm_pause() \ ({ \ __asm__ __volatile__("pause" ::: "memory"); \ }) }}} Finally, one thing I did for fun a while back was write an accelerated spinlock implementation using the new TSX extensions in Intel processors. In my very non-scientific experiments, this essentially eliminated any lock contention (or even any need to take a lock) in a lot of cases. I wonder if it's worth adding here to see if there's any difference in throughput or contention. You can find the code here (pinpointed to the code in question): https://gist.github.com/thoughtpolice/7123036#file-spinlock-rtm-c-L230 Note my example does *not* use lock elision prefixes for `xchg` (which are backwards compatible,) but instead uses the new TSX RTM instructions to wrap locks/unlocks in a transaction, allowing speculative elision. In practice this works just as well and RTM is more flexible. It would however require checking `cpuid` to see if TSX is available and if so, dynamically replacing the code path as I have done. On Haswell, this indirect-branch would probably be predicted so well its overhead is basically zero, but I don't know what it might do to e.g. AMD machines in terms of prediction or throughput. In any case - Patrick, thanks. If you'd like to test the elision thing, or the impact of `_mm_pause`, I have a powerful Haswell server available (8 hardware threads/32GB RAM) you can play around with for testing. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar): "rep; nop" ''is'' the pause instruction :-) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by thoughtpolice): Ah, yes, you're right - I did some more digging. The tricky bit is that `rep; nop` and `pause` actually have the same opcodes (`F390`) for backwards compatibility, but on newer CPUs, `rep; nop` is treated magically (allowing other hyperthreads on the same core), and `pause` is simply an alias to make it more clear. In that case, ignore me and my inability to keep up with Intel shenanigans. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: parcs Type: task | Status: patch Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by simonmar): Here are my results with `-N4` on an Intel Core i7-3770 (4 cores, 8 threads). {{{ -------------------------------------------------------------------------------- Program Size Allocs Runtime Elapsed TotalMem -------------------------------------------------------------------------------- blackscholes +0.0% +0.0% -1.7% -2.4% -0.3% coins +0.0% -0.0% +0.4% +1.0% -8.6% gray +0.0% +0.0% +15.1% +14.3% +0.0% mandel +0.0% +0.0% +3.3% +3.3% -0.8% matmult +0.0% +8.1% -2.4% -2.6% +0.0% minimax +0.0% +0.0% -1.3% -1.1% +0.0% nbody +0.0% -6.0% -1.9% 0.06 +0.0% parfib +0.0% +0.1% +16.2% +16.2% +0.0% partree +0.0% -0.0% +1.0% +0.5% -3.0% prsa +0.0% -0.1% +1.1% +0.9% +0.0% queens +0.0% -0.5% -1.3% -0.5% +7.1% ray +0.0% -0.3% -0.4% -0.5% +0.0% sumeuler +0.0% +0.0% +1.0% +1.0% +0.0% transclos +0.0% +0.0% +1.2% +1.4% +0.0% -------------------------------------------------------------------------------- Min +0.0% -6.0% -2.4% -2.6% -8.6% Max +0.0% +8.1% +16.2% +16.2% +7.1% Geometric Mean +0.0% +0.1% +2.0% +2.3% -0.4% }}} Not good! Two programs (gray and parfib) are significantly worse. The effect is real, here is the timing info for parfib before and after: {{{ 5.70user 0.00system 0:01.43elapsed 397%CPU (0avgtext+0avgdata 20816maxresident)k 6.52user 0.00system 0:01.64elapsed 397%CPU (0avgtext+0avgdata 21568maxresident)k }}} I wonder whether not using a locked instruction in the spinlock might cause the loop to spin for longer, because it takes longer for the memory write to reach the core that is waiting for it? Someone could probably dig into this further with perf. But the lesson here, as usual, is to always benchmark and don't just assume that because it looks good it will work in practice! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------ Reporter: parcs | Owner: Type: task | Status: new Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Changes (by simonmar): * owner: parcs => * status: patch => new Comment: Moving out of the patch state pending further investigation. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------- Reporter: parcs | Owner: Type: task | Status: new Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thomie): * failure: None/Unknown => Runtime performance bug -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------- Reporter: parcs | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by dfeuer): Since thoughtpolice mentioned them, I'm wondering if there are specific places we should look at as HLE (or perhaps RTM) candidates. I don't ''think'' we can rely on the system glibc to use them to implement mutexes unless we ask for that specifically in some fashion. I'd expect very small and short transactions (most but not all of what we do) to benefit. Has anyone looked into that yet? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8578: Improvements to SpinLock implementation -------------------------------------+------------------------------------- Reporter: parcs | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Runtime System | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by dfeuer): * cc: dfeuer (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8578#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC