
#8578: Improvements to SpinLock implementation
-------------------------------------+------------------------------------
        Reporter:  parcs             |              Owner:  parcs
            Type:  task              |             Status:  patch
        Priority:  normal            |          Milestone:
       Component:  Runtime System    |            Version:  7.7
      Resolution:                    |           Keywords:
Operating System:  Unknown/Multiple  |       Architecture:  Unknown/Multiple
 Type of failure:  None/Unknown      |         Difficulty:  Unknown
       Test Case:                    |         Blocked By:
        Blocking:                    |    Related Tickets:
-------------------------------------+------------------------------------

Comment (by thoughtpolice):

Very good stuff, Patrick - it looks nice to me, and I can see how it'd
help scalability.

From a quick glance, one question I have is: why are we using
`busy_wait_nop` in the wait loop, which just becomes `rep; nop` on x86?
For spinlock implementations, Intel at least has a `PAUSE` instruction,
which specifically informs the CPU that this is a spin-wait loop; that
lets it optimize cache and memory accesses where possible, avoiding the
memory-ordering violations that would otherwise require the CPUs to
synchronize. The difference can be quite dramatic on older processors, I
believe (I wouldn't be surprised if `rep; nop` devolved into `pause` on
some microarchitectures, but that's probably not guaranteed.) `PAUSE` may
also actually *delay* the CPU so this optimization can happen, whereas
`rep; nop` will merely run as fast as possible.

So I'd suggest replacing `busy_wait_nop` on x86 with an `_mm_pause` macro
built on GCC inline assembly, something like:

{{{
/* Hint to the CPU that we are in a spin-wait loop; the "memory"
   clobber also stops the compiler caching the lock word across it. */
#define _mm_pause() \
  ({ \
    __asm__ __volatile__("pause" ::: "memory"); \
  })
}}}

Finally, one thing I did for fun a while back was write an accelerated
spinlock implementation using the new TSX extensions in Intel processors.
In my very non-scientific experiments, this essentially eliminated any
lock contention (or even any need to take a lock) in a lot of cases. I
wonder if it's worth adding here to see if there's any difference in
throughput or contention. You can find the code here (pinpointed to the
code in question):

https://gist.github.com/thoughtpolice/7123036#file-spinlock-rtm-c-L230

Note my example does *not* use the lock-elision prefixes for `xchg`
(which are backwards compatible), but instead uses the new TSX RTM
instructions to wrap lock/unlock in a transaction, allowing speculative
elision. In practice this works just as well, and RTM is more flexible.
It would, however, require checking `cpuid` to see if TSX is available
and, if so, dynamically replacing the code path, as I have done. On
Haswell, this indirect branch would probably be predicted so well that
its overhead is basically zero, but I don't know what it might do on
e.g. AMD machines in terms of prediction or throughput.

In any case - Patrick, thanks. If you'd like to test the elision thing,
or the impact of `_mm_pause`, I have a powerful Haswell server available
(8 hardware threads/32 GB RAM) that you can play around with for testing.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8578#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
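For illustration, here is a minimal sketch (not the RTS's actual code) of
how such a pause hint typically sits inside a test-and-test-and-set spin
loop. `SpinLock`, `spin_lock`, and `spin_unlock` are hypothetical names,
and the GCC `__sync` builtins stand in for whatever atomics the RTS
actually uses:

{{{
#include <stdint.h>

typedef volatile uint32_t SpinLock;   /* hypothetical: 0 = free, 1 = held */

#define _mm_pause() \
  ({ \
    __asm__ __volatile__("pause" ::: "memory"); \
  })

static inline void spin_lock(SpinLock *lock)
{
    /* Test-and-test-and-set: only retry the atomic exchange once the
       lock looks free, and pause while spinning on the plain read. */
    while (__sync_lock_test_and_set(lock, 1)) {
        while (*lock != 0)
            _mm_pause();
    }
}

static inline void spin_unlock(SpinLock *lock)
{
    __sync_lock_release(lock);   /* store 0 with release semantics */
}
}}}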
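And a rough sketch of the RTM elision pattern the gist describes, building
on the hypothetical `spin_lock`/`spin_unlock` above. This assumes GCC's
RTM intrinsics from `<immintrin.h>` (compile with `-mrtm`); as the comment
notes, real code would first check `cpuid` for TSX support and use the
plain lock path otherwise:

{{{
#include <immintrin.h>   /* _xbegin / _xend / _xabort */

static inline void spin_lock_elided(SpinLock *lock)
{
    /* Run the critical section as a transaction with the lock word
       merely *read*, never written, so concurrent elided sections
       don't conflict with each other. */
    if (_xbegin() == _XBEGIN_STARTED) {
        if (*lock == 0)
            return;        /* lock is in our read-set; elision in effect */
        _xabort(0xff);     /* lock actually held: abort the transaction */
    }
    /* Transaction aborted (e.g. lock was busy): take the lock for real. */
    spin_lock(lock);
}

static inline void spin_unlock_elided(SpinLock *lock)
{
    if (*lock == 0)
        _xend();           /* still transactional: commit */
    else
        spin_unlock(lock); /* we really hold the lock: release it */
}
}}}

If another thread writes the lock word while a transaction is in flight,
that write conflicts with our read-set and the transaction aborts, so
correctness falls back to the ordinary lock; a production version would
also retry the transaction a few times before giving up and serializing.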