
#8578: Improvements to SpinLock implementation
-------------------------------------+------------------------------------
        Reporter:  parcs             |              Owner:  parcs
            Type:  task              |             Status:  patch
        Priority:  normal            |          Milestone:
       Component:  Runtime System    |            Version:  7.7
      Resolution:                    |           Keywords:
Operating System:  Unknown/Multiple  |       Architecture:  Unknown/Multiple
 Type of failure:  None/Unknown      |         Difficulty:  Unknown
       Test Case:                    |         Blocked By:
        Blocking:                    |    Related Tickets:
-------------------------------------+------------------------------------

Comment (by thoughtpolice):

Very good stuff, Patrick - it looks nice to me, and I can see how it'd
help scalability.

From a quick glance, one question I have is: why are we using
`busy_wait_nop` in the wait loop, which just becomes `rep; nop` on x86?
For spinlock implementations, Intel at least has a `PAUSE` instruction,
which specifically informs the CPU that this is a spin-wait loop; that
lets it optimize cache and memory accesses where possible, avoiding the
memory-ordering violations that would otherwise require the CPUs to
synchronize. The difference can be quite dramatic on older processors, I
believe (I wouldn't be surprised if `rep; nop` devolved into `pause` on
some microarchitectures, but that's probably not guaranteed.) `PAUSE` may
also actually *delay* the CPU so this optimization can happen, whereas
`rep; nop` will merely run as fast as possible.

So I'd suggest replacing `busy_wait_nop` on x86 with an `_mm_pause` macro
built on GCC inline assembly, something like:

{{{
/* Hint to the CPU that we are in a spin-wait loop; the "memory"
   clobber also stops the compiler caching the lock word across it. */
#define _mm_pause() \
  ({ \
    __asm__ __volatile__("pause" ::: "memory"); \
  })
}}}

Finally, one thing I did for fun a while back was write an accelerated
spinlock implementation using the new TSX extensions in Intel processors.
In my very non-scientific experiments, this essentially eliminated any
lock contention (or even any need to take a lock) in a lot of cases. I
wonder if it's worth adding here to see if there's any difference in
throughput or contention. You can find the code here (pinpointed to the
code in question):

https://gist.github.com/thoughtpolice/7123036#file-spinlock-rtm-c-L230

Note my example does *not* use the lock-elision prefixes for `xchg`
(which are backwards compatible), but instead uses the new TSX RTM
instructions to wrap lock/unlock in a transaction, allowing speculative
elision. In practice this works just as well, and RTM is more flexible.
It would, however, require checking `cpuid` to see if TSX is available
and, if so, dynamically replacing the code path, as I have done. On
Haswell, this indirect branch would probably be predicted so well that
its overhead is basically zero, but I don't know what it might do on
e.g. AMD machines in terms of prediction or throughput.

In any case - Patrick, thanks. If you'd like to test the elision thing,
or the impact of `_mm_pause`, I have a powerful Haswell server available
(8 hardware threads/32 GB RAM) that you can play around with for testing.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8578#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
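For illustration, here is a minimal sketch (not the RTS's actual code) of
how such a pause hint typically sits inside a test-and-test-and-set spin
loop. `SpinLock`, `spin_lock`, and `spin_unlock` are hypothetical names,
and the GCC `__sync` builtins stand in for whatever atomics the RTS
actually uses:

{{{
#include <stdint.h>

typedef volatile uint32_t SpinLock;   /* hypothetical: 0 = free, 1 = held */

#define _mm_pause() \
  ({ \
    __asm__ __volatile__("pause" ::: "memory"); \
  })

static inline void spin_lock(SpinLock *lock)
{
    /* Test-and-test-and-set: only retry the atomic exchange once the
       lock looks free, and pause while spinning on the plain read. */
    while (__sync_lock_test_and_set(lock, 1)) {
        while (*lock != 0)
            _mm_pause();
    }
}

static inline void spin_unlock(SpinLock *lock)
{
    __sync_lock_release(lock);   /* store 0 with release semantics */
}
}}}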
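And a rough sketch of the RTM elision pattern the gist describes, building
on the hypothetical `spin_lock`/`spin_unlock` above. This assumes GCC's
RTM intrinsics from `<immintrin.h>` (compile with `-mrtm`); as the comment
notes, real code would first check `cpuid` for TSX support and use the
plain lock path otherwise:

{{{
#include <immintrin.h>   /* _xbegin / _xend / _xabort */

static inline void spin_lock_elided(SpinLock *lock)
{
    /* Run the critical section as a transaction with the lock word
       merely *read*, never written, so concurrent elided sections
       don't conflict with each other. */
    if (_xbegin() == _XBEGIN_STARTED) {
        if (*lock == 0)
            return;        /* lock is in our read-set; elision in effect */
        _xabort(0xff);     /* lock actually held: abort the transaction */
    }
    /* Transaction aborted (e.g. lock was busy): take the lock for real. */
    spin_lock(lock);
}

static inline void spin_unlock_elided(SpinLock *lock)
{
    if (*lock == 0)
        _xend();           /* still transactional: commit */
    else
        spin_unlock(lock); /* we really hold the lock: release it */
}
}}}

If another thread writes the lock word while a transaction is in flight,
that write conflicts with our read-set and the transaction aborts, so
correctness falls back to the ordinary lock; a production version would
also retry the transaction a few times before giving up and serializing.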