[GHC] #8972: Investigate adding fast compare-and-swap Int type/primops

#8972: Investigate adding fast compare-and-swap Int type/primops ------------------------------------+------------------------------------- Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Keywords: | Operating System: Unknown/Multiple Architecture: Unknown/Multiple | Type of failure: None/Unknown Difficulty: Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | ------------------------------------+------------------------------------- I've received reports that using `IORef Int` and `atomicModifyIORef` to implement an atomic counter in the ekg package has become a bottleneck for some of its users. These users update the counter thousands of times per second, using multiple threads. I will investigate whether adding a dedicated atomic `Int` reference type will offer significant speed improvements. Such a type can also be used to implement cheaper locks (by using bits in the int to represent different lock states, such as reader/write locks.) Lets call this new type `AtomicIntRef` for now. This new type needs to support at least these functions: {{{#!haskell add :: AtomicIntRef -> Int -> IO Int set :: AtomicIntRef -> Int -> IO () get :: AtomicIntRef -> IO Int }}} `add` would be implemented using the `lock` and `xaddq` instructions. `set` and `get` are just simple loads and stores on x86, as these are atomic. We might also want to consider having other functions, such as a `cas`. Furthermore, there are subtleties with memory barriers that might motivate having barrier/barrier-less versions of some functions. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by tibbe): I've added a little benchmark that shows that even calling out to C can speed things up a lot. Using `IORef`: {{{ $ ghc -O2 Bench.hs counter.c -DOLD=1 ... $ time ./Bench real 0m1.698s user 0m1.670s sys 0m0.020s }}} Calling out to C: {{{ $ ghc -O2 Bench.hs counter.c ... $ time ./Bench real 0m0.108s user 0m0.090s sys 0m0.010s }}} That's a ~16x speed-up. Things look similar with the threaded RTS: Using `IORef`: {{{ $ ghc -O2 Bench.hs counter.c -DOLD=1 -threaded ... $ time ./Bench real 0m1.998s user 0m1.980s sys 0m0.010s }}} Calling out to C: {{{ $ ghc -O2 Bench.hs counter.c -threaded $ time ./Bench real 0m0.117s user 0m0.100s sys 0m0.010s }}} With `+RTS -N6` (real hardware cores on the test machine): Using `IORef`: {{{ $ time ./Bench +RTS -N6 real 1m22.565s user 1m23.850s sys 1m11.240s }}} Calling out to C: {{{ $ time ./Bench +RTS -N6 real 0m0.247s user 0m1.340s sys 0m0.010s }}} Atrocious results for the `IORef` solution! It seems like contended `IORef`s don't work well at all. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157,7883 -------------------------------------+------------------------------------ Changes (by carter): * related: => 8157,7883 Comment: unboxed references! Yes, that'd be a good idea. Related tickets are #8157 and #7883 now that 7.8 is cut, i can take some time to work on #7883 and have some hope for getting it merged in. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157, 7883 -------------------------------------+------------------------------------ Changes (by carter): * related: 8157,7883 => 8157, 7883 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157, 7883 -------------------------------------+------------------------------------ Comment (by rrnewton): Update: I confirmed that I get the exact same time for "Data.Atomics.Counter' (atomic-primops) as I do for Johan (tibbe)'s counter.c version of the benchmark here. Johan and I talked about this and it seems like the obvious way to improve performance on this is inline primops for the fetch-and-add (#7883). We confirmed that the MutableByteArray# used by Data.Atomics.Counter is a single contiguous heap object, so the only advantage of a dedicated mutable reference type would be to save the word representing the length. More, generally, I wonder if we can have a user facing interface that makes mutable unboxed data more pleasant. Unboxed vectors are nice, but there's nothing like an (IORef (# Int#, Int# #)), for example. Even though with double-word CAS we could support some interesting data types... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157, 7883 -------------------------------------+------------------------------------ Comment (by jberryman): FWIW... Replying to [comment:1 tibbe]:
Atrocious results for the `IORef` solution! It seems like contended
`IORef`s don't work well at all. This is my experience. The behavior I've seen is interesting; check out the distribution of samples of a test of 100K concurrent `atomicModifyIORef`s, for 2 cores: [http://htmlpreview.github.io/?https://github.com/jberryman/chan- benchmarks/blob/master/reports/ghc-7.6/bench-multi.vars.html#b2] ...and 8 cores: [http://htmlpreview.github.io/?https://github.com/jberryman/chan- benchmarks/blob/master/reports/ghc-7.6_8-core/bench-multi.vars.html#b2] The `atomicModifyIORefCAS` from atomic-primops doesn't display that nasty spread; with the primitive CAS one thread always wins, but it seems like whatever GHC is doing is causing vicious retry loops, which sometimes stay in phase for awhile. And the fetch-and-add-based counter from `atomic-primops` handles contention far and away better than anything else available. From what I remember I couldn't even measure contention effects going from 2 to 8 contending threads (although I did do both tests on two different machines, so...). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157, 7883 -------------------------------------+------------------------------------ Comment (by rrnewton): Thanks for these benchmarks! I suspect the problem with atomicModifyIORef's poor behavior may have to do with blackholes? Tracing ghc events will perturb these workloads a bunch I guess, but maybe threadscope could still tell us something wrt blackhole events. By the way, I started peeking at the benchmark code. I don't have a good sense currently of how many cycles each of the following takes in practice: * between a forkIO and the new thread waking up on another CPU. * between two forkIO's from a parent thread waking up (gang synchrony) * delta between two threads blocked on the same MVar waking up I noticed the third method used in MainN.hs and I think this is very standard. However, I wonder if it is worth having all the test threads busy-wait on a memory location to start with a greater degree of synchrony. That way we can have them run for exactly the same time interval and maximize contention over the whole time period ;-). In any case, these benchmarks are great. I'd like to run these on a few different machines with our nightly benchmarking stuff. I read the README but I can't currently build the benchmarks because I don't have chan- split-fast -- where is that? P.S. We currently accumulate benchmark data into Google Fusion Tables (but we'd like to switch to Big Query or Cloud SQL or something). HSBencher does the upload, but it doesn't yet integrate with criterion well. It's more focused on long-running (parallel) benchmarks rather than statistically significant sampling of short runs. Still, adding support for consuming criterion reports would be great. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

By the way, I started peeking at the benchmark code. I don't have a good sense currently of how many cycles each of the following takes in
#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157, 7883 -------------------------------------+------------------------------------ Comment (by jberryman): Replying to [comment:6 rrnewton]: practice:
* between a forkIO and the new thread waking up on another CPU.
.......
* between two forkIO's from a parent thread waking up (gang synchrony)
Good question! Don't have enough cores on my real machine to look at that right now.
* delta between two threads blocked on the same MVar waking up
I noticed the third method used in MainN.hs and I think this is very standard. However, I wonder if it is worth having all the test threads busy-wait on a memory location to start with a greater degree of synchrony. That way we can have them run for exactly the same time interval and maximize contention over the whole time period ;-).
I think that's a good idea. I started doing that in later tests/benchmarks: pass an `IORef Int` to each forked thread, each thread increments the ref and then busy-waits until the counter reaches the target.
In any case, these benchmarks are great. I'd like to run these on a few
different machines with our nightly benchmarking stuff. I read the README but I can't currently build the benchmarks because I don't have chan- split-fast -- where is that? Sorry, this whole project is really just my personal messy scratchwork; even the README is just personal notes that are out of date anyway, and I should probably just remove it. You'll have a hard time building it for a number of reasons, and might have better luck just copy-pasting what looks interesting. I'm happy to help extract and clean up any of them that you're interested in.
P.S. We currently accumulate benchmark data into Google Fusion Tables
(but we'd like to switch to Big Query or Cloud SQL or something). HSBencher does the upload, but it doesn't yet integrate with criterion well. It's more focused on long-running (parallel) benchmarks rather than statistically significant sampling of short runs. Still, adding support for consuming criterion reports would be great. Oh cool, I'd like to look into how you're doing all that. Yeah I was initially nervous about using criterion for these kinds of benchmarks (involving concurrency) but it's actually worked really well, and seeing the distributions has been really enlightening. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Replying to [comment:6 rrnewton]:
By the way, I started peeking at the benchmark code. I don't have a good sense currently of how many cycles each of the following takes in
#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: 8157, 7883 -------------------------------------+------------------------------------ Comment (by jberryman): Replying to [comment:7 jberryman]: practice:
* between a forkIO and the new thread waking up on another CPU.
.......
Oops sorry! Here I meant to write that I'd sort of benchmarked this, but just added a better test (fork a `putMVar`, then block on `takeMVar` in main thread) which I measure at between 5.4 - 5.8 μs . So as long as you can get your contention test runs into the milliseconds range those effects should be completely hidden. Feel free to email me if this is distracting from this ticket and you want to chat further. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: #8157, #7883 -------------------------------------+------------------------------------ Changes (by hvr): * related: 8157, 7883 => #8157, #7883 * milestone: => 7.10.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: #8157, #7883 -------------------------------------+------------------------------------ Comment (by nominolo): Replying to [comment:6 rrnewton]:
I suspect the problem with atomicModifyIORef's poor behavior may have to do with blackholes? Tracing ghc events will perturb these workloads a bunch I guess, but maybe threadscope could still tell us something wrt blackhole events.
We were affected by this issue and looked at the thread status of all threads and noticed that they all get blocked on a blackhole. We suspect there are two issues here: 1. Poor cache behaviour due to the CAS retry busy loop. 2. All the threads get queued up waiting for the thread that currently owns the blackhole to get scheduled again. The second point is a potential issue for many other applications (e.g,. if a thread is holding on to an MVar). In the operating system world they deal with those kinds of issues either using: - interrupt blocking: i.e., a thread cannot be de-scheduled while holding the lock - time-slice donation / priority boosting: If a thread enters a critical section and still has some alotted compute time left, it "donates" that time to the thread owning the critical section, so that it will make progress and leave the critical section. If you have thread priorities then you can temporarily raise the priority of the thread in the critical section. This rather complicated because priority boosting may have to be recursive (you have to boost to the priority of the highest-priority thread waiting), so not everyone likes this method. For GHC, I think it could be useful to have both mechanisms available. Marking a thread as non-interruptable could be exported from a "Here be dragons" module for using when you know that a thread will hold a lock/MVar only for a very short time. For thunks, when a thread blocks on a blackhole, the scheduler should probably then schedule that thread next. This way, the time spent evaluating a blackhole will be proportional to the number of threads waiting on it. I don't know if the reduced fairness causes issues here, though. You could mark a blackhole as "boosted" meaning, that as soon the blackhole is updated we not just wake up all threads in the run-queue, but also yield to the first thread in the blackhole. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: #8157, #7883 -------------------------------------+------------------------------------ Changes (by simonmar): * cc: simonmar (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: #8157, #7883 -------------------------------------+------------------------------------ Comment (by jberryman): Replying to [comment:10 nominolo]:
For GHC, I think it could be useful to have both mechanisms available. Marking a thread as non-interruptable could be exported from a "Here be dragons" module for using when you know that a thread will hold a lock/MVar only for a very short time.
Would this be something like `mask` but for scheduling? If so I want. From a library-writers point of view this would be a really easy way to get rid of a lot of bad scheduling effects, e.g. in probably every use of `modifyMVar`. Sorry if I'm not understanding. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.9 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: #8157, #7883 -------------------------------------+------------------------------------ Changes (by ihameed): * cc: idhameed@… (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8972: Investigate adding fast compare-and-swap Int type/primops -------------------------------------+------------------------------------ Reporter: tibbe | Owner: tibbe Type: feature request | Status: closed Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.9 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: #8157, #7883 -------------------------------------+------------------------------------ Changes (by tibbe): * status: new => closed * resolution: => fixed Comment: We added atomic primops for byte arrays (which can be used to implement a single cell mutable variable) in d8abf85f8ca176854e9d5d0b12371c4bc402aac3. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8972#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC