
Hello GHC devs,

Through the course of reading some recent material posted by Linus Torvalds ^1 ^2, I remembered stumbling upon GHC Trac #3553 ^3 some years ago. Looking past the harsh tone Linus used in his notes, I think he makes some pretty reasonable points about the problems that can be caused by using spinlocks in userspace. Specifically:

- A group of threads yielding while waiting for a spinlock is, in effect, simply trying to coax the scheduler into running the thread that holds the lock. This is much easier for the scheduler to do correctly with a futex or similar, since the scheduler then has some context about which tasks are blocking and waking each other up.
- Apparently the Linux scheduler (and perhaps other schedulers) has some smarts for preserving cache locality when choosing which physical execution units run which tasks, and reordering threads in the run list in an ad-hoc way with sched_yield() interferes with this mechanism.
- Using sched_yield() (or otherwise interfering with run lists at random) can cause disproportionate havoc on NUMA systems, where jostling the lock-holding thread around by burning time slices and yielding is especially costly.

All that said, there is perhaps one concrete advantage (aside from absolute performance) to the current spinlock implementation: the only functionality it requires from the host OS is a yield-like operation. A yielding spinlock occupies a comfortable local optimum, achieving decent performance across platforms with minimal exposure to cross-OS API differences.

Revisiting #3553, it seems the only reason a futex was not used in place of a spinlock with sched_yield() was that, despite all of the points against it, the spinlock code empirically performed better. However, those tests were performed years ago, and the futex implementation in the Linux kernel has changed quite a bit since.

I think there is a healthy amount of empirical evidence from the GHC user community that there are problems afoot with the way the parallel GC does locking, and I wonder whether the bad cases Linus described are playing a role. Advice like "use -qg" or "use -qnM with small M" is often parroted (this Twitter thread ^4 contains both pieces of advice). Furthermore, I would wager that an RTS using futexes for locking, rather than spinning, plays nicer with other CPU-intensive tasks on the same machine. The "scheduler storm" caused by a large number of threads waiting for a spinlock can have a negative impact on neighboring processes; a large number of spinning HECs certainly does no favors for a busy neighboring DB process.

I'm curious whether others have thoughts on reviving the futex investigation. Perhaps the futexes provided by modern Linux are more competitive with the current spinlock implementation, or perhaps better scaling on machines with high core counts is worth some cost. Even if futexes remain handily defeated by yielding spinlocks, it would be a worthy endeavor to explain precisely _why_ that happens; other programs have seen performance issues crop up when the kernel makes subtle changes to how sched_yield() works. ^5

1: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723
2: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752
3: https://gitlab.haskell.org/ghc/ghc/issues/3553
4: https://twitter.com/ndm_haskell/status/1076187051610042368?s=20
5: https://lwn.net/Articles/31462/
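
For concreteness, here is a minimal sketch of the two strategies being compared, in C with C11 atomics and the Linux futex syscall. This is not the actual GHC RTS code; the names spin_lock_yield and futex_lock are invented for the example, and the futex path skips the usual contended/uncontended bookkeeping.

    #include <stdatomic.h>
    #include <sched.h>           /* sched_yield */
    #include <unistd.h>          /* syscall */
    #include <sys/syscall.h>     /* SYS_futex */
    #include <linux/futex.h>     /* FUTEX_WAIT, FUTEX_WAKE */

    /* Strategy 1: roughly the shape of a yielding spinlock. The scheduler
       only sees an anonymous yield; it has no idea we are waiting on the
       thread that currently holds the lock. */
    static void spin_lock_yield(atomic_int *lock)
    {
        for (;;) {
            for (int i = 0; i < 1000; i++) {
                int expected = 0;
                if (atomic_compare_exchange_strong(lock, &expected, 1))
                    return;              /* acquired */
            }
            sched_yield();               /* burn a slice and hope the holder runs */
        }
    }

    /* Strategy 2: park the thread on the lock word itself. The kernel now
       knows exactly which threads are blocked on which address. */
    static void futex_lock(atomic_int *lock)
    {
        int expected = 0;
        while (!atomic_compare_exchange_strong(lock, &expected, 1)) {
            /* Sleep only if the lock still holds the value we just observed. */
            syscall(SYS_futex, lock, FUTEX_WAIT, expected, NULL, NULL, 0);
            expected = 0;
        }
    }

    static void futex_unlock(atomic_int *lock)
    {
        atomic_store(lock, 0);
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);  /* wake one waiter */
    }

The difference is in who knows what: in the first version the kernel only sees an anonymous sched_yield(), while in the second it knows which address each waiter is parked on, so a FUTEX_WAKE can run precisely the thread that should run next.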

There are more related updates in https://gitlab.haskell.org/ghc/ghc/issues/9221, including a short discussion of Linus's post. Simon Marlow's overall response was:
I'm very supportive of making this better, but as usual I will require thorough data to back up any changes :)
Everything I tried in the past made things worse. Including an experiment I did to use futexes directly: https://gitlab.haskell.org/ghc/ghc/issues/3553?cversion=0&cnum_hist=14#note_39009
So it sounds like this topic is currently at the stage where somebody needs to take the time to re-do the benchmark from 10 years ago.

Travis Whitaker writes:
Hello GHC devs,
Through the course of reading some recent material posted by Linus Torvalds ^1 ^2, I remembered stumbling upon GHC Trac #3553 ^3 some years ago.
For what it's worth, Simon Marlow and I discussed this a few weeks ago and he agreed that it would be worth re-running the futex experiments. I do suspect there is good money on the table here on larger/busier machines. It would be great if someone could pick this up!

Cheers,

- Ben

Ben Gamari
Travis Whitaker writes:
Hello GHC devs,
Through the course of reading some recent material posted by Linus Torvalds ^1 ^2, I remembered stumbling upon GHC Trac #3553 ^3 some years ago.
For what it's worth Simon Marlow and I discussed this a few weeks ago and he agreed that it would be worth re-running the futex experiments.
Whoops. The above should have read "mutex" experiments. Of course, perhaps direct futex usage is also worth evaluating, but simpler is better if there is no performance trade-off.

Cheers,

- Ben
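
As a point of reference, the "simpler" route here is just the platform mutex. On Linux, glibc's pthread mutexes already park contended waiters on a futex, so a sketch like the following (illustrative only; gc_lock is an invented name, not an RTS identifier) gets futex-style blocking without any hand-rolled futex calls:

    #include <pthread.h>

    /* Hypothetical lock for illustration; on Linux/glibc, pthread_mutex_lock
       blocks contended waiters on a futex under the hood. */
    static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;

    static void acquire_gc_lock(void) { pthread_mutex_lock(&gc_lock); }
    static void release_gc_lock(void) { pthread_mutex_unlock(&gc_lock); }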
participants (3)
- Ben Gamari
- Niklas Hambüchen
- Travis Whitaker