
On 07/03/10 14:41, Jan-Willem Maessen wrote:
On Mar 3, 2010, at 8:44 AM, Simon Marlow wrote:
On 01/03/2010 21:20, Michael Lesniak wrote:
Hello Bryan,
The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is the known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.
It's more correct to say the parallel GC has difficulty when one of its threads is descheduled by the OS, because the other threads just spin waiting for it. Presumably some kernels are more susceptible than others due to differences in scheduling policy; I know they've been fiddling around with this a lot in Linux recently.
You typically don't see a problem when there are spare cores; the slowdown manifests when you are trying to use all the cores in your machine, so it affects people on dual-cores quite a lot. This probably explains why I've not been particularly affected by this myself, since I do most of my benchmarking on an 8-core box.
The fix that will be in 6.12.2 is to insert some yields, so that threads will yield rather than spinning indefinitely, and this seems to help a lot.
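For readers wondering what "insert some yields" might look like, here is a minimal sketch in C. It is not the GHC RTS code: the names spin_lock_t, SPIN_LIMIT and the spin_lock_* functions are hypothetical, and the threshold of 1000 is an arbitrary assumption. The only idea illustrated is that a waiter spins a bounded number of times and then calls sched_yield(), so that a descheduled thread has a chance to run.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning constant: failed attempts before yielding. */
#define SPIN_LIMIT 1000

/* Hypothetical spinlock type, for illustration only. */
typedef struct {
    atomic_flag locked;
} spin_lock_t;

static void spin_lock_init(spin_lock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->locked);
}

/* Single attempt: returns true if the lock was taken. */
static bool spin_lock_try_acquire(spin_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Spin up to SPIN_LIMIT times, then yield the CPU and start over. */
static void spin_lock_acquire(spin_lock_t *l)
{
    unsigned spins = 0;
    while (!spin_lock_try_acquire(l)) {
        if (++spins >= SPIN_LIMIT) {
            sched_yield();  /* let another runnable thread have the core */
            spins = 0;
        }
    }
}

static void spin_lock_release(spin_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

How sched_yield() behaves is up to the OS scheduler, so the right value for SPIN_LIMIT (or whether yielding helps at all) would have to be measured on each platform.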
Be warned that inserting yield into a spin loop is also non-portable, and may make the problem *worse* on some systems.
The problem is that "yield" calls can be taken by the scheduler to mean "See, I'm a nice thread, giving up the core when I don't need it. Please give me extra Scheduling Dubloons."
Now let's say 7 of your 8 threads are doing this. It's likely that each one will yield to the next, and the 8th thread (the one you actually want on-processor) could take a long time to bubble up and get its moment. At one time on Solaris you could even livelock (because the scheduler didn't try particularly hard to be fair in the case of multiple yielding threads in a single process)---but that was admittedly a long time ago.
How depressing, thanks for that :)
The only recourse I know about is to tell the OS you're doing synchronization (by using OS-visible locking calls, say the ones in pthreads or some of the lightweight calls that Linux has added for the purpose). Obviously this has a cost if anyone falls out of the spin loop---and it's pretty likely some thread will have to wait a while.
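As a rough illustration of "OS-visible locking calls", the sketch below waits on a pthreads mutex and condition variable instead of spinning. The gate_t type and the gate_* functions are hypothetical names made up for this example; they are not from GHC or from any patch mentioned in this thread. A waiter that finds the gate closed blocks in the kernel, which is where the wakeup cost comes from.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot gate: waiters block until it is opened. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            ready;
} gate_t;

static void gate_init(gate_t *g)
{
    assert(g != NULL);
    pthread_mutex_init(&g->mutex, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->ready = false;
}

/* Block (rather than spin) until another thread opens the gate. */
static void gate_wait(gate_t *g)
{
    pthread_mutex_lock(&g->mutex);
    while (!g->ready) {
        /* Atomically releases the mutex and sleeps; the loop
           guards against spurious wakeups. */
        pthread_cond_wait(&g->cond, &g->mutex);
    }
    pthread_mutex_unlock(&g->mutex);
}

/* Open the gate and wake every blocked waiter. */
static void gate_open(gate_t *g)
{
    pthread_mutex_lock(&g->mutex);
    g->ready = true;
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->mutex);
}

static void gate_destroy(gate_t *g)
{
    pthread_cond_destroy(&g->cond);
    pthread_mutex_destroy(&g->mutex);
}
```

Because the scheduler can see that the waiters are blocked, it has no reason to run them ahead of the thread they are waiting for; the price is a sleep and wakeup for any thread that arrives before the gate opens.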
Yes, so we tried using futexes on Linux; there's an experimental patch attached to http://hackage.haskell.org/trac/ghc/ticket/3553. It was definitely slower than the spinlocks on the benchmarks I tried. I think the problem is that we're using these spinlocks to synchronise across all cores, and it's likely that these loops will have to spin for a while before exiting because one or more of the running cores takes a while to get to a safe point. But really giving up the core and blocking is a lot worse, because the wakeup time is so long (you can see it pretty clearly in ThreadScope).
Anyway, I hope all this is just a temporary problem until we get CPU-independent GC working.
Cheers,
Simon