Re: [Haskell-cafe] Re: GHC's parallel garbage collector -- what am I doing wrong?

7 Mar 2010

      On 07/03/10 14:41, Jan-Willem Maessen wrote:
...
On Mar 3, 2010, at 8:44 AM, Simon Marlow wrote:
...
On 01/03/2010 21:20, Michael Lesniak wrote:
...
Hello Bryan,
...
The parallel GC currently doesn't behave well with concurrent
programs that use multiple capabilities (aka OS threads), and
the behaviour you see is the known symptom of this. I believe
that Simon Marlow has some fixes in hand that may go into
6.12.2.
It's more correct to say the parallel GC has difficulty when one of
its threads is descheduled by the OS, because the other threads
just spin waiting for it.  Presumably some kernels are more
susceptible than others due to differences in scheduling policy; I
know they've been fiddling around with this a lot in Linux
recently.
You typically don't see a problem when there are spare cores; the
slowdown manifests when you are trying to use all the cores in your
machine, so it affects people on dual-cores quite a lot. This
probably explains why I've not been particularly affected by this
myself, since I do most of my benchmarking on an 8-core box.
The fix that will be in 6.12.2 is to insert some yields, so that
threads will yield rather than spinning indefinitely, and this
seems to help a lot.
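
To illustrate the kind of fix described above, here is a minimal C sketch of a spin-wait that calls sched_yield() after a bounded number of busy iterations instead of spinning indefinitely. It is not the GHC RTS code: the function name spin_wait_with_yield and the constant SPINS_BEFORE_YIELD are made up for this example, and the threshold value is arbitrary.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical tuning constant: busy-wait iterations between yields. */
#define SPINS_BEFORE_YIELD 1000

/* Wait until *flag becomes nonzero.
 * Spins for a bounded number of iterations, then yields the CPU and
 * tries again. Returns how many times the caller yielded. */
unsigned long spin_wait_with_yield(atomic_int *flag)
{
    unsigned long yields = 0;

    for (;;) {
        for (int i = 0; i < SPINS_BEFORE_YIELD; i++) {
            if (atomic_load_explicit(flag, memory_order_acquire))
                return yields;
        }
        /* Give the scheduler a chance to run a descheduled peer thread. */
        sched_yield();
        yields++;
    }
}
```

How sched_yield() interacts with the scheduler differs between operating systems and kernel versions, so the effect of a loop like this has to be measured on each target.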
Be warned that inserting yield into a spin loop is also non-portable,
and may make the problem *worse* on some systems.
The problem is that "yield" calls can be taken by the scheduler to
mean "See, I'm a nice thread, giving up the core when I don't need
it.  Please give me extra Scheduling Doubloons."
Now let's say 7 of your 8 threads are doing this.  It's likely that
each one will yield to the next, and the 8th thread (the one you
actually want on-processor) could take a long time to bubble up and
get its moment.  At one time on Solaris you could even livelock
(because the scheduler didn't try particularly hard to be fair in the
case of multiple yielding threads in a single process)---but that was
admittedly a long time ago.
How depressing, thanks for that :)
...
The only recourse I know about is to tell the OS you're doing
synchronization (by using OS-visible locking calls, say the ones in
pthreads or some of the lightweight calls that Linux has added for
the purpose).  Obviously this has a cost if anyone falls out of the
spin loop---and it's pretty likely some thread will have to wait a
while.
Yes, so we tried using futexes on Linux; there's an experimental patch
attached to

http://hackage.haskell.org/trac/ghc/ticket/3553

It was definitely slower than the spinlocks on the benchmarks I tried.

I think the problem is that we're using these spinlocks to synchronise
across all cores, and it's likely that these loops will have to spin for
a while before exiting because one or more of the running cores takes a
while to get to a safe point.  But really giving up the core and
blocking is a lot worse, because the wakeup time is so long (you can see
it pretty clearly in ThreadScope).
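
For readers who want to see the trade-off between spinning and blocking in code, here is a rough C sketch of a waiter that spins for a bounded number of iterations and then falls back to an OS-visible wait on a pthread mutex and condition variable. It is neither the futex patch from the ticket nor anything in the GHC RTS; the Gate type, the function names and the spin_limit parameter are all invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-shot gate: waiters proceed once it has been opened. */
typedef struct {
    atomic_bool     ready;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} Gate;

void gate_init(Gate *g)
{
    atomic_init(&g->ready, false);
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
}

/* Open the gate and wake any threads blocked in gate_wait(). */
void gate_open(Gate *g)
{
    pthread_mutex_lock(&g->lock);
    atomic_store_explicit(&g->ready, true, memory_order_release);
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

/* Spin up to spin_limit times, then block until the gate is open.
 * Returns true if the caller had to block (and so pays the wakeup cost). */
bool gate_wait(Gate *g, int spin_limit)
{
    for (int i = 0; i < spin_limit; i++) {
        if (atomic_load_explicit(&g->ready, memory_order_acquire))
            return false;               /* cheap path: never left the CPU */
    }

    bool blocked = false;
    pthread_mutex_lock(&g->lock);
    while (!atomic_load_explicit(&g->ready, memory_order_acquire)) {
        blocked = true;
        pthread_cond_wait(&g->cond, &g->lock);  /* the kernel knows we wait */
    }
    pthread_mutex_unlock(&g->lock);
    return blocked;
}
```

A larger spin_limit makes the slow path rarer but burns more CPU while a peer is descheduled; a smaller one hands control to the OS sooner but pays the wakeup latency more often.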

Anyway, I hope all this is just a temporary problem until we get 
CPU-independent GC working.

Cheers,
	Simon