
Thanks very much for this information. My observations match your
recommendations, insofar as I can test them.
Cheers,
John
On Mon, Jun 25, 2012 at 11:42 PM, Simon Marlow wrote:
On 19/06/12 02:32, John Lato wrote:
Thanks for the suggestions. I'll try them and report back. I've since found that out of 3 not-identical systems, this problem only occurs on one, so I may try different kernel/system libs and see where that gets me.
-qg is funny. My interpretation of the results so far is that, when the parallel collector doesn't get stalled, it's a big win. But when the parallel GC does stall, it's slower than disabling parallel GC entirely.
Parallel GC is usually a win for idiomatic Haskell code; it may or may not be a good idea for things like Repa - I haven't done much analysis of those types of programs yet. Experiment with the -A flag: e.g. -A1m is often better than the default if your processor has a large cache.
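To make the -A experiment concrete, here is a sketch of how one might build and run a program with a larger nursery (the program name MyProg is hypothetical; -s just prints GC statistics so the effect can be compared):

```shell
# Compile with the threaded RTS and allow runtime flags,
# so GC options can be tuned from the command line.
ghc -O2 -threaded -rtsopts MyProg.hs

# The default allocation area in this era of GHC is 512k per core;
# a larger nursery such as -A1m can help on processors with big caches.
./MyProg +RTS -N4 -A1m -s
```

Trying a few values (-A1m, -A2m, -A8m) while watching the -s output is usually the quickest way to find the sweet spot for a given machine.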
However, the parallel GC will be a problem if one or more of your cores is being used by other process(es) on the machine. In that case, the GC synchronisation will stall and performance will go down the drain. You can often see this on a ThreadScope profile as a big delay during GC while the other cores wait for the delayed core. Make sure your machine is quiet and/or use one fewer cores than the total available. It's not usually a good idea to use hyperthreaded cores either.
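One way to follow the "one fewer core" advice without hard-coding -N is to set the number of capabilities at startup. A minimal sketch, assuming GHC >= 7.4 (for getNumProcessors) and a program compiled with -threaded:

```haskell
-- Leave one core free at startup so the parallel GC is less likely
-- to stall when another process runs on the machine.
import GHC.Conc (getNumProcessors, setNumCapabilities)

main :: IO ()
main = do
  n <- getNumProcessors
  let caps = max 1 (n - 1)   -- never drop below one capability
  setNumCapabilities caps
  -- ... rest of the program runs on caps capabilities ...
  putStrLn ("Using " ++ show caps ++ " capabilities")
```

This has the same effect as passing -N(cores-1) on the command line, but adapts automatically when the binary moves between machines with different core counts.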
I'm also seeing unpredictable performance on a 32-core AMD machine with NUMA. I'd avoid NUMA for Haskell for the time being if you can. Indeed you get unpredictable performance on this machine even for single-threaded code, because it makes a difference on which node the pages of your executable are cached (I heard a rumour that Linux has some kind of a fix for this in the pipeline, but I don't know the details).
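On Linux, one workaround worth trying for the NUMA page-placement effect is to interleave the process's memory across nodes with numactl, so performance no longer depends on which node happens to hold the executable's pages (the program name MyProg is hypothetical):

```shell
# Spread page allocations round-robin across all NUMA nodes,
# trading peak locality for predictable, repeatable performance.
numactl --interleave=all ./MyProg +RTS -N
```

This doesn't fix the underlying issue, but it can make benchmark numbers on NUMA boxes much less noisy.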
I had thought the last core parallel slowdown problem was fixed a while ago, but apparently not?
We improved matters by inserting some "yield"s into the spinlock loops. This helped a lot, but the problem still exists.
Cheers, Simon
Thanks, John
On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier wrote:
On 19/06/2012, at 24:48, Tyson Whitehead wrote:
On June 18, 2012 04:20:51 John Lato wrote:
Given this, can anyone suggest any likely causes of this issue, or anything I might want to look for? Also, should I be concerned about the much larger gc_alloc_block_sync level for the slow run? Does that indicate the allocator waiting to alloc a new block, or is it something else? Am I on completely the wrong track?
A total shot in the dark here, but wasn't there something about really bad performance when you used all the CPUs on your machine under Linux?
Presumably the very tight coupling causes all the threads to stall every time the OS needs to do something?
This can be a problem for data parallel computations (like in Repa). In Repa all threads in the gang are supposed to run for the same time, but if one gets swapped out by the OS then the whole gang is stalled.
I tend to get best results using -N7 for an 8 core machine.
It is also important to enable thread affinity (with the -qa flag).
For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg
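Putting the flags above together, a full invocation for a Repa program on an 8-core machine might look like this (the program name RepaProg is hypothetical):

```shell
# Threaded RTS with runtime options enabled.
ghc -O2 -threaded -rtsopts RepaProg.hs

# -N7: one fewer capability than cores, so the gang isn't swapped out
# -qa: pin worker threads to cores (thread affinity)
# -qg: disable parallel GC, avoiding the stall problem discussed above
./RepaProg +RTS -N7 -qa -qg
```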
Ben.
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users