parallel garbage collection performance

Hello,

I have a program that is intermittently experiencing performance issues that I believe are related to parallel GC, and I was hoping to get some advice on how I might improve it. Essentially, any given execution is either slow or fast (the same executable, without recompiling), most often slow, and so far I can't find anything that would trigger either case. This is with ghc-7.4.2 on 64-bit Linux. Here are the statistics from running with -N4 -A8m -s:

Slow run:

    16,647,460,328 bytes allocated in the heap
       313,767,248 bytes copied during GC
        17,305,120 bytes maximum residency (22 sample(s))
           601,952 bytes maximum slop
                73 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
    Gen  0      1268 colls,  1267 par    8.62s    8.00s     0.0063s    0.0389s
    Gen  1        22 colls,    22 par    0.63s    0.60s     0.0275s    0.0603s

    Parallel GC work balance: 1.53 (39176141 / 25609887, ideal 4)

                          MUT time (elapsed)       GC time  (elapsed)
    Task  0 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
    Task  1 (worker) :    0.00s    ( 13.66s)       0.01s    (  0.04s)
    Task  2 (bound)  :    0.00s    ( 13.98s)       0.00s    (  0.00s)
    Task  3 (worker) :    0.00s    ( 18.14s)       0.16s    (  0.44s)
    Task  4 (worker) :    0.53s    ( 17.49s)       1.29s    (  4.25s)
    Task  5 (worker) :    0.00s    ( 17.45s)       1.25s    (  4.42s)
    Task  6 (worker) :    0.00s    ( 14.98s)       1.75s    (  6.90s)
    Task  7 (worker) :    0.00s    ( 21.87s)       0.02s    (  0.06s)
    Task  8 (worker) :    0.01s    ( 37.12s)       0.06s    (  0.17s)
    Task  9 (worker) :    0.00s    ( 21.41s)       4.88s    ( 15.99s)
    Task 10 (worker) :    0.84s    ( 43.06s)       1.99s    (  8.25s)
    Task 11 (bound)  :    6.39s    ( 51.13s)       0.06s    (  0.18s)
    Task 12 (worker) :    0.00s    (  0.00s)       8.04s    ( 21.42s)
    Task 13 (worker) :    0.43s    ( 28.38s)       8.14s    ( 22.94s)
    Task 14 (worker) :    5.35s    ( 29.30s)       5.81s    ( 22.02s)

    SPARKS: 7 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

    INIT    time    0.03s  (  0.01s elapsed)
    MUT     time   43.88s  ( 42.71s elapsed)
    GC      time    9.26s  (  8.60s elapsed)
    EXIT    time    0.01s  (  0.01s elapsed)
    Total   time   53.65s  ( 51.34s elapsed)

    Alloc rate    374,966,825 bytes per MUT second

    Productivity  82.7% of total user, 86.4% of total elapsed

    gc_alloc_block_sync: 1388000
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 0

Fast run:

    42,061,441,560 bytes allocated in the heap
       725,062,720 bytes copied during GC
        36,963,480 bytes maximum residency (21 sample(s))
         1,382,536 bytes maximum slop
               141 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
    Gen  0      3206 colls,  3205 par    8.34s    1.87s     0.0006s    0.0089s
    Gen  1        21 colls,    21 par    0.76s    0.17s     0.0081s    0.0275s

    Parallel GC work balance: 1.78 (90535973 / 50955059, ideal 4)

                          MUT time (elapsed)       GC time  (elapsed)
    Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
    Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
    Task  2 (worker) :    0.00s    ( 11.50s)       0.00s    (  0.00s)
    Task  3 (worker) :    0.00s    ( 12.40s)       0.00s    (  0.00s)
    Task  4 (worker) :    0.58s    ( 12.20s)       0.59s    (  0.61s)
    Task  5 (bound)  :    0.00s    ( 12.89s)       0.00s    (  0.00s)
    Task  6 (worker) :    0.00s    ( 13.40s)       0.02s    (  0.02s)
    Task  7 (worker) :    0.00s    ( 14.66s)       0.00s    (  0.00s)
    Task  8 (worker) :    0.95s    ( 14.18s)       0.69s    (  0.76s)
    Task  9 (worker) :    2.82s    ( 13.50s)       1.37s    (  1.44s)
    Task 10 (worker) :    1.72s    ( 17.59s)       1.07s    (  1.16s)
    Task 11 (worker) :    3.99s    ( 24.68s)       0.37s    (  0.38s)
    Task 12 (worker) :    1.24s    ( 24.25s)       0.80s    (  0.82s)
    Task 13 (bound)  :    6.18s    ( 25.02s)       0.04s    (  0.04s)
    Task 14 (worker) :    1.46s    ( 23.42s)       1.59s    (  1.65s)
    Task 15 (worker) :    0.00s    (  0.00s)       0.66s    (  0.66s)
    Task 16 (worker) :   11.00s    ( 23.36s)       1.67s    (  1.70s)

    SPARKS: 28 (28 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

    INIT    time    0.04s  (  0.02s elapsed)
    MUT     time   42.08s  ( 23.02s elapsed)
    GC      time    9.10s  (  2.04s elapsed)
    EXIT    time    0.00s  (  0.00s elapsed)
    Total   time   51.69s  ( 25.09s elapsed)

    Alloc rate    987,695,300 bytes per MUT second

    Productivity  82.3% of total user, 169.6% of total elapsed

    gc_alloc_block_sync: 164572
    whitehole_spin: 0
    gen[0].sync: 164
    gen[1].sync: 18147

When I record an eventlog and view it with ThreadScope, the slow run shows long, frequent pauses for GC, whereas on a fast run GC is extremely fast. Running with the parallel collector disabled (-qg) falls more or less consistently between these two runs.

Given this, can anyone suggest any likely causes of this issue, or anything I might want to look for? Also, should I be concerned about the much larger gc_alloc_block_sync level for the slow run? Does that indicate the allocator waiting to alloc a new block, or is it something else? Am I on completely the wrong track?

Thanks very much,
John L.
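(John's program itself isn't shown in the thread. As a point of reference only, the following is a hedged sketch of the sort of spark-based program and the command lines that produce statistics and an eventlog like the ones above. The module name, the fib workload, and the spark structure are invented for illustration and need the parallel package; only the -N4 -A8m -s flags match the runs quoted above, and -eventlog / -ls are the standard way to produce an eventlog for ThreadScope.)

    -- Sketch only, not John's program.  A minimal spark-based Main.hs
    -- (hypothetical name), built and run so that it emits this kind of
    -- output:
    --
    --   ghc -O2 -threaded -rtsopts -eventlog Main.hs
    --   ./Main +RTS -N4 -A8m -s -ls
    --
    -- -s prints the GC statistics; -ls (together with -eventlog at
    -- compile time) writes Main.eventlog, which ThreadScope can open.
    import Control.Parallel.Strategies (parList, rseq, using)

    -- A deliberately expensive pure function, purely to give the sparks
    -- some work to do.
    fib :: Integer -> Integer
    fib n | n < 2     = n
          | otherwise = fib (n - 1) + fib (n - 2)

    main :: IO ()
    main = do
      -- Evaluate the list elements as independent sparks, spread across
      -- the capabilities requested with -N.
      let results = map fib [30 .. 36] `using` parList rseq
      print (sum results)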

On June 18, 2012 04:20:51 John Lato wrote:
Given this, can anyone suggest any likely causes of this issue, or anything I might want to look for? Also, should I be concerned about the much larger gc_alloc_block_sync level for the slow run? Does that indicate the allocator waiting to alloc a new block, or is it something else? Am I on completely the wrong track?
A total shot in the dark here, but wasn't there something about really bad performance when you used all the CPUs on your machine under Linux? Presumably very tight coupling is causing all the threads to stall every time the OS needs to do something?

Cheers!
-Tyson

On 19/06/2012, at 24:48 , Tyson Whitehead wrote:
On June 18, 2012 04:20:51 John Lato wrote:
Given this, can anyone suggest any likely causes of this issue, or anything I might want to look for? Also, should I be concerned about the much larger gc_alloc_block_sync level for the slow run? Does that indicate the allocator waiting to alloc a new block, or is it something else? Am I on completely the wrong track?
A total shot in the dark here, but wasn't there something about really bad performance when you used all the CPUs on your machine under Linux?
Presumably very tight coupling is causing all the threads to stall every time the OS needs to do something?
This can be a problem for data-parallel computations (like in Repa). In Repa, all threads in the gang are supposed to run for the same length of time, but if one gets swapped out by the OS then the whole gang is stalled. I tend to get the best results using -N7 on an 8-core machine. It is also important to enable thread affinity (with the -qa flag). For a Repa program on an 8-core machine I use:

    +RTS -N7 -qa -qg

Ben.
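(To make that concrete, here is a minimal sketch, not from the thread, of a Repa program those flags would apply to. The file name, the array size, and the squaring workload are made up for illustration; it assumes the Repa 3 API from the repa package.)

    -- Compile with:  ghc -O2 -threaded -rtsopts ParSquares.hs   (hypothetical name)
    -- Run with:      ./ParSquares +RTS -N7 -qa -qg
    import Data.Array.Repa as R

    main :: IO ()
    main = do
      -- A made-up input array; any large unboxed array will do.
      let xs = fromListUnboxed (Z :. (1000000 :: Int)) [1 .. 1000000]
                 :: Array U DIM1 Double
      -- computeP evaluates the delayed array in parallel on Repa's
      -- thread gang, which is what -N7 -qa -qg is tuning.
      ys <- computeP (R.map (\x -> x * x) xs) :: IO (Array U DIM1 Double)
      print (ys ! (Z :. 0))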

I wonder, do we have a Repa FAQ (or similar) that explains such issues? (And is easily discoverable?)
Manuel

On 19/06/2012, at 10:59 , Manuel M T Chakravarty wrote:
I wonder, do we have a Repa FAQ (or similar) that explain such issues? (And is easily discoverable?)
I've been trying to collect the main points in the haddocs for the main module [1], but this one isn't there yet. I need to update the Repa tutorial on the Haskell wiki, and this should also go in it.

Ben.

[1] http://hackage.haskell.org/packages/archive/repa/3.2.1.1/doc/html/Data-Array...

On 19/06/2012, at 13:53 , Ben Lippmeier wrote:
On 19/06/2012, at 10:59 , Manuel M T Chakravarty wrote:
I wonder, do we have a Repa FAQ (or similar) that explain such issues? (And is easily discoverable?)
I've been trying to collect the main points in the haddocs for the main module [1], but this one isn't there yet.
I need to update the Repa tutorial, on the Haskell wiki, and this should also go in it
I also added thread affinity to the Repa FAQ [1].

Ben.

[1] http://repa.ouroborus.net/

Thanks for the suggestions. I'll try them and report back, although I've since found that, out of three not-identical systems, this problem only occurs on one. So I may try different kernel/system libs and see where that gets me.

-qg is funny. My interpretation of the results so far is that, when the parallel collector doesn't get stalled, it results in a big win, but when parallel GC does stall, it's slower than disabling parallel GC entirely.

I had thought the last-core parallel slowdown problem was fixed a while ago, but apparently not?
Thanks,
John

Bryan O'Sullivan
On Mon, Jun 18, 2012 at 9:32 PM, John Lato wrote:
I had thought the last-core parallel slowdown problem was fixed a while ago, but apparently not?
Simon Marlow has thought so in the not too distant past (since he did the work), if my recollection is correct.
It may very well be fixed for non-data-parallel programs. For data-parallel programs the situation is trickier, as we rely on all threads participating in a DP computation being scheduled simultaneously. If one core is currently tied up by the OS, then GHC's RTS can't do anything about that. As it has no concept of gang scheduling (it treats the threads participating in a DP computation individually), it also doesn't know that scheduling only a subset of the threads in the gang is counterproductive.

On 19/06/12 02:32, John Lato wrote:
Thanks for the suggestions. I'll try them and report back. Although I've since found that out of 3 not-identical systems, this problem only occurs on one. So I may try different kernel/system libs and see where that gets me.
-qg is funny. My interpretation from the results so far is that, when the parallel collector doesn't get stalled, it results in a big win. But when parGC does stall, it's slower than disabling parallel gc entirely.
Parallel GC is usually a win for idiomatic Haskell code; it may or may not be a good idea for things like Repa - I haven't done much analysis of those types of programs yet. Experiment with the -A flag: e.g. -A1m is often better than the default if your processor has a large cache.

However, the parallel GC will be a problem if one or more of your cores is being used by other process(es) on the machine. In that case, the GC synchronisation will stall and performance will go down the drain. You can often see this on a ThreadScope profile as a big delay during GC while the other cores wait for the delayed core. Make sure your machine is quiet and/or use one fewer core than the total available. It's not usually a good idea to use hyperthreaded cores either.

I'm also seeing unpredictable performance on a 32-core AMD machine with NUMA, so I'd avoid NUMA for Haskell for the time being if you can. Indeed, you get unpredictable performance on this machine even for single-threaded code, because it makes a difference on which node the pages of your executable are cached (I heard a rumour that Linux has some kind of fix for this in the pipeline, but I don't know the details).
I had thought the last core parallel slowdown problem was fixed a while ago, but apparently not?
We improved matters by inserting some "yield"s into the spinlock loops. This helped a lot, but the problem still exists.

Cheers,
Simon
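(As a hedged illustration of the "one fewer core" advice above - my sketch, not something from Simon's message: since GHC 7.4 / base 4.5, the threaded RTS lets a program pick its own capability count at startup, so instead of hard-coding -N you can leave one core free programmatically. The heap-area size still comes from the RTS command line, e.g. +RTS -A1m -s.)

    -- Sketch: requires base >= 4.5 (GHC 7.4) and compiling with
    -- -threaded -rtsopts.
    import Control.Concurrent (setNumCapabilities)
    import GHC.Conc (getNumProcessors)

    main :: IO ()
    main = do
      n <- getNumProcessors
      -- Leave one core for the OS and other processes, so a descheduled
      -- core is less likely to stall every parallel GC synchronisation.
      let caps = max 1 (n - 1)
      setNumCapabilities caps
      putStrLn ("Using " ++ show caps ++ " of " ++ show n ++ " cores")
      -- ... the real parallel work would go here ...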

However, the parallel GC will be a problem if one or more of your cores is being used by other process(es) on the machine. In that case, the GC synchronisation will stall and performance will go down the drain. You can often see this on a ThreadScope profile as a big delay during GC while the other cores wait for the delayed core. Make sure your machine is quiet and/or use one fewer cores than the total available. It's not usually a good idea to use hyperthreaded cores either.
Does it ever help to set the number of GC threads greater than numCapabilities, to over-partition the GC work? The idea would be to enable some load balancing in the face of perturbation from external load on the machine. It looks like GHC 6.10 had a "-g" flag for this that later went away?

-Ryan

On 26/06/2012 00:42, Ryan Newton wrote:
Does it ever help to set the number of GC threads greater than numCapabilities to over-partition the GC work? The idea would be to enable some load balancing in the face of perturbation from external load on the machine...
It looks like GHC 6.10 had a "-g" flag for this that.... later went away?
The GC threads map one-to-one onto mutator threads now (since 6.12). This change was crucial for performance; before that we hardly ever got any speedup from parallel GC, because there was no guarantee of locality. I don't think it would help to have more threads. The load balancing is already done with work stealing; it isn't statically partitioned.

Cheers,
Simon

Thanks very much for this information. My observations match your
recommendations, insofar as I can test them.
Cheers,
John
participants (7)
- Ben Lippmeier
- Bryan O'Sullivan
- John Lato
- Manuel M T Chakravarty
- Ryan Newton
- Simon Marlow
- Tyson Whitehead