Scary parallel garbage collection behavior (6.12.1, Mac OS X)

Hi, list. This is a long post, sorry. The bottom line: specifying +RTS -Nn, where n is greater than the number of cores, slows a program compiled with 6.12.1 by orders of magnitude under Mac OS X. Turning off parallel garbage collection with -qg resolves the problem. Independent verification would be appreciated.

I have a very simple program:

    import Control.Parallel

    main = do
      let s  = s1 `par` (s2 `pseq` s1 + s2)
          s1 = sum [1..10000000]
          s2 = sum [1..10000005]
      print s

compiled with ghc -O -threaded. Running it without any RTS options,

    time ./clocktime

produces the following times:

    real  0m3.779s
    user  0m3.644s
    sys   0m0.053s

Giving the -N option to the RTS (or -N2; I have a two-core MacBook) shows a decent speedup:

    100000060000015

    real  0m2.678s
    user  0m4.762s
    sys   0m0.043s

though the 30% increase in processor time seems suspicious. Things become weird when I increase the number of threads. I didn't expect any further speedup for this task, but what I get is absolutely unexpected:

    time ./clocktime +RTS -N3
    100000060000015

    real  0m57.071s
    user  1m40.910s
    sys   0m0.921s

Almost a minute!! Adding more threads does not help either :)

    time ./clocktime +RTS -N4
    100000060000015

    real  1m48.281s
    user  3m10.832s
    sys   0m1.590s

A bit of playing with Instruments/DTrace shows that runaway garbage collection brings the program to its knees. In the case of -N1, the top time consumers look like this:

    Self %  Total %  # Samples  Library    Symbol Name
    12.8    12.8     408        clocktime  __gmpz_add
    10.7    10.7     341        clocktime  integer_cmm_plusIntegerzh
    10.0    10.0     321        clocktime  allocateLocal

etc. This looks absolutely reasonable. Going to -N2 changes the picture to:

    Self %  Total %  # Samples  Library    Symbol Name
    15.6    15.6     989        clocktime  gcWorkerThread
    11.6    11.6     736        clocktime  waitForGcThreads
     9.5     9.5     603        clocktime  scavenge_until_all_done
     7.4     7.4     471        clocktime  setContextSwitches
     6.2     6.2     396        clocktime  __gmpz_add

-N3 is awful:

    Self %  Total %  # Samples  Library    Symbol Name
    51.7    51.7     29514      clocktime  gcWorkerThread
    25.5    25.5     14591      clocktime  scavenge_until_all_done
     7.2     7.2     4114       clocktime  waitForGcThreads
     5.5     5.5     3180       clocktime  GarbageCollect
     5.2     5.2     2998       clocktime  setContextSwitches
     0.4     0.4     240        clocktime  integer_cmm_plusIntegerzh

Pavel
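P.S. For anyone who wants to reproduce this, the exact invocations I used were roughly the following (clocktime.hs is just my name for the source file, and --make is needed on 6.12 to build and link in one step):

    ghc -O -threaded --make clocktime.hs
    time ./clocktime +RTS -N3 -qg

With -qg the parallel GC is switched off, and the -N3 run drops back to the normal speed reported above.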

On 19/02/2010 21:26, Pavel Perikov wrote:
This is a long post, sorry. The bottom line: specifying +RTS -Nn, where n is greater than the number of cores, slows a program compiled with 6.12.1 by orders of magnitude under Mac OS X. Turning off parallel garbage collection with -qg resolves the problem. Independent verification would be appreciated.
The parallel GC currently requires all cores to synchronise, so if you're trying to use more cores than you actually have, you can expect the cost of that synchronisation to be high.

Having said that, the cost is higher than we would like, mainly due to the use of spinlocks for synchronisation. Spinlocks turned out to be the fastest way to synchronise when all the threads are running, but they are quite bad when some threads have been descheduled (as would be the case if you are using more threads than you have cores). So in 6.12.2 there will be some changes that reduce the overhead quite significantly; see

http://hackage.haskell.org/trac/ghc/ticket/3758

and

http://ghcmutterings.wordpress.com/2010/01/25/yielding-more-improvements-in-...

Cheers,
Simon
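To make the failure mode concrete, here is a minimal spinlock sketch in Haskell. This is not GHC's RTS code (which is C); it is only an illustration, under that caveat, of why spinning is cheap while the lock holder is running and very expensive once the holder has been descheduled:

    import Control.Monad (unless)
    import Data.IORef (IORef, newIORef, atomicModifyIORef)

    -- A spinlock: an IORef Bool, False = free, True = held.
    newtype SpinLock = SpinLock (IORef Bool)

    newSpinLock :: IO SpinLock
    newSpinLock = fmap SpinLock (newIORef False)

    -- Busy-wait until the flag flips from False to True on our behalf.
    -- While the holder is running, this loop exits almost immediately;
    -- if the holder has been descheduled by the OS (likely when -N
    -- exceeds the core count), every spinner burns its whole time
    -- slice here to no effect before the holder can release the lock.
    acquire :: SpinLock -> IO ()
    acquire l@(SpinLock r) = do
      gotIt <- atomicModifyIORef r (\held -> (True, not held))
      unless gotIt (acquire l)  -- spin; calling Control.Concurrent.yield
                                -- here is the usual mitigation

    release :: SpinLock -> IO ()
    release (SpinLock r) = atomicModifyIORef r (\_ -> (False, ()))

Yielding inside the spin loop lets a descheduled holder run sooner at the cost of some latency when everyone is running, which, judging by its title, is the spirit of the 6.12.2 change described in the blog post above.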

Thanks, Simon. I know I should try to find the existing ticket before posting :)

P.