GHC's parallel garbage collector -- what am I doing wrong?

Hello haskell-cafe,

Sorry for this long post, but I can't think of a way to describe and explain the problem more briefly. I am (again) seeing very strange behaviour with the parallel GC and would be glad if someone could either reproduce (and explain) it or provide a solution. A similar but unrelated problem was described in [1].

EXAMPLE CODE

The following demonstration program, a much smaller and single-threaded version of my real problem, behaves like my real program. It does some number crunching, calculating pi to a configurable precision:
-- File Pi.hs -- you need the numbers package from Hackage.
module Main where

import Data.Number.CReal
import System.Environment
import GHC.Conc

main = do
    digits <- (read . head) `fmap` getArgs :: IO Int
    calcPi digits

calcPi digits = showCReal (fromEnum digits) pi `pseq` return ()
Compile it with

ghc --make -threaded -O2 Pi.hs -o pi

BENCHMARKS

On my two-core machine I get the following quite strange and unpredictable results:

* Using one thread:

$ for i in `seq 1 5`;do time pi 5000 +RTS -N1;done
real 0m1.441s user 0m1.390s sys 0m0.020s
real 0m1.449s user 0m1.390s sys 0m0.000s
real 0m1.399s user 0m1.370s sys 0m0.010s
real 0m1.401s user 0m1.380s sys 0m0.000s
real 0m1.404s user 0m1.380s sys 0m0.000s

* Using two threads, hence the parallel GC is used:

$ for i in `seq 1 5`;do time pi 5000 +RTS -N2;done
real 0m2.540s user 0m2.490s sys 0m0.010s
real 0m1.527s user 0m1.530s sys 0m0.010s
real 0m1.966s user 0m1.900s sys 0m0.010s
real 0m5.670s user 0m5.620s sys 0m0.010s
real 0m2.966s user 0m2.910s sys 0m0.020s

* Using two threads, but disabling the parallel GC:

$ for i in `seq 1 5`;do time pi 5000 +RTS -N2 -qg;done
real 0m1.383s user 0m1.380s sys 0m0.010s
real 0m1.420s user 0m1.360s sys 0m0.010s
real 0m1.406s user 0m1.360s sys 0m0.010s
real 0m1.421s user 0m1.380s sys 0m0.000s
real 0m1.360s user 0m1.360s sys 0m0.000s

THREADSCOPE

I've additionally attached, as file pi.pdf, the ThreadScope profile of a really bad run, started with:

$ time pi 5000 +RTS -N2 -ls
real 0m15.594s user 0m15.490s sys 0m0.010s

FURTHER INFORMATION/QUESTION

Just disabling the parallel GC leads to very bad performance in my original code, which forks threads with forkIO and does a lot of communication. Hence, using -qg is not a real option for me.

Have I overlooked some crucial aspect of this problem? If you've read this far, thank you for reading ... this far ;-)

Cheers, Michael

[1] http://osdir.com/ml/haskell-cafe@haskell.org/2010-02/msg00850.html

--
Dipl.-Inf. Michael C. Lesniak
University of Kassel
Programming Languages / Methodologies Research Group
Department of Computer Science and Electrical Engineering
Wilhelmshöher Allee 73
34121 Kassel
Phone: +49-(0)561-804-6269

On Monday 01 March 2010 16:59:54 Michael Lesniak wrote:
I am (again) seeing very strange behaviour with the parallel GC and would be glad if someone could either reproduce (and explain) it or provide a solution.
Sorry, can't reproduce it:

$ for i in `seq 1 5`; do time ./mlPi 5000 +RTS -N1; done
7.54user 0.02system 0:07.57elapsed 99%CPU
7.53user 0.02system 0:07.56elapsed 99%CPU
7.70user 0.01system 0:07.72elapsed 99%CPU
7.55user 0.02system 0:07.57elapsed 99%CPU
7.57user 0.01system 0:07.58elapsed 99%CPU

$ for i in `seq 1 5`; do time ./mlPi 5000 +RTS -N2; done
7.76user 0.02system 0:07.68elapsed 101%CPU
7.69user 0.02system 0:07.63elapsed 101%CPU
7.94user 0.04system 0:07.89elapsed 101%CPU
7.72user 0.02system 0:07.64elapsed 101%CPU
7.68user 0.04system 0:07.62elapsed 101%CPU

$ for i in `seq 1 5`; do time ./mlPi 5000 +RTS -N2 -qg; done
7.56user 0.02system 0:07.58elapsed 99%CPU
7.58user 0.00system 0:07.58elapsed 99%CPU
7.55user 0.01system 0:07.57elapsed 100%CPU
7.57user 0.02system 0:07.58elapsed 100%CPU
7.56user 0.00system 0:07.57elapsed 99%CPU

It seems your system has a problem synchronising the cores for GC.

Hello Daniel,
Sorry, can't reproduce it:
That's also very helpful, so thanks for this :-)
It seems your system has a problem synchronising the cores for GC.
Correct. Could you please tell me your system configuration (i.e. GHC version and OS)?
Cheers, Michael

On Monday 01 March 2010 18:52:53 Michael Lesniak wrote:
Hello Daniel,
Sorry, can't reproduce it:
That's also very helpful, so thanks for this :-)
It seems your system has a problem synchronising the cores for GC.
Correct. Could you please tell me your system configuration (i.e. GHC version and OS)?
ghc-6.12.1. (Btw, I also tried with 6.10.3; there -N2 [and even -N3] are really close to -N1:

-N1: 7.57user 0.01system 0:07.58elapsed 100%CPU (+/- 0.02s)
-N2: 7.59user 0.03system 0:07.61elapsed 100%CPU (+/- 0.02s)
-N3: 7.60user 0.03system 0:07.62elapsed 100%CPU - 8.12user 0.00system 0:07.87elapsed 103%CPU

Typical -N3 behaviour with 6.12.1: 14.43user 0.03system 0:11.05elapsed 130%CPU.)

OS: openSuSE 11.1.

$ uname -a
Linux linux-mkk1 2.6.27.42-0.1-pae #1 SMP 2010-01-06 16:07:25 +0100 i686 i686 i386 GNU/Linux

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping : 9
cpu MHz : 3058.856
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pebs bts pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips : 6117.71
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping : 9
cpu MHz : 3058.856
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pebs bts pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips : 6118.12
clflush size : 64
power management:
Cheers, Michael

The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is the known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.
Are you saying that you see two different classes of undesirable performance, one with -qg and one without? How are the threads in your real program communicating with each other? We've seen problems there when there's a lot of contention for e.g. IORefs among thousands of threads.
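[Illustration: a minimal sketch of the contention pattern Bryan describes, with a thousand forkIO'd threads all updating one shared IORef. This code is hypothetical, not from Michael's program.]

import Control.Concurrent
import Control.Monad
import Data.IORef

main :: IO ()
main = do
    counter <- newIORef (0 :: Int)      -- one shared IORef
    done    <- newEmptyMVar
    -- fork a thousand threads, all contending on the same cell
    replicateM_ 1000 . forkIO $ do
        replicateM_ 10000 $ do
            n' <- atomicModifyIORef counter (\n -> (n + 1, n + 1))
            n' `seq` return ()          -- force now, avoid a chain of thunks
        putMVar done ()
    replicateM_ 1000 (takeMVar done)    -- wait for every worker
    readIORef counter >>= print         -- 10000000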
On Mon, Mar 1, 2010 at 7:59 AM, Michael Lesniak
Hello haskell-cafe,
Sorry for this long post, but I can't think of a way to describe and explain the problem more briefly.
I am (again) seeing very strange behaviour with the parallel GC and would be glad if someone could either reproduce (and explain) it or provide a solution. A similar but unrelated problem was described in [1].
EXAMPLE CODE

The following demonstration program, a much smaller and single-threaded version of my real problem, behaves like my real program. It does some number crunching, calculating pi to a configurable precision:
-- File Pi.hs -- you need the numbers package from Hackage.
module Main where

import Data.Number.CReal
import System.Environment
import GHC.Conc

main = do
    digits <- (read . head) `fmap` getArgs :: IO Int
    calcPi digits

calcPi digits = showCReal (fromEnum digits) pi `pseq` return ()
Compile it with
ghc --make -threaded -O2 Pi.hs -o pi
BENCHMARKS

On my two-core machine I get the following quite strange and unpredictable results:
* Using one thread:
$ for i in `seq 1 5`;do time pi 5000 +RTS -N1;done
real 0m1.441s user 0m1.390s sys 0m0.020s
real 0m1.449s user 0m1.390s sys 0m0.000s
real 0m1.399s user 0m1.370s sys 0m0.010s
real 0m1.401s user 0m1.380s sys 0m0.000s
real 0m1.404s user 0m1.380s sys 0m0.000s
* Using two threads, hence the parallel GC is used:
$ for i in `seq 1 5`;do time pi 5000 +RTS -N2;done
real 0m2.540s user 0m2.490s sys 0m0.010s
real 0m1.527s user 0m1.530s sys 0m0.010s
real 0m1.966s user 0m1.900s sys 0m0.010s
real 0m5.670s user 0m5.620s sys 0m0.010s
real 0m2.966s user 0m2.910s sys 0m0.020s
* Using two threads, but disabling the parallel GC:
$ for i in `seq 1 5`;do time pi 5000 +RTS -N2 -qg;done
real 0m1.383s user 0m1.380s sys 0m0.010s
real 0m1.420s user 0m1.360s sys 0m0.010s
real 0m1.406s user 0m1.360s sys 0m0.010s
real 0m1.421s user 0m1.380s sys 0m0.000s
real 0m1.360s user 0m1.360s sys 0m0.000s
THREADSCOPE

I've additionally attached, as file pi.pdf, the ThreadScope profile of a really bad run, started with:

$ time pi 5000 +RTS -N2 -ls
real 0m15.594s user 0m15.490s sys 0m0.010s
FURTHER INFORMATION/QUESTION

Just disabling the parallel GC leads to very bad performance in my original code, which forks threads with forkIO and does a lot of communication. Hence, using -qg is not a real option for me.
Have I overlooked some crucial aspect of this problem? If you've read this far, thank you for reading ... this far ;-)
Cheers, Michael
[1] http://osdir.com/ml/haskell-cafe@haskell.org/2010-02/msg00850.html
--
Dipl.-Inf. Michael C. Lesniak
University of Kassel
Programming Languages / Methodologies Research Group
Department of Computer Science and Electrical Engineering
Wilhelmshöher Allee 73 34121 Kassel
Phone: +49-(0)561-804-6269

Hello Bryan,
The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is the known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.
Are you saying that you see two different classes of undesirable performance, one with -qg and one without?
No, I wouldn't say that.
How are the threads in your real program communicating with each other? We've seen problems there when there's a lot of contention for e.g. IORefs among thousands of threads.
No, my program uses only MVars and Chans, and far fewer threads: normally as many threads as there are cores (doing number crunching).
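[Illustration: a minimal sketch of the structure Michael describes: one forkIO'd worker per core, communicating over Chans. The names jobs/results and the squaring "work" are made up for the example, not taken from his program.]

import Control.Concurrent
import Control.Monad
import GHC.Conc (numCapabilities)

worker :: Chan Int -> Chan Int -> IO ()
worker jobs results = forever $ do
    n <- readChan jobs
    writeChan results (n * n)   -- stand-in for the real number crunching

main :: IO ()
main = do
    jobs    <- newChan
    results <- newChan
    -- one worker per capability, i.e. per core the RTS was given
    replicateM_ numCapabilities (forkIO (worker jobs results))
    writeList2Chan jobs [1 .. 100]
    answers <- replicateM 100 (readChan results)
    print (sum answers)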
But, as Daniel pointed out, I believe it has something to do with Ubuntu's kernel patches. I tried another machine with an earlier kernel (2.6.28...) and the same program did not behave as badly. Currently I'm running some experiments with custom kernels; if I solve this with a self-compiled kernel, I'll blog about it.

Cheers, Michael

On 01/03/2010 21:20, Michael Lesniak wrote:
Hello Bryan,
The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is the known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.
It's more correct to say that the parallel GC has difficulty when one of its threads is descheduled by the OS, because the other threads just spin waiting for it. Presumably some kernels are more susceptible than others due to differences in scheduling policy; I know they've been fiddling around with this a lot in Linux recently.

You typically don't see a problem when there are spare cores; the slowdown manifests when you are trying to use all the cores in your machine, so it affects people on dual-cores quite a lot. This probably explains why I've not been particularly affected by this myself, since I do most of my benchmarking on an 8-core box.

The fix that will be in 6.12.2 is to insert some yields, so that threads will yield rather than spinning indefinitely, and this seems to help a lot.

Cheers, Simon
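[Illustration: a rough Haskell analogy of the spin-versus-yield strategies Simon describes. The actual fix lives in GHC's C runtime; tryAcquire, spinLock, and yieldingLock are illustrative names only, and Haskell's yield cedes the Haskell scheduler rather than the OS scheduler.]

import Control.Concurrent (yield)
import Data.IORef

-- One attempt at the lock: set it to True, report whether it was free.
tryAcquire :: IORef Bool -> IO Bool
tryAcquire lock = atomicModifyIORef lock (\held -> (True, not held))

-- Busy-wait: burns a core for as long as the lock holder is descheduled.
spinLock :: IORef Bool -> IO ()
spinLock lock = do
    ok <- tryAcquire lock
    if ok then return () else spinLock lock

-- Spin briefly, then yield the time slice instead of spinning forever.
yieldingLock :: IORef Bool -> IO ()
yieldingLock lock = go (1000 :: Int)
  where
    go 0 = yield >> go 1000
    go n = do
        ok <- tryAcquire lock
        if ok then return () else go (n - 1)

main :: IO ()
main = do
    lock <- newIORef False    -- False = free
    yieldingLock lock         -- acquire (uncontended in this toy demo)
    writeIORef lock False     -- release
    spinLock lock             -- acquire again via the spinning variant
    writeIORef lock False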

On Mar 3, 2010, at 8:44 AM, Simon Marlow wrote:
On 01/03/2010 21:20, Michael Lesniak wrote:
Hello Bryan,
The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is the known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.
It's more correct to say that the parallel GC has difficulty when one of its threads is descheduled by the OS, because the other threads just spin waiting for it. Presumably some kernels are more susceptible than others due to differences in scheduling policy; I know they've been fiddling around with this a lot in Linux recently.
You typically don't see a problem when there are spare cores; the slowdown manifests when you are trying to use all the cores in your machine, so it affects people on dual-cores quite a lot. This probably explains why I've not been particularly affected by this myself, since I do most of my benchmarking on an 8-core box.
The fix that will be in 6.12.2 is to insert some yields, so that threads will yield rather than spinning indefinitely, and this seems to help a lot.
Be warned that inserting yield into a spin loop is also non-portable, and may make the problem *worse* on some systems.

The problem is that "yield" calls can be taken by the scheduler to mean "See, I'm a nice thread, giving up the core when I don't need it. Please give me extra Scheduling Doubloons." Now let's say 7 of your 8 threads are doing this. It's likely that each one will yield to the next, and the 8th thread (the one you actually want on-processor) could take a long time to bubble up and get its moment. At one time on Solaris you could even livelock (because the scheduler didn't try particularly hard to be fair in the case of multiple yielding threads in a single process)---but that was admittedly a long time ago.

The only recourse I know about is to tell the OS you're doing synchronization (by using OS-visible locking calls, say the ones in pthreads or some of the lightweight calls that Linux has added for the purpose). Obviously this has a cost if anyone falls out of the spin loop---and it's pretty likely some thread will have to wait a while.

-Jan-Willem Maessen
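[Illustration: the blocking alternative Jan-Willem mentions, expressed in Haskell terms. An MVar lock parks the waiter in the runtime's scheduler instead of burning cycles, at the cost of a slow wakeup. newLock and withLock are illustrative names, not library functions.]

import Control.Concurrent.MVar
import Control.Exception (finally)

newLock :: IO (MVar ())
newLock = newMVar ()

-- takeMVar blocks the thread: no CPU is spent while waiting, but waking
-- a blocked thread costs far more than falling out of a short spin loop.
withLock :: MVar () -> IO a -> IO a
withLock lock act =
    takeMVar lock >> (act `finally` putMVar lock ())

main :: IO ()
main = do
    lock <- newLock
    withLock lock (putStrLn "in the critical section")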
Cheers, Simon

On 07/03/10 14:41, Jan-Willem Maessen wrote:
On Mar 3, 2010, at 8:44 AM, Simon Marlow wrote:
On 01/03/2010 21:20, Michael Lesniak wrote:
Hello Bryan,
The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is the known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.
It's more correct to say that the parallel GC has difficulty when one of its threads is descheduled by the OS, because the other threads just spin waiting for it. Presumably some kernels are more susceptible than others due to differences in scheduling policy; I know they've been fiddling around with this a lot in Linux recently.
You typically don't see a problem when there are spare cores; the slowdown manifests when you are trying to use all the cores in your machine, so it affects people on dual-cores quite a lot. This probably explains why I've not been particularly affected by this myself, since I do most of my benchmarking on an 8-core box.
The fix that will be in 6.12.2 is to insert some yields, so that threads will yield rather than spinning indefinitely, and this seems to help a lot.
Be warned that inserting yield into a spin loop is also non-portable, and may make the problem *worse* on some systems.
The problem is that "yield" calls can be taken by the scheduler to mean "See, I'm a nice thread, giving up the core when I don't need it. Please give me extra Scheduling Doubloons."
Now let's say 7 of your 8 threads are doing this. It's likely that each one will yield to the next, and the 8th thread (the one you actually want on-processor) could take a long time to bubble up and get its moment. At one time on Solaris you could even livelock (because the scheduler didn't try particularly hard to be fair in the case of multiple yielding threads in a single process)---but that was admittedly a long time ago.
How depressing, thanks for that :)
The only recourse I know about is to tell the OS you're doing synchronization (by using OS-visible locking calls, say the ones in pthreads or some of the lightweight calls that Linux has added for the purpose). Obviously this has a cost if anyone falls out of the spin loop---and it's pretty likely some thread will have to wait a while.
Yes, so we tried using futexes on Linux; there's an experimental patch attached to http://hackage.haskell.org/trac/ghc/ticket/3553. It was definitely slower than the spinlocks on the benchmarks I tried.

I think the problem is that we're using these spinlocks to synchronise across all cores, and it's likely that these loops will have to spin for a while before exiting, because one or more of the running cores takes a while to get to a safe point. But really giving up the core and blocking is a lot worse, because the wakeup time is so long (you can see it pretty clearly in ThreadScope).

Anyway, I hope all this is just a temporary problem until we get CPU-independent GC working.

Cheers, Simon