Threads and memory management

Dear all, I was wondering what the current status of the GHC RTS is with respect to threading. Is it true that the allocator and deallocator (garbage collector) are still single-threaded? I made this example:

    import Control.Concurrent
    import Control.Concurrent.QSemN

    primes1 = sieve [ 2 .. ]
    primes2 = 2 : sieve [ 3, 5 .. ]
    sieve (x : ys) = x : sieve ( filter ( \ y -> 0 < y `mod` x ) ys )

    main = do
        s <- newQSemN 0
        forkIO $ do print $ sum $ take 10000 primes1 ; signalQSemN s 1
        forkIO $ do print $ sum $ take 10000 primes2 ; signalQSemN s 1
        waitQSemN s 2

Between the two computations, I see absolutely no data dependencies. So the naive expectation is that using two threads (in the RTS, on a multicore machine) gives half the execution time.

I compiled with

    ghc-6.10.2 -o Par -O2 Par.hs -threaded

and when I run with +RTS -N2 (instead of -N1), I get my CPU load at 140 % (instead of 100 %), but the total run time (wall clock) stays nearly constant (i.e. CPU time goes up).

For reference, I'm pretty sure a similar thing happens with the current mono implementation (they use the Boehm GC, and that does not know about C#/CIL threads?), and this is preventing multi-core speedups in "straightforward" programs (ones that have no data dependencies, but lots of object creation, e.g. via (hidden) Enumerators and such).

Well, then, if the two Haskell threads are (nearly) completely independent like the above, it would be better to compile and run two separate executables and have them communicate via the OS (pipe or port). But that shouldn't be! (the OS being better than Haskell)

Is there a way of partitioning the memory (managed by the GHC RTS) into totally independent parts that each have their own stand-alone memory management? Of course all communication would then have to go via some Control.Concurrent.Chan, but that should be fine, if there is little of it.

Well, just some thought. This idea can't be new? Tell me why it couldn't possibly work ... J.W.
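For what it's worth, the "communicate only via Chan" shape from the paragraph above can already be sketched with today's API. This is a minimal illustrative variant of the Par example (shortened to `take 5` so it runs instantly), where each worker forces its own result before sending it, so only the final value crosses the channel:

```haskell
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Exception (evaluate)

primes1, primes2 :: [Int]
primes1 = sieve [ 2 .. ]
primes2 = 2 : sieve [ 3, 5 .. ]

sieve :: [Int] -> [Int]
sieve (x : ys) = x : sieve (filter (\y -> 0 < y `mod` x) ys)

main :: IO ()
main = do
  c <- newChan
  -- evaluate forces the sum inside the worker thread; a bare
  -- writeChan would hand main an unevaluated thunk, and main would
  -- end up doing the work sequentially.
  _ <- forkIO $ evaluate (sum (take 5 primes1)) >>= writeChan c
  _ <- forkIO $ evaluate (sum (take 5 primes2)) >>= writeChan c
  a <- readChan c
  b <- readChan c
  print (a + b)  -- arrival order doesn't matter for a sum; prints 56
```

Of course this only controls what is communicated, not where it is allocated: both workers still share the one RTS-managed heap, which is exactly the limitation the question is about.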

Here is some more data. It seems the behaviour depends on 32/64-bit arch?

#######################################################
waldmann@master:~/tmp$ uname -a
Linux master 2.6.18-6-amd64 #1 SMP Fri Dec 12 05:49:32 UTC 2008 x86_64 GNU/Linux
waldmann@master:~/tmp$ time ./Par +RTS -N1
496165411
496165411

real    0m22.580s
user    0m22.541s
sys     0m0.040s
waldmann@master:~/tmp$ time ./Par +RTS -N2
496165411
496165411

real    0m21.259s
user    0m26.678s
sys     0m0.164s
########################################################
waldmann@box:~/tmp> uname -a
Linux box 2.6.27.21-0.1-pae #1 SMP 2009-03-31 14:50:44 +0200 i686 i686 i386 GNU/Linux
waldmann@box:~/tmp> time ./Par +RTS -N1
496165411
496165411

real    0m29.802s
user    0m29.670s
sys     0m0.028s
waldmann@box:~/tmp> time ./Par +RTS -N2
496165411
496165411

real    0m11.219s
user    0m14.917s
sys     0m0.164s
###################################################

I don't understand this. The fun thing is that the first machine is an Intel(R) Xeon(R) CPU X5365 @ 3.00GHz, while the second is a much cheaper

model name : Genuine Intel(R) CPU T2050 @ 1.60GHz

On 25/04/2009 13:31, j.waldmann wrote:
Here is some more data. It seems the behaviour depends on 32/64 bit arch?
#######################################################
waldmann@master:~/tmp$ uname -a
Linux master 2.6.18-6-amd64 #1 SMP Fri Dec 12 05:49:32 UTC 2008 x86_64 GNU/Linux
waldmann@master:~/tmp$ time ./Par +RTS -N1
496165411
496165411
real 0m22.580s  user 0m22.541s  sys 0m0.040s
waldmann@master:~/tmp$ time ./Par +RTS -N2
496165411
496165411
real 0m21.259s  user 0m26.678s  sys 0m0.164s
########################################################
waldmann@box:~/tmp> uname -a
Linux box 2.6.27.21-0.1-pae #1 SMP 2009-03-31 14:50:44 +0200 i686 i686 i386 GNU/Linux
waldmann@box:~/tmp> time ./Par +RTS -N1
496165411
496165411
real 0m29.802s  user 0m29.670s  sys 0m0.028s
waldmann@box:~/tmp> time ./Par +RTS -N2
496165411
496165411
real 0m11.219s  user 0m14.917s  sys 0m0.164s
This is a very strange result: the user time should not *decrease*, but rather should stay the same or increase a bit when adding cores.

If your program is GC-bound, then using a 32-bit build will improve performance, simply because it is shoveling half as much memory around. Check whether it is GC-bound by using +RTS -sstderr.

Anyway, the current situation is that with GHC 6.10.2 there are a lot of performance quirks and bottlenecks with respect to parallel programs, some of which have been squashed in HEAD. Try a recent HEAD snapshot if you can, or wait for 6.12.1.

Cheers, Simon

Hello Simon, Monday, April 27, 2009, 2:13:26 PM, you wrote:
This is a very strange result: the user time should not *decrease*, but rather should stay the same or increase a bit when adding cores. If
The ability to run two threads simultaneously may increase performance in some scenarios. One example is consumer+producer threads: with 2 OS threads, all produced data is consumed immediately, so we live inside the L2 cache. With 1 OS thread, the produced data overflows the cache before the threads are switched, making execution much slower.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
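That consumer+producer shape can be sketched as follows (an illustrative example, not the original benchmark): an MVar acts as a one-slot buffer, so the producer can never run far ahead of the consumer, which is what keeps the working set cache-sized when both threads really do run at once.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
import Control.Concurrent.MVar

-- With +RTS -N2 the consumer can drain each item while it is still
-- cache-hot; with -N1 the two threads merely alternate at scheduler
-- switches, so items sit in the (full) MVar between time slices.
main :: IO ()
main = do
  box  <- newEmptyMVar
  done <- newEmptyMVar
  _ <- forkIO $ mapM_ (putMVar box) [1 .. 10000 :: Int]   -- producer
  _ <- forkIO $ do                                        -- consumer
         let loop !acc 0 = putMVar done acc
             loop !acc k = do x <- takeMVar box
                              loop (acc + x) (k - 1 :: Int)
         loop 0 10000
  total <- takeMVar done
  print total  -- prints 50005000
```

A plain Chan would decouple the two threads completely (unbounded buffering), which is simpler but loses exactly the locality effect described above; the single-slot MVar is the minimal bounded buffer.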

On 27/04/2009 11:29, Bulat Ziganshin wrote:
Hello Simon,
Monday, April 27, 2009, 2:13:26 PM, you wrote:
This is a very strange result: the user time should not *decrease*, but rather should stay the same or increase a bit when adding cores. If
The ability to run two threads simultaneously may increase performance in some scenarios. One example is consumer+producer threads: with 2 OS threads, all produced data is consumed immediately, so we live inside the L2 cache. With 1 OS thread, the produced data overflows the cache before the threads are switched, making execution much slower.
Yes, good point. Though I'd be surprised if that was happening here.

Cheers, Simon

Thanks for your comments.
Check whether it is GC-bound by using +RTS -sstderr.
Well yes, it does a lot of GC (there's no way for the compiler to optimize away the list of primes), because that was the point of the example: to confirm (or disprove) that GC hurts parallelism (at the moment).

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   13.23s  (  7.98s elapsed)
  GC    time   14.12s  ( 14.11s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   27.35s  ( 22.09s elapsed)

  %GC time     51.6%  (63.9% elapsed)
Try a recent HEAD snapshot if you can, or wait for 6.12.1.
I did, with 6.11.20090425, and it coredumps with +RTS -N2 (on x86_64):

  Program received signal SIGSEGV, Segmentation fault.
  [Switching to Thread 0x41001950 (LWP 1007)]
  0x000000000044e831 in yieldCapability ()
  Current language: auto; currently asm
  (gdb) where
  #0  0x000000000044e831 in yieldCapability ()
  #1  0x000000000042d8d3 in schedule ()
  #2  0x000000000042e485 in workerStart ()
  #3  0x00002b680cdcdfc7 in start_thread () from /lib/libpthread.so.0
  #4  0x00002b680d0b25ad in clone () from /lib/libc.so.6
  #5  0x0000000000000000 in ?? ()

But it runs with +RTS -N2 -q1 (I don't know exactly what this does), and the numbers did not change much (down from 22 sec to maybe 21 sec).

PS: yes, I confirmed that the OS can run the two "primes" enumerations (as separately compiled executables) in parallel in 6 sec wall time.

Best regards, J.W.

On 28/04/2009 17:25, Johannes Waldmann wrote:
Thanks for your comments.
Check whether it is GC-bound by using +RTS -sstderr.
Well yes, it does a lot of GC (there's no way for the compiler to optimize away the list of primes) because that was the point of the example: to confirm (or disprove) that GC hurts parallelism (at the moment).
INIT  time  0.00s (  0.00s elapsed)
MUT   time 13.23s (  7.98s elapsed)
GC    time 14.12s ( 14.11s elapsed)
EXIT  time  0.00s (  0.00s elapsed)
Total time 27.35s ( 22.09s elapsed)
%GC time 51.6% (63.9% elapsed)
Try a recent HEAD snapshot if you can, or wait for 6.12.1.
I did with 6.11.20090425 and it coredumps with +RTS -N2 (on x86_64)
That's worrying, but I don't see a core dump here. Here are my results:

GHC 6.11.20090429 -N1:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   13.52s  ( 13.64s elapsed)
  GC    time   21.25s  ( 21.23s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   34.76s  ( 34.87s elapsed)

GHC 6.11.20090429 -N2 -qg0 -qb:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   14.40s  (  7.21s elapsed)
  GC    time   18.35s  (  9.22s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   32.75s  ( 16.44s elapsed)

which, if I'm not mistaken, is super-linear speedup :-)

Don't forget the -qg0 -qb flags with HEAD; these flags usually give the best parallel GC performance at the moment. For the release this might be the default, but I still have to do some more experiments.

Cheers, Simon

marlowsd:
On 28/04/2009 17:25, Johannes Waldmann wrote:
Thanks for your comments.
Check whether it is GC-bound by using +RTS -sstderr.
Well yes, it does a lot of GC (there's no way for the compiler to optimize away the list of primes) because that was the point of the example: to confirm (or disprove) that GC hurts parallelism (at the moment).
INIT  time  0.00s (  0.00s elapsed)
MUT   time 13.23s (  7.98s elapsed)
GC    time 14.12s ( 14.11s elapsed)
EXIT  time  0.00s (  0.00s elapsed)
Total time 27.35s ( 22.09s elapsed)
%GC time 51.6% (63.9% elapsed)
Try a recent HEAD snapshot if you can, or wait for 6.12.1.
I did with 6.11.20090425 and it coredumps with +RTS -N2 (on x86_64)
That's worrying, but I don't see a core dump here. Here are my results:
GHC 6.11.20090429 -N1:
INIT  time  0.00s (  0.00s elapsed)
MUT   time 13.52s ( 13.64s elapsed)
GC    time 21.25s ( 21.23s elapsed)
EXIT  time  0.00s (  0.00s elapsed)
Total time 34.76s ( 34.87s elapsed)
GHC 6.11.20090429 -N2 -qg0 -qb:
INIT  time  0.00s (  0.00s elapsed)
MUT   time 14.40s (  7.21s elapsed)
GC    time 18.35s (  9.22s elapsed)
EXIT  time  0.00s (  0.00s elapsed)
Total time 32.75s ( 16.44s elapsed)
which, if I'm not mistaken, is super-linear speedup :-)
Don't forget the -qg0 -qb flags with HEAD, these flags usually give the best parallel GC performance at the moment. For the release this might be the default, I still have to do some more experiments.
I've added this interesting bit of info to the par perf. wiki: http://haskell.org/haskellwiki/Performance/Parallel
participants (5)
-
Bulat Ziganshin
-
Don Stewart
-
j.waldmann
-
Johannes Waldmann
-
Simon Marlow