[Haskell-cafe] Re: Parallel Pi

19 Mar 2010

On Thursday 18 March 2010 22:44:55, Simon Marlow wrote:
...
On 17/03/10 21:30, Daniel Fischer wrote:
...
On Wednesday 17 March 2010 19:49:57, Artyom Kazak wrote:
...
Hello!
I tried to implement a parallel Monte Carlo computation of Pi,
using two cores:
<move>
...
But it uses only one core:
<snip>
...
We see that our one spark is pruned. Why?
Well, the problem is that your tasks don't do any real work - yet.
piMonte returns a thunk almost immediately; that thunk is then
evaluated by show, long after your chance for parallelism is gone. You
must force the work to be done _in_ r1 and r2, then you get
parallelism:
Generation 0:  2627 collections,  2626 parallel,  0.14s,  0.12s elapsed
Generation 1:     1 collections,     1 parallel,  0.00s,  0.00s elapsed
Parallel GC work balance: 1.79 (429262 / 240225, ideal 2)
                       MUT time (elapsed)       GC time  (elapsed)
   Task  0 (worker) :    0.00s    (  8.22s)       0.00s    (  0.00s)
   Task  1 (worker) :    8.16s    (  8.22s)       0.01s    (  0.01s)
   Task  2 (worker) :    8.00s    (  8.22s)       0.13s    (  0.11s)
   Task  3 (worker) :    0.00s    (  8.22s)       0.00s    (  0.00s)
SPARKS: 1 (1 converted, 0 pruned)
INIT  time    0.00s  (  0.00s elapsed)
   MUT   time   16.14s  (  8.22s elapsed)
   GC    time    0.14s  (  0.12s elapsed)
   EXIT  time    0.00s  (  0.00s elapsed)
   Total time   16.29s  (  8.34s elapsed)
%GC time       0.9%  (1.4% elapsed)
Alloc rate    163,684,377 bytes per MUT second
Productivity  99.1% of total user, 193.5% of total elapsed
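
For readers following along, here is a minimal sketch of what "forcing the work inside r1 and r2" can look like. It is not Artyom's code (that was snipped above). countHits is a hypothetical stand-in for piMonte, and the toy linear congruential generator is only an assumption to keep the example free of extra packages. The point is that each half returns a strict Int, so evaluating the sparked value to weak head normal form already does the whole count.

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for the snipped piMonte: counts how many of n
-- pseudo-random points land inside the unit quarter circle.
-- The generator is a toy LCG (an assumption, not from the thread).
countHits :: Int -> Int -> Int
countHits seed n = go seed n 0
  where
    go _ 0 acc = acc
    go s k acc =
      let s1   = next s
          s2   = next s1
          x    = toUnit s1
          y    = toUnit s2
          acc' = if x * x + y * y <= 1 then acc + 1 else acc
      in acc' `seq` go s2 (k - 1) acc'   -- strict accumulator, no thunk chain
    next s   = (s * 1103515245 + 12345) `mod` 2147483648
    toUnit s = fromIntegral s / 2147483648 :: Double

-- Spark the first half, evaluate the second half on the current thread,
-- and only then combine. Both halves are plain Ints, so the spark has
-- real work to do instead of handing back an unevaluated thunk.
estimatePi :: Int -> Double
estimatePi n =
  let half = n `div` 2
      r1   = countHits 1 half
      r2   = countHits 2 (n - half)
  in r1 `par` (r2 `pseq` (4 * fromIntegral (r1 + r2) / fromIntegral n))

main :: IO ()
main = print (estimatePi 2000000)
```

Compile with -threaded and run with +RTS -N2 -s to see whether the spark is converted. Whether it actually runs faster than -N1 is a separate question, as the numbers in this thread show.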
But alas, it is slower than the single-threaded calculation :(
INIT  time    0.00s  (  0.00s elapsed)
   MUT   time    7.08s  (  7.10s elapsed)
   GC    time    0.08s  (  0.08s elapsed)
   EXIT  time    0.00s  (  0.00s elapsed)
   Total time    7.15s  (  7.18s elapsed)
It works for me (GHC 6.12.1):
SPARKS: 1 (1 converted, 0 pruned)
INIT  time    0.00s  (  0.00s elapsed)
   MUT   time    9.05s  (  4.54s elapsed)
   GC    time    0.12s  (  0.09s elapsed)
   EXIT  time    0.00s  (  0.01s elapsed)
   Total time    9.12s  (  4.63s elapsed)
wall-clock speedup of 1.93 on 2 cores.
Is that Artyom's original code or with the pseq'ed length?
The original didn't convert any sparks for me (~103% cpu, because of 
parallel GC, but the calculation always used just one thread).
I'm also using 6.12.1.

And, with -N2, I also have a productivity of 193.5%, but the elapsed time 
is larger than the elapsed time for -N1. How long does it take with -N1 for 
you?
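
To make the distinction between productivity and speedup concrete, here is a small worked calculation using only the figures quoted above. The formulas are my reading of the +RTS -s summary (they reproduce the printed percentages), not something taken from the RTS source, and the variable names are made up for illustration.

```haskell
-- Figures copied from the +RTS -s summaries quoted in this thread.
mutCpuN2, totalCpuN2, totalElapsedN2, totalElapsedN1 :: Double
mutCpuN2       = 16.14  -- MUT CPU time with -N2
totalCpuN2     = 16.29  -- total CPU time with -N2
totalElapsedN2 = 8.34   -- total wall-clock time with -N2
totalElapsedN1 = 7.18   -- total wall-clock time of the single-threaded run

-- Share of CPU time spent in the mutator (the "of total user" figure).
productivityUser :: Double
productivityUser = mutCpuN2 / totalCpuN2

-- Mutator CPU time per wall-clock second (the "of total elapsed" figure).
productivityElapsed :: Double
productivityElapsed = mutCpuN2 / totalElapsedN2

-- What actually matters: wall clock of -N1 divided by wall clock of -N2.
wallClockSpeedup :: Double
wallClockSpeedup = totalElapsedN1 / totalElapsedN2

main :: IO ()
main = mapM_ print [productivityUser, productivityElapsed, wallClockSpeedup]
```

The first two values come out at roughly 0.99 and 1.935, matching the 99.1% and 193.5% in the summary, while the third is about 0.86. So both cores are busy, yet the -N2 run does roughly twice the mutator work of the -N1 run (16.14s against 7.08s) and ends up slower on the wall clock.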

It's the same with 6.10.3, no converted sparks for the original code, 
parallelism with the pseq'ed length, but it takes longer than with -N1.
...
What hardware are you using there?
3.06GHz Pentium 4, 2 cores.
I have mixed results with parallelism, some programmes get a speed-up of 
nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take 
about the same wall-clock time as the single threaded programme, some - 
like this - take longer despite using both cores intensively.
...
Have you tried changing any GC settings?
I've played around a little with -qg, -qb and -C, but they had little 
effect. Any tips on what else might be worth trying?
...
Cheers,
  Simon