testing par with simple program

older
Keeping an indexed collection of...

Paolino

21 Aug 2009 21 Aug '09

5:21 p.m.

Hi, reading a previous thread I got interested. I simplified the example pointed by dons in import Control.Parallel main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = ack 3 11 ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1)) compiled with ghc --make prova -O2 -threaded timings paolino@paolino-casa:~$ time ./prova +RTS -N1 32762 real 0m7.031s user 0m6.304s sys 0m0.004s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762 real 0m6.997s user 0m6.728s sys 0m0.020s paolino@paolino-casa:~$ without optimizations it gets worse paolino@paolino-casa:~$ time ./prova +RTS -N1 32762 real 1m20.706s user 1m18.197s sys 0m0.104s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762 real 1m38.927s user 1m45.039s sys 0m0.536s paolino@paolino-casa:~$ staring at the resource usage graph I can see it does use 2 cores when told to do it, but with -N1 the used cpu goes 100% and with -N2 they both run just over 50% thanks for comments paolino

Attachments:

attachment.html (text/html — 1.2 KB)

Show replies by date

Don Stewart

21 Aug 21 Aug

5:37 p.m.

paolo.veronelli:

...

Hi, reading a previous thread I got interested. I simplified the example pointed by dons in

import Control.Parallel

main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = ack 3 11

ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))

compiled with ghc --make prova -O2 -threaded

timings paolino@paolino-casa:~$ time ./prova +RTS -N1 32762

real 0m7.031s user 0m6.304s sys 0m0.004s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762

real 0m6.997s user 0m6.728s sys 0m0.020s paolino@paolino-casa:~$

without optimizations it gets worse

paolino@paolino-casa:~$ time ./prova +RTS -N1 32762

real 1m20.706s user 1m18.197s sys 0m0.104s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762

real 1m38.927s user 1m45.039s sys 0m0.536s paolino@paolino-casa:~$

staring at the resource usage graph I can see it does use 2 cores when told to do it, but with -N1 the used cpu goes 100% and with -N2 they both run just over 50%

thanks for comments

Firstly, a and b are identical, so GHC commons them up. The compiler transforms it into: a `par` a `seq` print (a + a) So you essentially fork a spark to evaluate 'a', and then have the main thread also evaluate 'a' again. One of them wins, then you add the result to itself. The runtime may choose not to convert your first spark into a thread. Running with a 2009 GHC head snapshot, we can see with +RTS -sstderr SPARKS: 1 (0 converted, 0 pruned) That indeed, it doesn't convert your `par` into a real thread. While, for example, the helloworld on the wiki: http://haskell.org/haskellwiki/Haskell_in_5_steps Converts 2 sparks to 2 theads: SPARKS: 2 (2 converted, 0 pruned) ./B +RTS -threaded -N2 -sstderr 2.13s user 0.04s system 137% cpu 1.570 total -- Don

Paolino

6:17 p.m.

A better test program import Control.Parallel main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = fib 39 ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1)) fib 0 = 0 fib 1 = 1 fib n = fib (n-1) + fib (n-2) running it , these are the results paolino@paolino-casa:~$ ghc --make prova -threaded [1 of 1] Compiling Main ( prova.hs, prova.o ) Linking prova ... paolino@paolino-casa:~$ time ./prova +RTS -N1 63262367 real 1m17.485s user 1m16.473s sys 0m0.392s paolino@paolino-casa:~$ time ./prova +RTS -N2 63262367 real 1m20.186s user 1m31.554s sys 0m0.600s paolino@paolino-casa:~$ touch prova.hs paolino@paolino-casa:~$ ghc --make prova -O2 -threaded [1 of 1] Compiling Main ( prova.hs, prova.o ) Linking prova ... paolino@paolino-casa:~$ time ./prova +RTS -N1 63262367 real 0m17.652s user 0m15.277s sys 0m0.108s paolino@paolino-casa:~$ time ./prova +RTS -N2 63262367 real 0m13.650s user 0m15.121s sys 0m0.188s

...

From the resource graph and the timings it is clear that the program is not able to use all the 2 cores powers, considering computing 'a' alone is about 7 seconds and 'b' alone 9. What is retaining the cpu's to run full power ?

paolino 2009/8/21 Don Stewart

...

paolo.veronelli:

...
Hi, reading a previous thread I got interested. I simplified the example pointed by dons in

import Control.Parallel

main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = ack 3 11

ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))

compiled with ghc --make prova -O2 -threaded

timings paolino@paolino-casa:~$ time ./prova +RTS -N1 32762

real 0m7.031s user 0m6.304s sys 0m0.004s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762

real 0m6.997s user 0m6.728s sys 0m0.020s paolino@paolino-casa:~$

without optimizations it gets worse

paolino@paolino-casa:~$ time ./prova +RTS -N1 32762

real 1m20.706s user 1m18.197s sys 0m0.104s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762

real 1m38.927s user 1m45.039s sys 0m0.536s paolino@paolino-casa:~$

staring at the resource usage graph I can see it does use 2 cores when told to do it, but with -N1 the used cpu goes 100% and with -N2 they both run just over 50%

thanks for comments

Firstly, a and b are identical, so GHC commons them up. The compiler transforms it into:

a `par` a `seq` print (a + a)

So you essentially fork a spark to evaluate 'a', and then have the main thread also evaluate 'a' again. One of them wins, then you add the result to itself. The runtime may choose not to convert your first spark into a thread.

Running with a 2009 GHC head snapshot, we can see with +RTS -sstderr

SPARKS: 1 (0 converted, 0 pruned)

That indeed, it doesn't convert your `par` into a real thread.

While, for example, the helloworld on the wiki:

http://haskell.org/haskellwiki/Haskell_in_5_steps

Converts 2 sparks to 2 theads:

SPARKS: 2 (2 converted, 0 pruned) ./B +RTS -threaded -N2 -sstderr 2.13s user 0.04s system 137% cpu 1.570 total

-- Don

Don Stewart

6:30 p.m.

paolo.veronelli:

...

A better test program

import Control.Parallel

main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = fib 39

ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))

fib 0 = 0 fib 1 = 1 fib n = fib (n-1) + fib (n-2)

running it , these are the results

From the resource graph and the timings it is clear that the program is not able to use all the 2 cores powers, considering computing 'a' alone is about 7 seconds and 'b' alone 9. What is retaining the cpu's to run full power ?

Some advice on analysing parallel programs: * Read Simon Marlow's excellent paper: "Runtime Support for Multicore Haskell" http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-... * Install GHC Head from 2009, it includes spark profiling. * Never bother measuring without using -O or -O2 -- unoptimized code is just not useful to think about. * Compile your program: $ ghc-6.11.20090228 -O2 --make -threaded A.hs [1 of 1] Compiling Main ( A.hs, A.o ) Linking A ... * Run it with +RTS -sstderr $ ./A +RTS -N2 -sstderr ./A +RTS -N2 -sstderr 63262367 ... Generation 0: 12021 collections, 0 parallel, 1.33s, 1.42s elapsed Generation 1: 2 collections, 1 parallel, 0.00s, 0.00s elapsed Parallel GC work balance: 1.03 (426 / 414, ideal 2) SPARKS: 1 (1 converted, 0 pruned) Productivity 90.7% of total user, 126.4% of total elapsed And we see 3 things: * It did use the parallel GC : good! * It did convert 1 spark to a thread: good! * Productivity was greater than 100%: good! So this looks fine! So I suspect you're falling into the ghc 6.10.x series bug where < 3 sparks would not be converted. Try with either 3 sparks, or using GHC 6.11.x series. -- Don

5978

Age (days ago)

5978

Last active (days ago)

List overview

Download

3 comments

2 participants

participants (2)

Don Stewart
Paolino