testing par with simple program

Hi, reading a previous thread I got interested. I simplified the example pointed by dons in import Control.Parallel main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = ack 3 11 ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1)) compiled with ghc --make prova -O2 -threaded timings paolino@paolino-casa:~$ time ./prova +RTS -N1 32762 real 0m7.031s user 0m6.304s sys 0m0.004s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762 real 0m6.997s user 0m6.728s sys 0m0.020s paolino@paolino-casa:~$ without optimizations it gets worse paolino@paolino-casa:~$ time ./prova +RTS -N1 32762 real 1m20.706s user 1m18.197s sys 0m0.104s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762 real 1m38.927s user 1m45.039s sys 0m0.536s paolino@paolino-casa:~$ staring at the resource usage graph I can see it does use 2 cores when told to do it, but with -N1 the used cpu goes 100% and with -N2 they both run just over 50% thanks for comments paolino

paolo.veronelli:
Hi, reading a previous thread I got interested. I simplified the example pointed by dons in
import Control.Parallel
main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = ack 3 11
ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))
compiled with ghc --make prova -O2 -threaded
timings paolino@paolino-casa:~$ time ./prova +RTS -N1 32762
real 0m7.031s user 0m6.304s sys 0m0.004s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762
real 0m6.997s user 0m6.728s sys 0m0.020s paolino@paolino-casa:~$
without optimizations it gets worse
paolino@paolino-casa:~$ time ./prova +RTS -N1 32762
real 1m20.706s user 1m18.197s sys 0m0.104s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762
real 1m38.927s user 1m45.039s sys 0m0.536s paolino@paolino-casa:~$
staring at the resource usage graph I can see it does use 2 cores when told to do it, but with -N1 the used cpu goes 100% and with -N2 they both run just over 50%
thanks for comments
Firstly, a and b are identical, so GHC commons them up. The compiler transforms it into: a `par` a `seq` print (a + a) So you essentially fork a spark to evaluate 'a', and then have the main thread also evaluate 'a' again. One of them wins, then you add the result to itself. The runtime may choose not to convert your first spark into a thread. Running with a 2009 GHC head snapshot, we can see with +RTS -sstderr SPARKS: 1 (0 converted, 0 pruned) That indeed, it doesn't convert your `par` into a real thread. While, for example, the helloworld on the wiki: http://haskell.org/haskellwiki/Haskell_in_5_steps Converts 2 sparks to 2 theads: SPARKS: 2 (2 converted, 0 pruned) ./B +RTS -threaded -N2 -sstderr 2.13s user 0.04s system 137% cpu 1.570 total -- Don

A better test program import Control.Parallel main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = fib 39 ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1)) fib 0 = 0 fib 1 = 1 fib n = fib (n-1) + fib (n-2) running it , these are the results paolino@paolino-casa:~$ ghc --make prova -threaded [1 of 1] Compiling Main ( prova.hs, prova.o ) Linking prova ... paolino@paolino-casa:~$ time ./prova +RTS -N1 63262367 real 1m17.485s user 1m16.473s sys 0m0.392s paolino@paolino-casa:~$ time ./prova +RTS -N2 63262367 real 1m20.186s user 1m31.554s sys 0m0.600s paolino@paolino-casa:~$ touch prova.hs paolino@paolino-casa:~$ ghc --make prova -O2 -threaded [1 of 1] Compiling Main ( prova.hs, prova.o ) Linking prova ... paolino@paolino-casa:~$ time ./prova +RTS -N1 63262367 real 0m17.652s user 0m15.277s sys 0m0.108s paolino@paolino-casa:~$ time ./prova +RTS -N2 63262367 real 0m13.650s user 0m15.121s sys 0m0.188s
From the resource graph and the timings it is clear that the program is not able to use all the 2 cores powers, considering computing 'a' alone is about 7 seconds and 'b' alone 9. What is retaining the cpu's to run full power ?
paolino
2009/8/21 Don Stewart
paolo.veronelli:
Hi, reading a previous thread I got interested. I simplified the example pointed by dons in
import Control.Parallel
main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = ack 3 11
ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))
compiled with ghc --make prova -O2 -threaded
timings paolino@paolino-casa:~$ time ./prova +RTS -N1 32762
real 0m7.031s user 0m6.304s sys 0m0.004s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762
real 0m6.997s user 0m6.728s sys 0m0.020s paolino@paolino-casa:~$
without optimizations it gets worse
paolino@paolino-casa:~$ time ./prova +RTS -N1 32762
real 1m20.706s user 1m18.197s sys 0m0.104s paolino@paolino-casa:~$ time ./prova +RTS -N2 32762
real 1m38.927s user 1m45.039s sys 0m0.536s paolino@paolino-casa:~$
staring at the resource usage graph I can see it does use 2 cores when told to do it, but with -N1 the used cpu goes 100% and with -N2 they both run just over 50%
thanks for comments
Firstly, a and b are identical, so GHC commons them up. The compiler transforms it into:
a `par` a `seq` print (a + a)
So you essentially fork a spark to evaluate 'a', and then have the main thread also evaluate 'a' again. One of them wins, then you add the result to itself. The runtime may choose not to convert your first spark into a thread.
Running with a 2009 GHC head snapshot, we can see with +RTS -sstderr
SPARKS: 1 (0 converted, 0 pruned)
That indeed, it doesn't convert your `par` into a real thread.
While, for example, the helloworld on the wiki:
http://haskell.org/haskellwiki/Haskell_in_5_steps
Converts 2 sparks to 2 theads:
SPARKS: 2 (2 converted, 0 pruned) ./B +RTS -threaded -N2 -sstderr 2.13s user 0.04s system 137% cpu 1.570 total
-- Don

paolo.veronelli:
A better test program
import Control.Parallel
main = a `par` b `pseq` print (a + b ) where a = ack 3 11 b = fib 39
ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))
fib 0 = 0 fib 1 = 1 fib n = fib (n-1) + fib (n-2)
running it , these are the results
From the resource graph and the timings it is clear that the program is not able to use all the 2 cores powers, considering computing 'a' alone is about 7 seconds and 'b' alone 9. What is retaining the cpu's to run full power ?
Some advice on analysing parallel programs: * Read Simon Marlow's excellent paper: "Runtime Support for Multicore Haskell" http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-... * Install GHC Head from 2009, it includes spark profiling. * Never bother measuring without using -O or -O2 -- unoptimized code is just not useful to think about. * Compile your program: $ ghc-6.11.20090228 -O2 --make -threaded A.hs [1 of 1] Compiling Main ( A.hs, A.o ) Linking A ... * Run it with +RTS -sstderr $ ./A +RTS -N2 -sstderr ./A +RTS -N2 -sstderr 63262367 ... Generation 0: 12021 collections, 0 parallel, 1.33s, 1.42s elapsed Generation 1: 2 collections, 1 parallel, 0.00s, 0.00s elapsed Parallel GC work balance: 1.03 (426 / 414, ideal 2) SPARKS: 1 (1 converted, 0 pruned) Productivity 90.7% of total user, 126.4% of total elapsed And we see 3 things: * It did use the parallel GC : good! * It did convert 1 spark to a thread: good! * Productivity was greater than 100%: good! So this looks fine! So I suspect you're falling into the ghc 6.10.x series bug where < 3 sparks would not be converted. Try with either 3 sparks, or using GHC 6.11.x series. -- Don
participants (2)
-
Don Stewart
-
Paolino