
Hi all, I'm seeing no performance increase with a simple coarse-grained 2-thread code using Control.Concurrent. I compile with:
hc conc.hs -o conc --make -threaded
and I run with
time ./conc +RTS -N2
But using either "-N1" or "-N2", the program runs in about 1.8secs. (I'd prefer a longer running thread task, but my fib function currently runs out of memory). Anyway, my program is below, and I'm using GHC version 6.8.2 on a 2-core Pentium D. Can anyone help? module Main where import Control.Concurrent fibs = 0 : 1 : zipWith (+) fibs (tail fibs) heavytask m = putMVar m (fibs !! 100000) main = do ms <- sequence $ replicate 2 newEmptyMVar mapM_ (forkIO . heavytask) $ tail ms heavytask $ head ms ms' <- mapM takeMVar ms mapM_ print ms' Regards, Paul

On Fri, Dec 19, 2008 at 2:27 PM, Paul Keir
I'm seeing no performance increase with a simple coarse-grained 2-thread code using Control.Concurrent. I compile with:
I didn't test your code, but [...]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs) heavytask m = putMVar m (fibs !! 100000) [...]
probably fibs is being calculated only once, so just one thread calculates (fibs !! 100000) while others just keep waiting for the result. -- Felipe.

pkeir:
Hi all,
I'm seeing no performance increase with a simple coarse-grained 2-thread code using Control.Concurrent. I compile with:
hc conc.hs -o conc --make -threaded
Also, if you care about performance in the slightest , please use -O2 Code is typically 10-30x faster with optimisations on. -- Don

On Fri, Dec 19, 2008 at 9:27 AM, Paul Keir
Hi all,
I'm seeing no performance increase with a simple coarse-grained 2-thread code using Control.Concurrent. I compile with:
hc conc.hs -o conc --make -threaded
and I run with
time ./conc +RTS -N2
But using either "-N1" or "-N2", the program runs in about 1.8secs. (I'd prefer a longer running thread task, but my fib function currently runs out of memory).
Anyway, my program is below, and I'm using GHC version 6.8.2 on a 2-core Pentium D. Can anyone help?
module Main where
import Control.Concurrent
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
This is a serial algorithm. No matter what the rest of your program is doing, this calculation is sequential, so you will not see a speedup. Parallelizing computation of the fibonacci numbers is reasonably tricky. I think you might be able to do it using the fib(2n) identity. But not this simple algorithm: throwing processors at a problem does not automatically make it parallel. :-)
heavytask m = putMVar m (fibs !! 100000)
main = do ms <- sequence $ replicate 2 newEmptyMVar mapM_ (forkIO . heavytask) $ tail ms heavytask $ head ms ms' <- mapM takeMVar ms mapM_ print ms'
Regards, Paul _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Paul Keir wrote:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
This is a CAF (Constant Applicative Form). Since it is actually a constant it is never garbage collected, and is always shared, so each thread is only calculating it once. You have essentially created a lookup table. - - Jake -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAklLzvIACgkQye5hVyvIUKkaBACfTJfoWokgzmkyN8wm8zIeGc89 UcwAoK2VR8c0zCs0P6XTmAaJcN8oaDYc =9Yu/ -----END PGP SIGNATURE-----

On Fri, 2008-12-19 at 10:42 -0600, Jake McArthur wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Paul Keir wrote:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
This is a CAF (Constant Applicative Form). Since it is actually a constant it is never garbage collected, and is always shared, so each thread is only calculating it once. You have essentially created a lookup table.
Though note that with all our obvious suggestions there is still no speedup: heavytask m n = putMVar m $! (fibs !! 100000) where fibs = n : (n+1) : zipWith (+) fibs (tail fibs) -- so now fibs is not globally shared but is used per-heavytask -- it is also evaluated by heavy task rather than just putting a thunk -- into the MVar main = do ms <- sequence $ replicate 8 newEmptyMVar sequence_ [ forkIO (heavytask m n) | (m, n) <- zip ms [0..] ] ms' <- mapM takeMVar ms mapM_ print ms' Looking at the GC stats (+RTS -t -RTS) we see that the majority of the time in this program is spent doing GC and that when we run with -N4 the time spent doing GC is even higher. -N1 1.57 MUT (1.60 elapsed), 7.05 GC (7.16 elapsed) real 0m8.793s -N2 2.50 MUT (1.49 elapsed), 8.48 GC (7.33 elapsed) real 0m8.873s -N4 2.83 MUT (1.56 elapsed), 12.16 GC (7.95 elapsed) real 0m9.572s The process monitor indicates that in the -N1 case, one core hits 100% use for the full 8 seconds. In the -N2 case one core is hitting 90% utilisation with the other three cores doing a little work, up to about 40% utilisation. On some runs the core doing the most work swaps over. In one run at -N2 I got a segmentation fault. In the -N4 case, 4 cores hit between 30% and 80% utilisation. So this benchmark is primarily a stress test of the parallel garbage collector since it is GC that is taking 75-80% of the time. Note that the mutator elapsed time goes down slightly with 2 cores compared to 1 however the GC elapsed time goes up slightly. Duncan

So this benchmark is primarily a stress test of the parallel garbage collector since it is GC that is taking 75-80% of the time. Note that the mutator elapsed time goes down slightly with 2 cores compared to 1 however the GC elapsed time goes up slightly.
Thanks Duncan, Jake et al. I'm more familiar with MPI and OpenMP for parallelism; it seems I've got a lot more thinking to do when it comes to Haskell. I'll look at some more tutorials, and then most likely Data Parallel Haskell. Cheers, Paul

Duncan Coutts wrote:
On Fri, 2008-12-19 at 10:42 -0600, Jake McArthur wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
fibs = 0 : 1 : zipWith (+) fibs (tail fibs) This is a CAF (Constant Applicative Form). Since it is actually a constant it is never garbage collected, and is always shared, so each
Paul Keir wrote: thread is only calculating it once. You have essentially created a lookup table.
Though note that with all our obvious suggestions there is still no speedup:
heavytask m n = putMVar m $! (fibs !! 100000) where fibs = n : (n+1) : zipWith (+) fibs (tail fibs)
-- so now fibs is not globally shared but is used per-heavytask -- it is also evaluated by heavy task rather than just putting a thunk -- into the MVar
main = do ms <- sequence $ replicate 8 newEmptyMVar sequence_ [ forkIO (heavytask m n) | (m, n) <- zip ms [0..] ] ms' <- mapM takeMVar ms mapM_ print ms'
Looking at the GC stats (+RTS -t -RTS) we see that the majority of the time in this program is spent doing GC and that when we run with -N4 the time spent doing GC is even higher.
This is an interesting example. It shows up a weakness in the GC that I'm working on fixing right now. The interesting aspect of this example is that the thread stacks get large. You can see this by using +RTS -hT: a large chunk of the heap is taken up by TSOs. Each of those (fibs !! 100000) requires linear stack, because (fibs 100000) depends on (fibs 99999), and so on. That could probably be fixed by adding some strictness, but that's not the goal here - we should still be able to run the program in parallel. So when there are large stacks around, GC takes a long time because it has to traverse stacks. But we should be able to alleviate the problem by (a) using a larger heap, and (b) using parallel GC. Not doing parallel GC is seriously going to hurt peformance, because the data will have to be moved from one CPU's cache to another. But it turns out that parallel GC is misbehaving on this example too, because it doesn't force each stack to be scanned by the same thread that is executing it - I'm working on fixing that. Having each CPU be able to GC independently would be a big improvement, of course. We think we understand how this can be done in the context of GHC, it's just a matter of doing it, but it's a big job. Parallel performance is something that we expect to make dramatic improvements over the next few months as we investigate more programs and improve the tools. The current HEAD is already a lot better than 6.10.1. Cheers, Simon

Hi Duncan, I'm following the story regarding (parallel) GC in this example with interest, but forgive me if I ask a more minor question regarding your modification of an extra parameter, "n", to "heavytask". Does this really help (to ensure that each core does work independently)? Surely, with fibs now described in a where clause, the "0:1:etc." form would not be shared among the (8) instantiations of "heavytask"?
heavytask m n = putMVar m $! (fibs !! 100000) where fibs = n : (n+1) : zipWith (+) fibs (tail fibs)
Regards, Paul

On Tue, 2008-12-23 at 18:27 +0000, Paul Keir wrote:
Hi Duncan,
I'm following the story regarding (parallel) GC in this example with interest, but forgive me if I ask a more minor question regarding your modification of an extra parameter, "n", to "heavytask". Does this really help (to ensure that each core does work independently)? Surely, with fibs now described in a where clause, the "0:1:etc." form would not be shared among the (8) instantiations of "heavytask"?
Yes, that was the purpose of the modification, to ensure that the value was not shared but calculated independently. I wanted to test speedup with a trivially parallel workload. If the value really is shared then there will necessarily be no speedup. That is because the definition of this value does not use any explicit parallelism and ghc does not do any automatic parallelisation. Demanding the same evaluation from multiple threads just causes the other threads to block awaiting evaluation.
heavytask m n = putMVar m $! (fibs !! 100000) where fibs = n : (n+1) : zipWith (+) fibs (tail fibs)
To get parallel speedup when evaluating a single value like this we need a different definition. We could use IO threads with forkIO and explicit shared variables or message passing. However since this is a pure function a much nicer approach is to use a pure implementation using the `par` operator: http://darcs.haskell.org/nofib/parallel/parfib/Main.hs -- parallel version of the code with thresholding parfib :: Int -> Int -> Int parfib n t | n <= t = nfib n | otherwise = n1 `par` (n2 `pseq` n1 + n2 + 1) where n1 = parfib (n-1) t n2 = parfib (n-2) t Using `par` creates "sparks" which are even lighter weight than Haskell IO threads. They are just values/thunks and get evaluated by a pool of ordinary Haskell threads. Duncan

On Fri, Dec 19, 2008 at 9:27 AM, Paul Keir
module Main where
import Control.Concurrent
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
heavytask m = putMVar m (fibs !! 100000)
Oh, also, heavytask is not very heavy at all. It just writes the thunk (fibs !! 100000) into the MVar. Not a single number is added in this thread. You probably meant to have the thread evaluate its argument _before_ writing it to the variable: heavytask m = putMVar m $! (fibs !! 100000) (Or more transparently) heavytask m = let answer = fibs !! 100000 in answer `seq` putMVar m answer But as per my other comments, you will not see a speedup (in fact, you will probably see some slowdown as two threads compete to compute the same value). Luke
main = do ms <- sequence $ replicate 2 newEmptyMVar mapM_ (forkIO . heavytask) $ tail ms heavytask $ head ms ms' <- mapM takeMVar ms mapM_ print ms'
Regards, Paul _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Thanks Luke, and everyone else. Ok, back to the drawing board.
Paul
From: Luke Palmer [mailto:lrpalmer@gmail.com]
Sent: 19 December 2008 16:44
To: Paul Keir
Cc: haskell-cafe@haskell.org
Subject: Re: [Haskell-cafe] forkIO on multicore[MESSAGE NOT SCANNED]
On Fri, Dec 19, 2008 at 9:27 AM, Paul Keir

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Paul Keir wrote:
Thanks Luke, and everyone else. Ok, back to the drawing board.
You may be interested in this: http://cgi.cse.unsw.edu.au/~dons/blog/2007/11/29 - - Jake -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAklL0GQACgkQye5hVyvIUKmnPgCgg6dK2KeBgE0T3T1He53+jSOb SkcAoMiQTdIvqQG5At/Q+mg7Ybos4JWq =+y/+ -----END PGP SIGNATURE-----

I did indeed intend for the threads to evaluate before writing to the two variables, thanks.
heavytask m = putMVar m $! (fibs !! 100000)
I now see a time difference, but as you suggested, in the wrong direction (1.5s for one, and 3.6s for two threads). I was hoping for each thread to independently calculate a fib number (but only to easily give them something to do) . Are the threads really in competition though? I'm hoping for each thread to write its own result; so giving the same answer twice. With the "-N2" and "-threaded" switches, can I not expect each thread to run on a separate core? Paul

On Fri, Dec 19, 2008 at 3:16 PM, Paul Keir
can I not expect each thread to run on a separate core?
Try moving 'fibs' inside 'heavyTask', like heavytask m = putMVar m $! (let fibs = 0 : 1 : zipWith (+) fibs (tail fibs) in (fibs !! 100000)) Maybe this may trick the compiler into not sharing fibs. -- Felipe.
participants (7)
-
Don Stewart
-
Duncan Coutts
-
Felipe Lessa
-
Jake McArthur
-
Luke Palmer
-
Paul Keir
-
Simon Marlow