
I have two parallel algorithms that use the lightweight GHC thread and forkIO. I compile them using the -threaded (or -smp) option, and run both with +RTS -N2 -RTS command line option. QSort is able to make use of the dual cores on my laptop -- "top" shows that two threads show up and both CPUs are utilized, and "time" it will give something like this: real 0m0.502s user 0m0.872s sys 0m0.004s But Prime can only make use of one core, as shown by "top". "time" gives real 0m9.112s user 0m9.093s sys 0m0.028s Because forkOS is not used anywhere, the decision of running them on 1 or 2 OS threads seem rather arbitary. Why? Regards, Paul L
import Control.Concurrent import System.Random import Data.Array.MArray import Data.Array.IO import System.IO.Unsafe import Control.Exception
1. Quick Sort
testQSort' n verbose = do let b = (0, n - 1) arr <- newArray b 0 :: IO (IOUArray Int Int) initM' (mkStdGen 0) b arr waitForChildren qsortM' b arr waitForChildren if verbose then getElems arr >>= putStrLn . show else return ()
Initialize an array with random numbers.
initM' g (i, j) arr | j - i < 10000 = fillArr g i j where fillArr g i j = if i > j then return () else do let (v, g') = next g writeArray arr i v >> fillArr g' (i + 1) j initM' g (i, j) arr = do let k = (i + j) `div` 2 (g1, g2) = split g forkChild $ initM' g1 (i, k) arr forkChild $ initM' g2 (k + 1, j) arr return ()
qsortM' (i, j) arr = qsort' (i, j) where qsort' (i, j) = if j <= i then return () else do k <- split i j if j - i > 10000 then (forkChild $ qsort' (i, k - 1)) >> return () else qsort' (i, k - 1) qsort' (k + 1, j) split left right = do v <- readArray arr right let split' i j = if j == right then swap i right v >> return i else do b <- readArray arr j if b <= v then (swap i j b) >> split' (i + 1) (j + 1) else split' i (j + 1) split' left left swap i j b = do a <- readArray arr i writeArray arr i b writeArray arr j a
2. Prime
testPrime' n verbose = do arr <- newArray (0, n) True :: IO (IOUArray Int Bool) primeM' arr n waitForChildren if verbose then getElems arr >>= putStrLn . show . map fst . filter snd . zip [0..] else return ()
primeM' arr n = do let p = truncate $ sqrt (fromIntegral n) + 1 remove i = if i > p then return () else do spawnRemover (i + 1) remove' (i + i) where remove' j = if j > n then return () else do writeArray arr j False remove' (j + i) spawnRemover j = if j > n then return () else do t <- readArray arr j if t then forkChild (remove j) else spawnRemover (j + 1) remove 2
Manage thread termination
children :: MVar [MVar ()] children = unsafePerformIO (newMVar [])
waitForChildren :: IO () waitForChildren = do cs <- takeMVar children case cs of [] -> putMVar children cs m:ms -> do putMVar children ms takeMVar m waitForChildren
forkChild :: IO () -> IO () forkChild io = do mvar <- newEmptyMVar childs <- takeMVar children putMVar children (mvar:childs) forkIO (io `finally` putMVar mvar ()) return ()

replying to my own message... the behavior is only when -O is used
during compilation, otherwise they both run on 2 cores but at a much
lower (1/100) speed.
On 7/6/07, Paul L
I have two parallel algorithms that use the lightweight GHC thread and forkIO. I compile them using the -threaded (or -smp) option, and run both with +RTS -N2 -RTS command line option.
QSort is able to make use of the dual cores on my laptop -- "top" shows that two threads show up and both CPUs are utilized, and "time" it will give something like this:
real 0m0.502s user 0m0.872s sys 0m0.004s
But Prime can only make use of one core, as shown by "top". "time" gives
real 0m9.112s user 0m9.093s sys 0m0.028s
Because forkOS is not used anywhere, the decision of running them on 1 or 2 OS threads seem rather arbitary. Why?
Regards, Paul L
import Control.Concurrent import System.Random import Data.Array.MArray import Data.Array.IO import System.IO.Unsafe import Control.Exception
1. Quick Sort
testQSort' n verbose = do let b = (0, n - 1) arr <- newArray b 0 :: IO (IOUArray Int Int) initM' (mkStdGen 0) b arr waitForChildren qsortM' b arr waitForChildren if verbose then getElems arr >>= putStrLn . show else return ()
Initialize an array with random numbers.
initM' g (i, j) arr | j - i < 10000 = fillArr g i j where fillArr g i j = if i > j then return () else do let (v, g') = next g writeArray arr i v >> fillArr g' (i + 1) j initM' g (i, j) arr = do let k = (i + j) `div` 2 (g1, g2) = split g forkChild $ initM' g1 (i, k) arr forkChild $ initM' g2 (k + 1, j) arr return ()
qsortM' (i, j) arr = qsort' (i, j) where qsort' (i, j) = if j <= i then return () else do k <- split i j if j - i > 10000 then (forkChild $ qsort' (i, k - 1)) >> return () else qsort' (i, k - 1) qsort' (k + 1, j) split left right = do v <- readArray arr right let split' i j = if j == right then swap i right v >> return i else do b <- readArray arr j if b <= v then (swap i j b) >> split' (i + 1) (j + 1) else split' i (j + 1) split' left left swap i j b = do a <- readArray arr i writeArray arr i b writeArray arr j a
2. Prime
testPrime' n verbose = do arr <- newArray (0, n) True :: IO (IOUArray Int Bool) primeM' arr n waitForChildren if verbose then getElems arr >>= putStrLn . show . map fst . filter snd . zip [0..] else return ()
primeM' arr n = do let p = truncate $ sqrt (fromIntegral n) + 1 remove i = if i > p then return () else do spawnRemover (i + 1) remove' (i + i) where remove' j = if j > n then return () else do writeArray arr j False remove' (j + i) spawnRemover j = if j > n then return () else do t <- readArray arr j if t then forkChild (remove j) else spawnRemover (j + 1) remove 2
Manage thread termination
children :: MVar [MVar ()] children = unsafePerformIO (newMVar [])
waitForChildren :: IO () waitForChildren = do cs <- takeMVar children case cs of [] -> putMVar children cs m:ms -> do putMVar children ms takeMVar m waitForChildren
forkChild :: IO () -> IO () forkChild io = do mvar <- newEmptyMVar childs <- takeMVar children putMVar children (mvar:childs) forkIO (io `finally` putMVar mvar ()) return ()

ninegua:
replying to my own message... the behavior is only when -O is used during compilation, otherwise they both run on 2 cores but at a much lower (1/100) speed.
Hmm, any change with -O2? Is the optimiser changing the code such that the scheduler doesn't get to switch threads as often? If you change the thread scheduler switching rate does that change anything? See the GHC user's guide for more details: 7.12.1.3.�Scheduling policy for concurrent threads Runnable threads are scheduled in round-robin fashion. Context switches are signalled by the generation of new sparks or by the expiry of a virtual timer (the timer interval is configurable with the -C[<num>] RTS option). However, a context switch doesn't really happen until the current heap block is full. You can't get any faster context switching than this. When a context switch occurs, pending sparks which have not already been reduced to weak head normal form are turned into new threads. However, there is a limit to the number of active threads (runnable or blocked) which are allowed at any given time. This limit can be adjusted with the -t <num> RTS option (the default is 32). Once the thread limit is reached, any remaining sparks are deferred until some of the currently active threads are completed. Perhaps SimonM can shed some light here?
On 7/6/07, Paul L
wrote: I have two parallel algorithms that use the lightweight GHC thread and forkIO. I compile them using the -threaded (or -smp) option, and run both with +RTS -N2 -RTS command line option.
QSort is able to make use of the dual cores on my laptop -- "top" shows that two threads show up and both CPUs are utilized, and "time" it will give something like this:
real 0m0.502s user 0m0.872s sys 0m0.004s
But Prime can only make use of one core, as shown by "top". "time" gives
real 0m9.112s user 0m9.093s sys 0m0.028s
Because forkOS is not used anywhere, the decision of running them on 1 or 2 OS threads seem rather arbitary. Why?
Regards, Paul L
import Control.Concurrent import System.Random import Data.Array.MArray import Data.Array.IO import System.IO.Unsafe import Control.Exception
1. Quick Sort
testQSort' n verbose = do let b = (0, n - 1) arr <- newArray b 0 :: IO (IOUArray Int Int) initM' (mkStdGen 0) b arr waitForChildren qsortM' b arr waitForChildren if verbose then getElems arr >>= putStrLn . show else return ()
Initialize an array with random numbers.
initM' g (i, j) arr | j - i < 10000 = fillArr g i j where fillArr g i j = if i > j then return () else do let (v, g') = next g writeArray arr i v >> fillArr g' (i + 1) j initM' g (i, j) arr = do let k = (i + j) `div` 2 (g1, g2) = split g forkChild $ initM' g1 (i, k) arr forkChild $ initM' g2 (k + 1, j) arr return ()
qsortM' (i, j) arr = qsort' (i, j) where qsort' (i, j) = if j <= i then return () else do k <- split i j if j - i > 10000 then (forkChild $ qsort' (i, k - 1)) >> return () else qsort' (i, k - 1) qsort' (k + 1, j) split left right = do v <- readArray arr right let split' i j = if j == right then swap i right v >> return i else do b <- readArray arr j if b <= v then (swap i j b) >> split' (i + 1) (j + 1) else split' i (j + 1) split' left left swap i j b = do a <- readArray arr i writeArray arr i b writeArray arr j a
2. Prime
testPrime' n verbose = do arr <- newArray (0, n) True :: IO (IOUArray Int Bool) primeM' arr n waitForChildren if verbose then getElems arr >>= putStrLn . show . map fst . filter snd . zip [0..] else return ()
primeM' arr n = do let p = truncate $ sqrt (fromIntegral n) + 1 remove i = if i > p then return () else do spawnRemover (i + 1) remove' (i + i) where remove' j = if j > n then return () else do writeArray arr j False remove' (j + i) spawnRemover j = if j > n then return () else do t <- readArray arr j if t then forkChild (remove j) else spawnRemover (j + 1) remove 2
Manage thread termination
children :: MVar [MVar ()] children = unsafePerformIO (newMVar [])
waitForChildren :: IO () waitForChildren = do cs <- takeMVar children case cs of [] -> putMVar children cs m:ms -> do putMVar children ms takeMVar m waitForChildren
forkChild :: IO () -> IO () forkChild io = do mvar <- newEmptyMVar childs <- takeMVar children putMVar children (mvar:childs) forkIO (io `finally` putMVar mvar ()) return ()
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Donald Bruce Stewart wrote:
ninegua:
replying to my own message... the behavior is only when -O is used during compilation, otherwise they both run on 2 cores but at a much lower (1/100) speed.
Hmm, any change with -O2? Is the optimiser changing the code such that the scheduler doesn't get to switch threads as often? If you change the thread scheduler switching rate does that change anything?
See the GHC user's guide for more details:
7.12.1.3.�Scheduling policy for concurrent threads
Runnable threads are scheduled in round-robin fashion. Context switches are signalled by the generation of new sparks or by the expiry of a virtual timer (the timer interval is configurable with the -C[<num>] RTS option). However, a context switch doesn't really happen until the current heap block is full. You can't get any faster context switching than this.
When a context switch occurs, pending sparks which have not already been reduced to weak head normal form are turned into new threads. However, there is a limit to the number of active threads (runnable or blocked) which are allowed at any given time. This limit can be adjusted with the -t <num> RTS option (the default is 32). Once the thread limit is reached, any remaining sparks are deferred until some of the currently active threads are completed.
I think you got that from an old version of the users's guide - it certainly isn't in the 6.6.1 or HEAD versions of the docs. I don't have any specific advice about the program in this thread, but in my (limited) experience with debugging parallelism problems in GHC, these are common: (a) the child threads aren't doing any work, just accumulating a large thunk which gets evaluated by the main thread sequentially. (b) you have a sequential dependency somewhere (c) tight loops that don't allocate don't give the scheduler a chance to run and load-balance. (d) GHC's scheduler is too stupid I doubt that (c) is a problem for you: it normally occurs when you try to use par/seq and strategies, and are playing with parallel fibonacci. Here you are using forkIO which definitely allocates, so that shouldn't be a problem. (d) is quite possible. I once tried to parallelise the simple concurrency example from the language shootout, which essentially consists of a long chain of threads with data items being passed along the chain. I could only get any kind of speedup when I fixed half the chain on to each CPU, rather than using the automatic migration logic in the scheduler. You can use GHC.Conc.forkOn for this: forkOnIO :: Int -> IO () -> IO ThreadId pass it an integer T, and the thread will be stuck to CPU T `mod` N (where N is the number of CPUs). The RTS doesn't really phyisically fix its execution units to CPUs, but usually the OS manages to do a reasonable job of this. In GHC 6.8, hopefully we'll have some better tools for debugging parallelism performance problems. Michael Adams (who just finished an internship here at MSR) ported some of the GranSim visualisation tools to the current GHC, I have the patches sitting in my inbox ready to review. Cheers, Simon

Donald Bruce Stewart wrote:
Hmm, any change with -O2? Is the optimiser changing the code such that the scheduler doesn't get to switch threads as often? If you change the thread scheduler switching rate does that change anything?
The behavior only appears when -O or anything greater than -O is
applied. It does appear to be that thread switching isn't happening as
often as I wanted it for the Prime number example.
On 7/10/07, Simon Marlow
(a) the child threads aren't doing any work, just accumulating a large thunk which gets evaluated by the main thread sequentially.
this is unlikely, as it's using IO monad, which forces evaluation for things like array updates.
(b) you have a sequential dependency somewhere
also unlikely, because without -O it'd use two OS threads.
(c) tight loops that don't allocate don't give the scheduler a chance to run and load-balance.
I doubt that (c) is a problem for you: it normally occurs when you try to use par/seq and strategies, and are playing with parallel fibonacci. Here you are using forkIO which definitely allocates, so that shouldn't be a problem.
it's possible that the thread doesn't allocate much after the optimization. In the Prime number example, every thread actually spawns a new thread before doing its own work, and the work it does (function remove') will not spawn new threads. It is very likely that remove' is optimized to a simple loop as it's tail recursive. So each thread has little chance to give other thread a chance to run if it doesn't switch during thread spawning. Compare this to the QSort example, where each thread spawns a new thread to sort half of the array after splitting, and continue to sort the other half in the original thread. This could explain the difference. Indeed, after I insert a yield after "spawnRemover (i + 1)", it now happily crunches number on both CPUs. Thank you both for the suggestions! Regards, Paul L
participants (3)
-
dons@cse.unsw.edu.au
-
Paul L
-
Simon Marlow