
I was trying to speed up a program that I wrote, so I thought about using multiple threads. I have a quite simple parallel program, and I did the following:

    do subRes <- MVar.newMVar []
       putStrLn "starting threads"
       subV <- flip mapM [0 .. (nThreads - 1)] $ \i -> do
           subR <- MVar.newEmptyMVar
           let writeRes r = MVar.putMVar subR r
           forkOS (do
               let r = eval (startData !! i)
               writeRes $! r
               putStr "writtenRes")
           return subR
       putStrLn "started threads"
       toFold <- mapM MVar.takeMVar subV
       putStrLn "about to fold"
       return $ foldl1 mergeRes toFold

I know that the threads really calculate what I want, and as soon as they are finished I get the result. The problem is that I get no speedup: the time is basically the sum of the times for the two threads. I thought that GHC would now take advantage of the two CPUs if I compiled with -threaded. Am I wrong? Do I need some special flag, or a newer version of the compiler (I have 6.6.20070129), or is this just normal?

Thanks
Fawzi

On Sat, Apr 14, 2007 at 12:27:10AM +0200, Fawzi Mohamed wrote:
> I know that the threads really calculate what I want, and as soon as they are finished I get the result. The problem is that I get no speedup: the time is basically the sum of the times for the two threads. I thought that GHC would now take advantage of the two CPUs if I compiled with -threaded. Am I wrong? Do I need some special flag, or a newer version of the compiler (I have 6.6.20070129), or is this just normal?
./MyProgram +RTS -N2

where N is your CPU count.

(That said, DO NOT USE THREADS IF AT ALL POSSIBLE: they are ugly and cause heisenbugs. If you want parallelism, `par` from Control.Parallel is to be preferred whenever possible, since it is deterministic.)

Stefan
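For reference, a minimal sketch of what such a `par`-based version of the code above might look like; `eval`, `mergeRes`, and `startData` are the names from the original post, while `parEvalFold` and the sparking helper are illustrative:

    import Control.Parallel (par, pseq)

    -- Spark each sub-result in parallel, then merge them: no MVars and
    -- no explicit threads, the RTS schedules the sparks across CPUs.
    parEvalFold :: (a -> b) -> (b -> b -> b) -> [a] -> b
    parEvalFold eval mergeRes startData =
        foldr1 mergeRes (sparkAll (map eval startData))
      where
        -- Force the list spine, sparking each element along the way, so
        -- all sparks exist before the fold starts demanding results.
        sparkAll []     = []
        sparkAll (x:xs) = let rest = sparkAll xs
                          in x `par` (rest `pseq` (x : rest))

This still needs -threaded and +RTS -N2 to run in parallel, but unlike the MVar version it is deterministic.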

On Apr 14, 2007, at 12:33 AM, Stefan O'Rear wrote:
> ./MyProgram +RTS -N2
>
> where N is your CPU count.
Thanks, that was it.
> (That said, DO NOT USE THREADS IF AT ALL POSSIBLE: they are ugly and cause heisenbugs. If you want parallelism, `par` from Control.Parallel is to be preferred whenever possible, since it is deterministic.)
In theory yes, but I am quite used to programming with threads and even MPI. I have looked at Control.Parallel (and the nice article on it), but there is no easy way (at least that I saw) to hand the whole calculation off to a sub-thread. Actually my code should be equivalent to parMap rwhnf, but I get the following results:

parMap: 3.63user 0.02system 0:01.97elapsed 185%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1039minor)pagefaults 0swaps

threads: 3.14user 0.02system 0:01.68elapsed 187%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1041minor)pagefaults 0swaps

I suppose that it is because I have a thread for each element in the list plus a main thread, versus just one thread per element in the list, but I am not sure; if someone has some ideas...

With threads (now that I managed to get some speedup) I can use a workers/queue approach and have better load balancing.

I had looked at strategies, but I first need a breadth-first traversal of a graph generated on the fly (to be parallel) and then a depth-first traversal (to avoid space leaks), and I found no easy way to do it with strategies, so I did it by hand.

By the way, is there a way to know how many processors are available to the program (to make the strategy or thread control depend on it)?

Fawzi
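For comparison, the parMap version being measured presumably looks something like the following; a sketch assuming parMap and rwhnf from Control.Parallel.Strategies as it shipped around GHC 6.6 (in later releases rwhnf was renamed rseq):

    import Control.Parallel.Strategies (parMap, rwhnf)

    -- One spark per list element, each evaluated to weak head normal
    -- form; the main thread then folds the results together.
    parMapFold :: (a -> b) -> (b -> b -> b) -> [a] -> b
    parMapFold eval mergeRes startData =
        foldl1 mergeRes (parMap rwhnf eval startData)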

On Sat, Apr 14, 2007 at 01:31:58AM +0200, Fawzi Mohamed wrote:
> do subRes <- MVar.newMVar []
>    putStrLn "starting threads"
>    subV <- flip mapM [0 .. (nThreads - 1)] $ \i -> do
>        subR <- MVar.newEmptyMVar
>        let writeRes r = MVar.putMVar subR r
>        forkOS (do
Getting rid of that forkOS will help a LOT - forkOS is very expensive and gives no benefit. It exists only for the benefit of FFI libs like OpenGL that use thread-local state. Plain forkIO uses a thread pool in the runtime; it is even more parallel than forkOS, and extremely cheap.

The fact that you are using forkOS is a dead giveaway of a documentation bug somewhere. Please report it, so nobody else makes this mistake!
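Concretely, the fix is a one-word change in the quoted loop; a self-contained sketch of the forkIO version (the names mirror the quoted code and are illustrative):

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

    -- One lightweight forkIO thread per work item, with results handed
    -- back through MVars; compile with -threaded, run with +RTS -N2.
    forkEval :: (a -> b) -> (b -> b -> b) -> [a] -> IO b
    forkEval eval mergeRes startData = do
        subV <- mapM spawn startData
        toFold <- mapM takeMVar subV
        return (foldl1 mergeRes toFold)
      where
        spawn x = do
            subR <- newEmptyMVar
            _ <- forkIO (putMVar subR $! eval x)  -- $! forces the result in the worker
            return subR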
> [...] Actually my code should be equivalent to parMap rwhnf, but I get the following results:
>
> parMap: 3.63user 0.02system 0:01.97elapsed 185%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1039minor)pagefaults 0swaps
>
> threads: 3.14user 0.02system 0:01.68elapsed 187%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1041minor)pagefaults 0swaps
Those look equally parallel to me :)
> I suppose that it is because I have a thread for each element in the list plus a main thread, versus just one thread per element in the list, but I am not sure; if someone has some ideas...
>
> With threads (now that I managed to get some speedup) I can use a workers/queue approach and have better load balancing.
Threads are supposed to be cheap. If implementing a thread pool by hand ever gives a speedup, it's a bug.
Stefan

Thanks again for the answers, Stefan.

On Apr 14, 2007, at 1:41 AM, Stefan O'Rear wrote:
> Getting rid of that forkOS will help a LOT - forkOS is very expensive and gives no benefit. It exists only for the benefit of FFI libs like OpenGL that use thread-local state.
>
> Plain forkIO uses a thread pool in the runtime; it is even more parallel than forkOS, and extremely cheap.
Yes, indeed: using forkIO (which I had replaced only while trying to find a way to make my program faster) one gets basically the same results as with forkOS:

threads(IO): 3.10user 0.02system 0:01.68elapsed 184%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1037minor)pagefaults 0swaps
> The fact that you are using forkOS is a dead giveaway of a documentation bug somewhere. Please report it, so nobody else makes this mistake!
No, actually I had used forkIO before, but seeing no speedup I also tried forkOS, hoping that it would speed things up...
>> [...] Actually my code should be equivalent to parMap rwhnf, but I get the following results:
>>
>> parMap: 3.63user 0.02system 0:01.97elapsed 185%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1039minor)pagefaults 0swaps
>>
>> threads(OS): 3.14user 0.02system 0:01.68elapsed 187%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1041minor)pagefaults 0swaps
> Those look equally parallel to me :)
Exactly the same code, and parMap is 17% slower with respect to creating threads manually for each element in the list and waiting for them all to finish...

I can understand that maybe this way one is guaranteed that if the original program ends then the par version will end too (one can spark bottom), but it is just less efficient if it is known that one needs the data of the threads, to partially calculate it together with them. Actually, a parMap that evaluates the last element of the list itself, instead of sparking it, should be very close to the example with explicit threads (if the workload is more or less balanced). Sure enough, using

    import Control.Parallel (par)

    mParMap :: (a -> b) -> [a] -> [b]
    mParMap f (l0:l1:lr) =
        let n0 = f l0
            nr = mParMap f (l1:lr)
        in n0 `par` (nr `seq` (n0:nr))
    mParMap f [l0] =
        let n0 = f l0
        in n0 `seq` [n0]
    mParMap f [] = []

I get timings similar to the thread example:

mParMap: 3.15user 0.02system 0:01.72elapsed 184%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1040minor)pagefaults 0swaps

Actually I think that 99% of the time, when one wants a parallel map, one wants something like this, not the parallel map in Strategies; maybe there should be a parList' defined this way, and a corresponding parMap'.
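For concreteness, the proposed parList'/parMap' could be written as follows; a sketch using only par and pseq from Control.Parallel (the primed names are the author's suggestion, not an existing API):

    import Control.Parallel (par, pseq)

    -- Like a parallel list strategy, except that the last element is
    -- evaluated in the calling thread rather than sparked, so the
    -- caller shares in the work instead of only waiting.
    parList' :: [a] -> ()
    parList' []     = ()
    parList' [x]    = x `pseq` ()
    parList' (x:xs) = x `par` parList' xs

    parMap' :: (a -> b) -> [a] -> [b]
    parMap' f xs = let ys = map f xs
                   in parList' ys `pseq` ys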
>> With threads (now that I managed to get some speedup) I can use a workers/queue approach and have better load balancing.
>
> Threads are supposed to be cheap. If implementing a thread pool by hand ever gives a speedup, it's a bug.
But making my own worker threads, if I know the number of workers, will be more efficient. My program is extremely parallel, but putting a par everywhere would be very costly in memory and would probably break the program in the wrong places. I know where I should spark threads so that I have few high-level tasks and coarse-grained parallelism, and if I know the number of workers I can do this much more efficiently by hand. Furthermore, I can put the tasks in a queue in order of decreasing cost and get rather good load balancing without having to think too much about a static distribution.

So having a thread pool by hand does make sense; doing it with OS threads and trying to beat the GHC runtime does not.
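As an illustration, a hand-rolled worker pool along these lines might look as follows; a sketch built on Control.Concurrent's Chan and MVar, where nWorkers, the cost estimate, and the per-task action are assumed inputs:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan (newChan, readChan, writeChan)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (replicateM, replicateM_)
    import Data.List (sortBy)
    import Data.Ord (Down (..), comparing)

    -- nWorkers threads pull tasks from a shared queue; tasks are
    -- enqueued in order of decreasing estimated cost, which gives
    -- reasonable load balancing without a static distribution.
    workerPool :: Int -> (task -> Int) -> (task -> IO ()) -> [task] -> IO ()
    workerPool nWorkers cost run tasks = do
        queue <- newChan
        mapM_ (writeChan queue . Just) (sortBy (comparing (Down . cost)) tasks)
        replicateM_ nWorkers (writeChan queue Nothing)  -- one stop marker per worker
        dones <- replicateM nWorkers $ do
            done <- newEmptyMVar
            _ <- forkIO (worker queue done)
            return done
        mapM_ takeMVar dones                            -- wait for all workers
      where
        worker queue done = do
            next <- readChan queue
            case next of
              Nothing -> putMVar done ()
              Just t  -> run t >> worker queue done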
>> By the way, is there a way to know how many processors are available to the program (to make the strategy or thread control depend on it)?
Can anyone answer this?

Thanks
Fawzi

On 4/14/07, Fawzi Mohamed wrote:
> But making my own worker threads, if I know the number of workers, will be more efficient. [...] Furthermore, I can put the tasks in a queue in order of decreasing cost and get rather good load balancing without having to think too much about a static distribution. So having a thread pool by hand does make sense; doing it with OS threads and trying to beat the GHC runtime does not.
I think you should probably consider the extremely lightweight forkIO threads as your "work items" and the GHC runtime as your thread pool system (it will find out how many threads you want using the RTS options and distribute the work for you). If you're worried about memory efficiency you can tweak the initial stack sizes for threads etc. using runtime options.

It's still true that you don't want to fork off trivial computations in separate threads, BUT that's true for manual work-item queues as well (you'd want each work item to be a substantial amount of computation, because there is overhead per item). E.g. if you have a list, you might not want one thread per element (and you wouldn't want one work item per element either) if the per-element tasks are fairly trivial, so you'd first group the list into chunks, and then let each chunk be a work item (i.e. spawn a forkIO thread to process it).

I'd be interested in seeing benchmarks on this, but I do think that you'll be better off just spawning a lightweight thread per task, rather than first wrapping it in some data structure as a work item, then putting it in a queue, then popping items off the queue into threads. Doing it that way would just be duplicating work.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
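A minimal sketch of the chunking approach described above, with one forkIO thread per chunk; chunkedParMap, chunkSize, and the helper names are illustrative:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

    -- Split the input into fixed-size chunks and evaluate each chunk
    -- in its own lightweight thread; the RTS spreads the threads over
    -- the available capabilities (+RTS -N).
    chunkedParMap :: Int -> (a -> b) -> [a] -> IO [b]
    chunkedParMap chunkSize f xs = do
        vars <- mapM spawn (chunks chunkSize xs)
        fmap concat (mapM takeMVar vars)
      where
        spawn chunk = do
            v <- newEmptyMVar
            _ <- forkIO $ do
                let ys = map f chunk
                -- force every element to WHNF inside the worker thread,
                -- so the work happens here and not in the consumer
                foldr seq () ys `seq` putMVar v ys
            return v
        chunks _ [] = []
        chunks n ys = let (h, t) = splitAt n ys in h : chunks n t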

On Apr 14, 2007, at 2:45 PM, Sebastian Sylvan wrote:
> It's still true that you don't want to fork off trivial computations in separate threads, BUT that's true for manual work-item queues as well [...] so you'd first group the list into chunks, and then let each chunk be a work item (i.e. spawn a forkIO thread to process it).
Yes, but to pick the optimal chunk size one would like to know the number of worker threads. So again: is there any way to know it at runtime, or is it bad practice to ask?
> I'd be interested in seeing benchmarks on this, but I do think that you'll be better off just spawning a lightweight thread per task, rather than first wrapping it in some data structure as a work item, then putting it in a queue, then popping items off the queue into threads.
Good idea, thanks, I will try.

Fawzi

Fawzi Mohamed wrote:
> Yes, but to pick the optimal chunk size one would like to know the number of worker threads. So again: is there any way to know it at runtime, or is it bad practice to ask?
There's no way currently, but it would be a useful thing to have.

Cheers,
Simon
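(Later GHC releases did add this: on a compiler new enough that Control.Concurrent exports getNumCapabilities, one can size chunks or a worker pool from the value the RTS is actually running with. A minimal sketch, assuming such a GHC:)

    import Control.Concurrent (getNumCapabilities)

    main :: IO ()
    main = do
        -- reports the current +RTS -N setting
        n <- getNumCapabilities
        putStrLn ("running with " ++ show n ++ " capabilities")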