Parallel combinator, performance advice

older
Monad transformer to consume a list

Neil Mitchell

7 Apr 2009 7 Apr '09

6:25 a.m.

Hi, I've written a parallel_ function, code attached. I'm looking for criticism, suggestions etc on how to improve the performance and fairness of this parallel construct. (If it turns out this construct is already in a library somewhere, I'd be interested in that too!) The problem I'm trying to solve is running system commands in parallel. Importantly (unlike other Haskell parallel stuff) I'm not expecting computationally heavy Haskell to be running in the threads, and only want a maximum of n commands to fire at a time. The way I'm trying to implement this is with a parallel_ function: parallel_ :: [IO a] -> IO () The semantics are that after parallel_ returns each action will have been executed exactly once. The implementation (attached) creates a thread pool of numCapabililties-1 threads, each of which reads from a task pool and attempts to do some useful work. I use an idempotent function to ensure that all work is done at most one, and a sequence_ to ensure all work is done at least once. Running a benchmark of issuing 1 million trivial tasks (create, modify, read and IO ref) the version without any parallelism is really fast (< 0.1 sec), and the version with parallelism is slow (> 10 sec). This could be entirely due to space leaks etc when queueing many tasks. I'm useful for any thoughts people might have! Thanks in advance, Neil

Attachments:

Parallel.hs (application/octet-stream — 1.3 KB)

Show replies by date

ChrisK

7 Apr 7 Apr

7:04 a.m.

You create one MVar for each task in order to ensure all the tasks are done. This is pretty heavyweight. You could create a single Control.Concurrent.QSemN to count the completed tasks, starting with 0. Each task is followed by signalQSemN with a value of 1. (I would use "finally"). As parallel_ launches the tasks it can count their number, then it would call waitQSemN for that quantity to have finished. -- Chris

Bulat Ziganshin

8:40 a.m.

Hello Neil, Tuesday, April 7, 2009, 2:25:12 PM, you wrote:

...

The problem I'm trying to solve is running system commands in parallel.

"system commands" means execution of external commands or just system calls inside Haskell?

...

Running a benchmark of issuing 1 million trivial tasks (create, modify, read and IO ref) the version without any parallelism is really fast (< 0.1 sec), and the version with parallelism is slow (> 10 sec). This could be entirely due to space leaks etc when queueing many tasks.

i think it's just because use of MVar/Chan is much slower than IORef activity. once i checked that on 1GHz cpu and got 2 million withMVar-s per second i don't understood exactly what you need, but my first shot is to create N threads executing commands from channel: para xs = do done <- newEmptyMVar chan <- newChan writeList2Chan chan (map Just xs ++ [Nothing]) replicateM_ numCapabilities $ do forkIO $ do forever $ do x <- readChan chan case x of Just cmd -> cmd Nothing -> putMVar done () takeMVar done -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

ChrisK

9:48 a.m.

Neil Mitchell wrote:

...

Sorry, accidentally sent before finishing the response! I also noticed you sent this directly to me, not to -cafe, was that intentional?

The mail/news gateway makes it look like that, but I also sent to the mailing list.

...

You mean something like:

parallel_ xs = sem <- createSemapore (length xs) enqueue [x >> signalSemapore sem | x <- xs] waitOnSemaphore sem

I thought of something like this, but then the thread that called parallel_ is blocked, which means if you fire off N threads you only get N-1 executing. If you have nested calls to parallel, then you end up with thread exhaustion. Is there a way to avoid that problem?

Thanks

Neil

Your parallel_ does not return until all operations are finished.

...

parallel_ (x:xs) = do ys <- mapM idempotent xs mapM_ addParallel ys sequence_ $ x : reverse ys

By the way, there is no obvious reason to insert "reverse" there. What I meant was something like:

...

para [] = return () para [x] = x para xs = do q <- newQSemN 0 let wrap x = finally x (signalQSemN q 1) go [y] n = wrap x >> waitQSemN q (succ n) go (y:ys) n = addParallel (wrap y) >> go ys $! succ n go xs 0

This is nearly identical to your code, and avoid creating the MVar for each operation. I use "finally" to ensure the count is correct, but if a worker threads dies then bas things will happen. You can replace finally with (>>) if speed is important. This is also lazy since the length of the list is not forced early.

Neil Mitchell

10:13 a.m.

...

...
The problem I'm trying to solve is running system commands in parallel.

...

"system commands" means execution of external commands or just system calls inside Haskell?

Calls to System.Cmd.system, i.e. running external console processes. It's a make system I'm writing, so virtually all the time is spent in calls to ghc etc. To Bulat: I should have been clearer with the spec. The idea is that multiple calls to paralell_ can execute, and a function executing inside parallel_ can itself call parallel_. For this reason I need one top-level thread pool, which requires unsafePerformIO. If I create a thread pool every time, I end up with more threads than I want.

...

Your parallel_ does not return until all operations are finished.

...
parallel_ (x:xs) = do ys <- mapM idempotent xs mapM_ addParallel ys sequence_ $ x : reverse ys

By the way, there is no obvious reason to insert "reverse" there.

There is a reason :-) Imagine I do parallel_ [a,b,c] That's roughly doing (if b' is idempotent b): enqueue b' enqueue c' a b' c' If while executing a the thread pool starts on b', then after I've finished a, I end up with both threads waiting for b', and nothing doing c'. If I do a reverse, then the thread pool and me are starting at different ends, so if we lock then I know it's something important to me that the thread pool started first. It's still not idea, but it happens less often.

...

What I meant was something like:

...
para [] = return () para [x] = x para xs = do q <- newQSemN 0 let wrap x = finally x (signalQSemN q 1) go [y] n = wrap x >> waitQSemN q (succ n) go (y:ys) n = addParallel (wrap y) >> go ys $! succ n go xs 0

This is nearly identical to your code, and avoid creating the MVar for each operation. I use "finally" to ensure the count is correct, but if a worker threads dies then bas things will happen. You can replace finally with (>>) if speed is important.

Consider a thread pool with 2 threads and the call parallel_ [parallel_ [b,c],a] You get the sequence: enqueue (parallel_ [b,c]) a wait on parallel_ [b,c] While you are executing a, a thread pool starts: enqueue b c wait for b Now you have all the threads waiting, and no one dealing with the thread pool. This results in deadlock. I guess the "nested calls to parallel_" bit is the part of the spec that makes everything much harder! Thanks Neil

Bulat Ziganshin

10:50 a.m.

New subject: Re[2]: Parallel combinator, performance advice

Hello Neil, Tuesday, April 7, 2009, 6:13:29 PM, you wrote:

...

Calls to System.Cmd.system, i.e. running external console processes. It's a make system I'm writing, so virtually all the time is spent in calls to ghc etc.

...

To Bulat: I should have been clearer with the spec. The idea is that multiple calls to paralell_ can execute, and a function executing inside parallel_ can itself call parallel_. For this reason I need one top-level thread pool, which requires unsafePerformIO. If I create a thread pool every time, I end up with more threads than I want.

this is smth new to solve i propose to use concept similar to Capability of GHC RTS: we have one Capability provided by thread calling para and N-1 Capabilities provided by your thread pool. all that we need is to reuse current thread Capability as part of pool! para xs = do sem <- newQSem for xs $ \x -> do writeChan chan (x `finally` signalQSem sem) tid <- forkIO (executing commands from chan...) waitQSem sem killThread tid instead of killThread we really should send pseudo-job (like my Nothing value) that will led to self-killing of job that gets this signal this solution still may lead to a bit more or less than N threads executed at the same time. your turn! -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin

11 a.m.

New subject: Re[3]: Parallel combinator, performance advice

Hello Bulat, Tuesday, April 7, 2009, 6:50:14 PM, you wrote:

...

tid <- forkIO (executing commands from chan...) waitQSem sem killThread tid

...

instead of killThread we really should send pseudo-job (like my Nothing value) that will led to self-killing of job that gets this signal

...

this solution still may lead to a bit more or less than N threads executed at the same time. your turn!

solved! every job should go together with Bool flag `killItself`. last job should have this flag set to True. thread will execute job and kill itself if this flag is True. so we get strong guarantees that there are exactly N threads in the system: para xs = do sem <- newQSem for (init xs) $ \x -> do writeChan chan (x `finally` signalQSem sem, False) writeChan chan (last x `finally` signalQSem sem, True) -- tid <- forkIO $ do let cycle = do (x,flag) <- readChan chan x unless flag cycle cycle -- waitQSem sem btw, this problem looks a great contribution into "Haskell way" book of exercises -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

ChrisK

11:01 a.m.

Neil Mitchell wrote:

...

I guess the "nested calls to parallel_" bit is the part of the spec that makes everything much harder!

Thanks

Neil

Yes. Much more annoying. But the problem here is generic. To avoid it you must never allow all thread to block at once. The parallel_ function is such a job, so you solved this with the 'idempotent' trick. You solution works by blocking all but 1 thread. 1a) Some worker thread 1 executes parallel_ with some jobs 1b) These get submitted the work queue 'chan' 1c) worker thread 1 starts on those same jobs, ignoring the queue 1d) worker thread 1 reaches the job being processed by thread 2 1e) worker thread 1 blocks until the jobs is finished in modifyMVar 2a) Worker thread 2 grabs a job posted by thread 1, that calls parallel_ 2b) This batch of jobs gets submitted to the work queue 'chan' 2c) worker thread 2 starts on those same jobs, ignoring the queue 1d) worker thread 2 reaches the job being processed by thread 3 1e) worker thread 2 blocks until the jobs is finished in modifyMVar 3...4...5... And now only 1 thread is still working, and it has to work in series. I think I can fix this...

Bulat Ziganshin

11:22 a.m.

New subject: Re[2]: Parallel combinator, performance advice

Hello Neil, Tuesday, April 7, 2009, 6:13:29 PM, you wrote:

...

Consider a thread pool with 2 threads and the call parallel_ [parallel_ [b,c],a]

...

You get the sequence: enqueue (parallel_ [b,c]) a wait on parallel_ [b,c]

...

While you are executing a, a thread pool starts: enqueue b c wait for b

...

Now you have all the threads waiting, and no one dealing with the thread pool. This results in deadlock.

i think the only way to solve this problem is to create one more thread each time. let's see: on every call to para you need to alloc one thread to wait for jobs completion. so on each nested call to para you have minus one worker thread. finally you will eat them all! so you need to make fork: one thread should serve jobs and another one wait for completion of this jobs bucket. and with killItself flag you will finish superfluous thread JIT -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell

11:33 a.m.

New subject: Re[2]: Parallel combinator, performance advice

Hi Sebastian:

...

How about using unsafeInterleaveIO to get a lazy suspension of the result of each action, and then using par to spark off each of them? If that works you can reuse the existing task-parallel system of GHC to do the heavily lifting for you, instead of having to write your own.

par is likely to spark all the computations, and then switch between them - which will mean I've got more than N things running in parallel.

...

i think the only way to solve this problem is to create one more thread each time. let's see: on every call to para you need to alloc one thread to wait for jobs completion. so on each nested call to para you have minus one worker thread. finally you will eat them all!

so you need to make fork: one thread should serve jobs and another one wait for completion of this jobs bucket. and with killItself flag you will finish superfluous thread JIT

You are right, your previous solution was running at N-1 threads if the order was a little unlucky. I've attached a new version which I think gives you N threads always executing at full potential. It's basically your idea from the last post, with the main logic being: parallel_ (x1:xs) = do sem <- newQSem $ 1 - length xs forM_ xs $ \x -> writeChan queue (x >> signalQSem sem, False) x1 addWorker waitQSem sem writeChan queue (signalQSem sem, True) waitQSem sem Where the second flag being True = kill, as you suggested. I think I've got the semaphore logic right - anyone want to see if I missed something? With this new version running 1000000 items takes ~1 second, instead of ~10 seconds before, so an order of magnitude improvement, and greater fairness. Very nice, thanks for all the help! Thanks Neil

Bulat Ziganshin

11:45 a.m.

New subject: Re[4]: Parallel combinator, performance advice

Hello Neil, Tuesday, April 7, 2009, 7:33:25 PM, you wrote:

...

...
How about using unsafeInterleaveIO to get a lazy suspension of the result of each action, and then using par to spark off each of them? If that works you can reuse the existing task-parallel system of GHC to do the heavily lifting for you, instead of having to write your own.

...

par is likely to spark all the computations, and then switch between them - which will mean I've got more than N things running in parallel.

par/GHC RTS limits amount of Haskell threads running simultaneously. with a system call marked as safe, Capability will be freed while we execute external program so nothing will be limited except for amount of tasks *starting* (as opposite to running) simultaneously :))) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin

11:50 a.m.

New subject: Re[4]: Parallel combinator, performance advice

Hello Neil, Tuesday, April 7, 2009, 7:33:25 PM, you wrote:

...

parallel_ (x1:xs) = do sem <- newQSem $ 1 - length xs forM_ xs $ \x -> writeChan queue (x >> signalQSem sem, False) x1 addWorker waitQSem sem writeChan queue (signalQSem sem, True) waitQSem sem

...

Where the second flag being True = kill, as you suggested. I think I've got the semaphore logic right - anyone want to see if I missed something?

Neil, executing x1 directly in parallel_ is incorrect idea. you should have N worker threads, not N-1 threads plus one job executed in main thread. imagine that you have 1000 jobs and N=4. that you will got here is 3 threads each executed 333 jobs and 1 job executed by main thread so you still need to detach one more worker job and finish it just before we are ready to finish waiting for QSem and continue in main thread which is sole reason why we need killItself flag. in this code snipped this flag is completely useless, btw -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin

12:10 p.m.

New subject: Re[5]: Parallel combinator, performance advice

Hello Bulat, Tuesday, April 7, 2009, 7:50:08 PM, you wrote:

...

...
parallel_ (x1:xs) = do sem <- newQSem $ 1 - length xs forM_ xs $ \x -> writeChan queue (x >> signalQSem sem, False) x1 addWorker waitQSem sem writeChan queue (signalQSem sem, True) waitQSem sem

...

Neil, executing x1 directly in parallel_ is incorrect idea.

forget this. but it still a bit suboptimal: after everything was finished, we schedule one more empty job and wait while some worker thread will pick up it. it will go into Chan after all jobs scheduled at the time our jobs was executed so that we are doing here is eventually don't do any internal activity while we have all N external programs running instead, my solution packed this flag together with last job so once last job is finished we are immediately returned from parallel_ so other internal activity may go on -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell

12:36 p.m.

New subject: Re[5]: Parallel combinator, performance advice

...

par is likely to spark all the computations, and then switch between them - which will mean I've got more than N things running in parallel.

| par/GHC RTS limits amount of Haskell threads running simultaneously. | with a system call marked as safe, Capability will be freed while we | execute external program so nothing will be limited except for amount | of tasks *starting* (as opposite to running) simultaneously :))) Yeah, I misspoke - I want to avoid starting N things.

...

...
...
parallel_ (x1:xs) = do sem <- newQSem $ 1 - length xs forM_ xs $ \x -> writeChan queue (x >> signalQSem sem, False) x1 addWorker waitQSem sem writeChan queue (signalQSem sem, True) waitQSem sem

...
Neil, executing x1 directly in parallel_ is incorrect idea.

It's a very slight optimisation, as it saves us queueing and dequeueing x1, since we know the worker we're about the spawn on the line below will grab x1 immediately.

...

forget this. but it still a bit suboptimal: after everything was finished, we schedule one more empty job and wait while some worker thread will pick up it. it will go into Chan after all jobs scheduled at the time our jobs was executed so that we are doing here is eventually don't do any internal activity while we have all N external programs running

instead, my solution packed this flag together with last job so once last job is finished we are immediately returned from parallel_ so other internal activity may go on

There is no guarantee that the last job finishes last. If the first job takes longer than the last job we'll be one thread short while waiting on the first job. It's a shame, since removing that additional writeChan isn't particularly useful. Thanks Neil

Bulat Ziganshin

12:40 p.m.

New subject: Re[6]: Parallel combinator, performance advice

Hello Bulat, Tuesday, April 7, 2009, 8:10:43 PM, you wrote:

...

...
...
parallel_ (x1:xs) = do sem <- newQSem $ 1 - length xs forM_ xs $ \x -> writeChan queue (x >> signalQSem sem, False) x1 addWorker waitQSem sem writeChan queue (signalQSem sem, True) waitQSem sem

...

...
Neil, executing x1 directly in parallel_ is incorrect idea.

...

forget this. but it still a bit suboptimal...

i think i realized why you use this schema. my solution may lead to N-1 worker threads in the system if last job is too small - after its execution we finish one thread and have just N-1 working threads until parallel_ will be finished but problem i mentioned in previous letter may also take place although it looks like less important. we may solve both problems by allowing worker thread to actively select its death time: it should die only at the moment when *last* job in bucket was finished - this guarantees us exactly N worker threads at any time. so: parallel_ (x1:xs) = do sem <- newQSem $ - length xs jobsLast <- newMVar (length xs) addWorker forM_ (x1:xs) $ \x -> do writeChan queue $ do x signalQSem sem modifyMVar jobsLast $ \jobs -> do return (jobs-1, jobs==0) -- waitQSem sem and modify last 3 lines of addWorker: addWorker :: IO () addWorker = do forkIO $ f `E.catch` \(e :: SomeException) -> throwTo mainThread $ ErrorCall "Control.Concurrent.Parallel: parallel thread died." return () where f :: IO () f = do act <- readChan queue kill <- act unless kill f -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell

8 Apr 8 Apr

6:33 a.m.

New subject: Re[6]: Parallel combinator, performance advice

...

...
...
...
parallel_ (x1:xs) = do sem <- newQSem $ 1 - length xs forM_ xs $ \x -> writeChan queue (x >> signalQSem sem, False) x1 addWorker waitQSem sem writeChan queue (signalQSem sem, True) waitQSem sem

...
...
Neil, executing x1 directly in parallel_ is incorrect idea.

...
forget this. but it still a bit suboptimal...

i think i realized why you use this schema. my solution may lead to N-1 worker threads in the system if last job is too small - after its execution we finish one thread and have just N-1 working threads until parallel_ will be finished

but problem i mentioned in previous letter may also take place although it looks like less important. we may solve both problems by allowing worker thread to actively select its death time: it should die only at the moment when *last* job in bucket was finished - this guarantees us exactly N worker threads at any time. so:

parallel_ (x1:xs) = do sem <- newQSem $ - length xs jobsLast <- newMVar (length xs) addWorker forM_ (x1:xs) $ \x -> do writeChan queue $ do x signalQSem sem modifyMVar jobsLast $ \jobs -> do return (jobs-1, jobs==0)

Yes, this saves us adding a kill job to the queue, but requires an extra MVar. I guess which one of these is to be preferred depends on performance measures. I've just found that QSem _ISN'T_ valid below 0, so the above code won't actually work with QSem as it stands. I've reported this on ghc-users@. I've also spotted that QSem creates plenty of MVar's itself, so the logic of moving to QSem instead of MVar isn't really valid (although the new approach is nicer, so that's good). Thanks Neil

Bulat Ziganshin

6:45 a.m.

New subject: Re[8]: Parallel combinator, performance advice

Hello Neil, Wednesday, April 8, 2009, 2:33:15 PM, you wrote:

...

Yes, this saves us adding a kill job to the queue, but requires an extra MVar. I guess which one of these is to be preferred depends on performance measures.

i think that main problem with your last variant is that kill job added to the queue much later than real jobs. imagine the following scenario: 1. para used to add 100 jobs to the queue 2. while these jobs are executing, para in another thread adds another 100 jobs to the queue in my variant, first para exits just when its own jobs are completed. in your variant, exit job is added to the queue only after second batch of 100 jobs so we will wait until all these jobs are executed -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell

7:25 a.m.

New subject: Re[8]: Parallel combinator, performance advice

Hi Bulat,

...

...
Yes, this saves us adding a kill job to the queue, but requires an extra MVar. I guess which one of these is to be preferred depends on performance measures.

i think that main problem with your last variant is that kill job added to the queue much later than real jobs. imagine the following scenario:

1. para used to add 100 jobs to the queue 2. while these jobs are executing, para in another thread adds another 100 jobs to the queue

in my variant, first para exits just when its own jobs are completed. in your variant, exit job is added to the queue only after second batch of 100 jobs so we will wait until all these jobs are executed

Ah, you are indeed right. I forgot about the property of timeliness - you want parallel_ to return as soon as it can. Once you're using an MVar to do the killing, you might as well use the same MVar to wait on, and avoid the semaphores, and avoid the bug with negative semaphores: http://hackage.haskell.org/trac/ghc/ticket/3159 I've attached a revised implementation. With my benchmark it gives a stack overflow: n = 1000000 main = do r <- newIORef (0 :: Int) let incRef = atomicModifyIORef r (\a -> let a2 = a + 1 in a2 `seq` (a2,a)) parallel_ $ replicate n $ incRef v <- readIORef r parallelStop print v However, calling with that many items in a parallel queue is highly unlikely, so I'll not worry (if anyone knows why it stack overflows, I am curious). This parallel_ implementation has all the properties I want, so I'm very happy with it. As a side note, it's necessary to add parallelStop, to kill all the threads - or you get thread blocked exceptions being raised. You create a thread pool waiting on a queue as a CAF, but after the last call to parallel_ finishes GHC realises the queue is unreachable. This means that all the threads are blocked indefinitely, so GHC raises exceptions. To solve this we call parallelStop to get all the threads to die nicely. Thanks Neil

Bulat Ziganshin

7:33 a.m.

New subject: Re[10]: Parallel combinator, performance advice

Hello Neil, Wednesday, April 8, 2009, 3:25:51 PM, you wrote:

...

I've attached a revised implementation. With my benchmark it gives a stack overflow:

it may be in replicate. check with let incRef = atomicModifyIORef r (\a -> (a,a))

...

As a side note, it's necessary to add parallelStop, to kill all the threads - or you get thread blocked exceptions being raised.

alternatively, you can catch this exception in addWorker -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell

8:28 a.m.

New subject: Re[10]: Parallel combinator, performance advice

Hi Bulat,

...

...
I've attached a revised implementation. With my benchmark it gives a stack overflow:

let incRef = atomicModifyIORef r (\a -> (a,a))

That was the problem, I now no longer get a stack overflow.

...

...
As a side note, it's necessary to add parallelStop, to kill all the threads - or you get thread blocked exceptions being raised.

alternatively, you can catch this exception in addWorker

I could, but it wouldn't be that easy to tell if the reason for the deadlock was the program finishing a real deadlock occurring in the system. I'd rather not catch exceptions where possible. Thanks Neil

Gleb Alexeyev

9:12 a.m.

Neil Mitchell wrote:

...

I've just found that QSem _ISN'T_ valid below 0, so the above code won't actually work with QSem as it stands.

I may be missing something, but I guess this code will work if 'newQSem (-n)' is replaced with 'newQSemN n', every 'signalQSem sem' with 'signalQSemN sem 1' and 'waitQSem sem' with 'waitQSemN sem n'.

Neil Mitchell

9:19 a.m.

Hi Gleb,

...

...
I've just found that QSem _ISN'T_ valid below 0, so the above code won't actually work with QSem as it stands.

I may be missing something, but I guess this code will work if 'newQSem (-n)' is replaced with 'newQSemN n', every 'signalQSem sem' with 'signalQSemN sem 1' and 'waitQSem sem' with 'waitQSemN sem n'.

Yep, that seems valid. However, the new solution requires an MVar to store the number of things we're waiting for anyway, so we might as well use that to also wake up the main thread. Semaphores aren't more efficient than MVar's, they are in fact implemented in terms of MVar's, so there is no benefit. Thanks Neil

Bulat Ziganshin

7 Apr 7 Apr

11:40 a.m.

New subject: Re[2]: Parallel combinator, performance advice

Hello Neil, Tuesday, April 7, 2009, 6:13:29 PM, you wrote:

...

Calls to System.Cmd.system, i.e. running external console processes. It's a make system I'm writing, so virtually all the time is spent in calls to ghc etc.

btw, if all that you need is to limit amount of simultaneous System.Cmd.system calls, you may go from opposite side: wrap this call into semaphore: sem = unsafePerformIO$ newQSem numCapabilities mysystem = bracket_ (waitQSem sem) (signalQSem sem) . system and implement para as simple thread population: para = mapM_ forkIO -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell

11:47 a.m.

New subject: Re[2]: Parallel combinator, performance advice

Hi Bulat,

...

btw, if all that you need is to limit amount of simultaneous System.Cmd.system calls, you may go from opposite side: wrap this call into semaphore:

sem = unsafePerformIO$ newQSem numCapabilities

mysystem = bracket_ (waitQSem sem) (signalQSem sem) . system

and implement para as simple thread population:

para = mapM_ forkIO

My main motivation is to limit the number of system calls, but it's also useful from a user point of view if the system is doing a handful of things at a time - it makes it easier to track what's going on. I might try that tomorrow and see if it makes a difference to the performance. While the majority of computation is in system calls, quite a few of the threads open files etc, and having them all run in parallel would end up with way too many open handles etc. Thanks Neil

Bulat Ziganshin

11:58 a.m.

New subject: Re[4]: Parallel combinator, performance advice

Hello Neil, Tuesday, April 7, 2009, 7:47:17 PM, you wrote:

...

...
para = mapM_ forkIO

...

I might try that tomorrow and see if it makes a difference to the performance. While the majority of computation is in system calls, quite a few of the threads open files etc, and having them all run in parallel would end up with way too many open handles etc.

if you have too much threads, you may replace forkIO with one more QSem-enabled call: semIO = unsafePerformIO$ newQSem 100 myForkIO = bracket_ (waitQSem semIO) (signalQSem semIO) . forkIO this limit may be much higher than for System.Cmd.system or you may go further and replace it with thread pool approach. the main problem behind is raw calls to forkIO since these increases amount of threads capable to call System.Cmd.system without any control from us -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Sebastian Sylvan

11:02 a.m.

This is a random idea, that's probably not going to work, but I don't have a way of testing it so I'll just post it! How about using unsafeInterleaveIO to get a lazy suspension of the result of each action, and then using par to spark off each of them? If that works you can reuse the existing task-parallel system of GHC to do the heavily lifting for you, instead of having to write your own. On Tue, Apr 7, 2009 at 11:25 AM, Neil Mitchell wrote:

...

Hi,

I've written a parallel_ function, code attached. I'm looking for criticism, suggestions etc on how to improve the performance and fairness of this parallel construct. (If it turns out this construct is already in a library somewhere, I'd be interested in that too!)

The problem I'm trying to solve is running system commands in parallel. Importantly (unlike other Haskell parallel stuff) I'm not expecting computationally heavy Haskell to be running in the threads, and only want a maximum of n commands to fire at a time. The way I'm trying to implement this is with a parallel_ function:

parallel_ :: [IO a] -> IO ()

The semantics are that after parallel_ returns each action will have been executed exactly once. The implementation (attached) creates a thread pool of numCapabililties-1 threads, each of which reads from a task pool and attempts to do some useful work. I use an idempotent function to ensure that all work is done at most one, and a sequence_ to ensure all work is done at least once.

Running a benchmark of issuing 1 million trivial tasks (create, modify, read and IO ref) the version without any parallelism is really fast (< 0.1 sec), and the version with parallelism is slow (> 10 sec). This could be entirely due to space leaks etc when queueing many tasks.

I'm useful for any thoughts people might have!

Thanks in advance,

Neil

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862

5889

Age (days ago)

5890

Last active (days ago)

List overview

Download

25 comments

5 participants

participants (5)

Bulat Ziganshin
ChrisK
Gleb Alexeyev
Neil Mitchell
Sebastian Sylvan