I recently ran into a problem writing a program that I hoped I could speed up by running over multiple CPUs.  I want non-haskell users to be able to run the tool and take advantage of multiple CPUs.  *But* there is a serious slowdown when the RTS is run with -N and some of the CPUs are already busy.

This is already mentioned in the GHC docs, but the problem I experienced was more serious than I expected causing a slowdown of around 2x compared to running with a single CPU. 

I reproduced the problem with the following code from the haskell wiki:

{-# LANGUAGE BangPatterns #-}
    import Data.Digest.Pure.MD5
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import Control.Concurrent
    import Control.Monad (replicateM_)
 
    main = do
        files <- getArgs
        str <- newEmptyMVar
        mapM_ (forkIO . hashAndPrint str) files
        printNrResults (length files) str
 
    printNrResults i var = replicateM_ i (takeMVar var >>= putStrLn)
 
    hashAndPrint str f = do
        bs <- L.readFile f
        let !h = show $ md5 bs
        putMVar str (f ++ ": " ++ h)


When run on 4 idle CPU cores, I get the following wall clock times:
  ./run +RTS -N1   : 20.4 sec
  ./run +RTS -N2   : 11.0 sec
  ./run +RTS -N4   : 6.7 sec

When run on the same 4 core machine, but with 2 cores already busy:
  ./run +RTS -N1   : 23.5 sec
  ./run +RTS -N2   : 14.1 sec
  ./run +RTS -N4   : 57.8 sec   <---- Blowout...

This is quite a problem in practice when running on a shared server.  Is there anything that can be done to address this?

(I wrote up a few more details here: http://thunking.drp.id.au/2012/06/slowdown-with-ghc-when-using-multiple.html)

Thanks,

--
David Powell