
Hello. I am exploring haskell features for parallel and cocurrent programming and see something difficult to explain. In brief - asking RTS to use more threads results in awfull drop of performance. And according to 'top' test programm consumes up to N CPUs power. Am I doing something wrong? I attached the code, but I am just issuing thousands of HTTP GET requests in 1-4 forkIO threads. And since it looks like local apache is faster than haskell program (which is a pity) I expected that using more OS threads should improve performance. Just in case: ghc --version The Glorious Glasgow Haskell Compilation System, version 6.12.1 import Data.List import System.IO import qualified System.IO.UTF8 import System.Environment (getArgs) import Network.HTTP import Network.URI import System.Time import System.IO.Unsafe import Control.Monad import Control.Exception import Control.Concurrent import Control.Concurrent.MVar secDiff :: ClockTime -> ClockTime -> Float secDiff (TOD secs1 psecs1) (TOD secs2 psecs2) = fromInteger (psecs2 - psecs1) / 1e12 + fromInteger (secs2 - secs1) -- single get get :: Int -> IO(String) get id = do res <- simpleHTTP $ getRequest "http://127.0.0.1" case res of Left err -> return(show err) Right rsp -> return(show $ rspCode rsp) -- perform GET per each list element using c threads doList :: [Int] -> Int -> IO() doList ids 0 = return() doList [] c = return() doList ids c = do forkChild $ forM_ todo get doList later (c-1) where (todo, later) = splitAt (length ids `div` c) ids {- Copied from http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Control-Concu... Terminating the program -} children :: MVar [MVar ()] children = unsafePerformIO (newMVar []) waitForChildren :: IO () waitForChildren = do cs <- takeMVar children case cs of [] -> return () m:ms -> do putMVar children ms takeMVar m waitForChildren forkChild :: IO () -> IO ThreadId forkChild io = do mvar <- newEmptyMVar childs <- takeMVar children putMVar children (mvar:childs) forkIO (io `finally` putMVar mvar ()) -- end of copied code main = do [c', n'] <- getArgs let c = read c' :: Int n = read n' :: Int start <- getClockTime doList [1..n] c waitForChildren end <- getClockTime putStrLn $ show(c) ++ " " ++ show(secDiff start end) ++ "s" 20:31 sacha@loft4633:/tmp 21> ghc --make -threaded get.hs [1 of 1] Compiling Main ( get.hs, get.o ) Linking get ... 20:31 sacha@loft4633:/tmp 22> ./get 1 10000 1 3.242352s 20:31 sacha@loft4633:/tmp 23> ./get 2 10000 2 3.08306s 20:31 sacha@loft4633:/tmp 24> ./get 2 10000 +RTS -N2 2 6.898871s 20:32 sacha@loft4633:/tmp 25> ./get 3 10000 3 2.950677s 20:32 sacha@loft4633:/tmp 26> ./get 3 10000 +RTS -N2 3 7.381678s 20:32 sacha@loft4633:/tmp 27> ./get 3 10000 +RTS -N3 3 14.683548s 20:32 sacha@loft4633:/tmp 28> ./get 4 10000 4 3.332165s 20:33 sacha@loft4633:/tmp 29> ./get 4 10000 +RTS -N4 -s ./get 4 10000 +RTS -N4 -s 4 57.17923s 2,147,969,912 bytes allocated in the heap 49,059,288 bytes copied during GC 736,656 bytes maximum residency (98 sample(s)) 486,744 bytes maximum slop 5 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 949 collections, 948 parallel, 76.73s, 25.67s elapsed Generation 1: 98 collections, 98 parallel, 7.70s, 2.56s elapsed Parallel GC work balance: 2.17 (6115428 / 2822692, ideal 4) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 1.43s ( 27.76s) 6.31s ( 2.12s) Task 1 (worker) : 0.00s ( 28.13s) 10.62s ( 3.56s) Task 2 (worker) : 0.37s ( 28.63s) 11.06s ( 3.69s) Task 3 (worker) : 0.00s ( 28.95s) 6.29s ( 2.10s) Task 4 (worker) : 20.73s ( 28.95s) 9.68s ( 3.24s) Task 5 (worker) : 0.00s ( 28.95s) 0.60s ( 0.20s) Task 6 (worker) : 21.81s ( 28.95s) 11.91s ( 3.97s) Task 7 (worker) : 18.59s ( 28.95s) 13.04s ( 4.36s) Task 8 (worker) : 17.24s ( 28.96s) 14.92s ( 4.99s) SPARKS: 0 (0 converted, 0 pruned) INIT time 0.00s ( 0.00s elapsed) MUT time 79.23s ( 28.95s elapsed) GC time 84.43s ( 28.23s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 162.49s ( 57.19s elapsed) %GC time 52.0% (49.4% elapsed) Alloc rate 27,513,782 bytes per MUT second Productivity 48.0% of total user, 136.5% of total elapsed gc_alloc_block_sync: 15006 whitehole_spin: 0 gen[0].steps[0].sync_large_objects: 7617 gen[0].steps[1].sync_large_objects: 35 gen[1].steps[0].sync_large_objects: 1400