Odd parallel Haskell observations

Hello. I am exploring Haskell's features for parallel and concurrent programming and I am seeing something I find difficult to explain. In brief: asking the RTS to use more threads results in an awful drop in performance, even though, according to 'top', the test program consumes up to N CPUs' worth of power. Am I doing something wrong? I attached the code, but I am just issuing thousands of HTTP GET requests from 1-4 forkIO threads. Since it looks like the local apache is faster than the Haskell program (which is a pity), I expected that using more OS threads would improve performance.

Just in case:

ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1

import Data.List
import System.IO
import qualified System.IO.UTF8
import System.Environment (getArgs)
import Network.HTTP
import Network.URI
import System.Time
import System.IO.Unsafe
import Control.Monad
import Control.Exception
import Control.Concurrent
import Control.Concurrent.MVar

secDiff :: ClockTime -> ClockTime -> Float
secDiff (TOD secs1 psecs1) (TOD secs2 psecs2) =
    fromInteger (psecs2 - psecs1) / 1e12 + fromInteger (secs2 - secs1)

-- single get
get :: Int -> IO String
get id = do
    res <- simpleHTTP $ getRequest "http://127.0.0.1"
    case res of
        Left err  -> return (show err)
        Right rsp -> return (show $ rspCode rsp)

-- perform a GET for each list element, using c threads
doList :: [Int] -> Int -> IO ()
doList ids 0 = return ()
doList []  c = return ()
doList ids c = do
    forkChild $ forM_ todo get
    doList later (c - 1)
  where
    (todo, later) = splitAt (length ids `div` c) ids

{- Copied from
   http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Control-Concu...
   "Terminating the program" -}
children :: MVar [MVar ()]
children = unsafePerformIO (newMVar [])

waitForChildren :: IO ()
waitForChildren = do
    cs <- takeMVar children
    case cs of
        []   -> return ()
        m:ms -> do
            putMVar children ms
            takeMVar m
            waitForChildren

forkChild :: IO () -> IO ThreadId
forkChild io = do
    mvar <- newEmptyMVar
    childs <- takeMVar children
    putMVar children (mvar:childs)
    forkIO (io `finally` putMVar mvar ())
-- end of copied code

main = do
    [c', n'] <- getArgs
    let c = read c' :: Int
        n = read n' :: Int
    start <- getClockTime
    doList [1..n] c
    waitForChildren
    end <- getClockTime
    putStrLn $ show c ++ " " ++ show (secDiff start end) ++ "s"

20:31 sacha@loft4633:/tmp 21> ghc --make -threaded get.hs
[1 of 1] Compiling Main             ( get.hs, get.o )
Linking get ...
20:31 sacha@loft4633:/tmp 22> ./get 1 10000
1 3.242352s
20:31 sacha@loft4633:/tmp 23> ./get 2 10000
2 3.08306s
20:31 sacha@loft4633:/tmp 24> ./get 2 10000 +RTS -N2
2 6.898871s
20:32 sacha@loft4633:/tmp 25> ./get 3 10000
3 2.950677s
20:32 sacha@loft4633:/tmp 26> ./get 3 10000 +RTS -N2
3 7.381678s
20:32 sacha@loft4633:/tmp 27> ./get 3 10000 +RTS -N3
3 14.683548s
20:32 sacha@loft4633:/tmp 28> ./get 4 10000
4 3.332165s
20:33 sacha@loft4633:/tmp 29> ./get 4 10000 +RTS -N4 -s
./get 4 10000 +RTS -N4 -s
4 57.17923s

  2,147,969,912 bytes allocated in the heap
     49,059,288 bytes copied during GC
        736,656 bytes maximum residency (98 sample(s))
        486,744 bytes maximum slop
              5 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   949 collections,   948 parallel, 76.73s, 25.67s elapsed
  Generation 1:    98 collections,    98 parallel,  7.70s,  2.56s elapsed

  Parallel GC work balance: 2.17 (6115428 / 2822692, ideal 4)

                      MUT time (elapsed)   GC time (elapsed)
  Task  0 (worker) :   1.43s  ( 27.76s)     6.31s  (  2.12s)
  Task  1 (worker) :   0.00s  ( 28.13s)    10.62s  (  3.56s)
  Task  2 (worker) :   0.37s  ( 28.63s)    11.06s  (  3.69s)
  Task  3 (worker) :   0.00s  ( 28.95s)     6.29s  (  2.10s)
  Task  4 (worker) :  20.73s  ( 28.95s)     9.68s  (  3.24s)
  Task  5 (worker) :   0.00s  ( 28.95s)     0.60s  (  0.20s)
  Task  6 (worker) :  21.81s  ( 28.95s)    11.91s  (  3.97s)
  Task  7 (worker) :  18.59s  ( 28.95s)    13.04s  (  4.36s)
  Task  8 (worker) :  17.24s  ( 28.96s)    14.92s  (  4.99s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   79.23s  ( 28.95s elapsed)
  GC    time   84.43s  ( 28.23s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time  162.49s  ( 57.19s elapsed)

  %GC time      52.0%  (49.4% elapsed)

  Alloc rate    27,513,782 bytes per MUT second

  Productivity  48.0% of total user, 136.5% of total elapsed

gc_alloc_block_sync: 15006
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 7617
gen[0].steps[1].sync_large_objects: 35
gen[1].steps[0].sync_large_objects: 1400

Hello once again. So I continue to issue thousands of HTTP GET requests to a local apache, and I have got some ThreadScope pictures and numbers (BTW, all of this happens on a 4-core machine). I would point out the following as deserving an explanation:

1. It looks like none of the tests used the resources of more than 2 cores.
2. Sometimes there is no activity in any thread of a running program (get.N4qg_withgaps.eventlog.png; does this mean that the process is in an OS queue for scheduling, or something else?)
3. Without the RTS's -c or -qg, a multithreaded run suffers from excessive GC activity.
4. Even with -c/-qg, the threads' runs look like they are interrupted too frequently.
5. Given that 10000 requests in a row can be completed in ~3.4s, I would expect 4 threads to come close to or even under 1s, but 1.9s was the best result.

10000 requests sequentially:

18:02 sacha@loft4633:/tmp 81> for i in `seq 1 5`; do ./get 1 10000; done
1 3.320623s
1 3.381713s
1 3.394865s
1 3.470693s
1 3.336757s

Using 3 threads and no RTS support:

18:20 sacha@loft4633:/tmp 101> for i in `seq 1 5`; do ./get 3 10000; done
3 3.055638s
3 3.226093s
3 3.079507s
3 3.087109s
3 3.279543s

Using 3 threads with RTS support:

17:58 sacha@loft4633:/tmp 80> for i in `seq 1 5`; do ./get 3 10000 +RTS -N3; done
3 10.971925s
3 9.805483s
3 10.644644s
3 10.734496s
3 9.008394s

Using 3 threads with RTS -c:

17:58 sacha@loft4633:/tmp 79> for i in `seq 1 5`; do ./get 3 10000 +RTS -N3 -c; done
3 2.198638s
3 2.203567s
3 2.190789s
3 2.258964s
3 2.21575s

4 threads with RTS -c:

18:22 sacha@loft4633:/tmp 103> for i in `seq 1 5`; do ./get 4 10000 +RTS -N4 -c; done
4 1.998448s
4 1.970375s
4 2.016234s
4 1.946955s
4 1.983241s

4 threads, parallel GC disabled:

18:35 sacha@loft4633:/tmp 147> for i in `seq 1 5`; do ./get 4 10000 +RTS -N4 -qg; done
4 2.511088s
4 1.939382s
4 1.888141s
4 1.916858s
4 1.913141s

The worst results are achieved with plain -N4:

18:43 sacha@loft4633:/tmp 157> for i in `seq 1 5`; do ./get 4 10000 +RTS -N4; done
4 20.043056s
4 17.02037s
4 16.00271s
4 17.061253s
4 15.914139s
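(For reference, a plausible recipe for producing such eventlogs for ThreadScope, assuming GHC 6.12's -eventlog support; the flags and file names here are illustrative, not necessarily the exact ones used for the pictures above:)

ghc --make -threaded -eventlog get.hs
./get 4 10000 +RTS -N4 -qg -ls    # +RTS -ls writes get.eventlog
threadscope get.eventlog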

Hello, I am no expert in web server tuning, but I will share my thoughts about your approach and expectations nevertheless.

On 08.08.2010, at 21:07, Alexander Kotelnikov wrote:
So I continue to issue thousands of HTTP GET requests to a local apache, and I have got some ThreadScope pictures and numbers (BTW, all of this happens on a 4-core machine).
Your apache configuration is crucial for the performance figures you will get from your "benchmark". As far as I know, web server benchmarks usually run continuously and report the current throughput or the average latency over the last n seconds, or something like that. This allows the tested web server to adapt to the kind of load it experiences, and the benchmarker can wait until those numbers stabilize. When you execute your program with a different number of capabilities (different -N settings), apache will see a different kind of load and behave differently. This makes it hard to change your program and expect comparable results.
I would point out the following as deserving an explanation: 1. It looks like none of the tests used the resources of more than 2 cores.
This might be an indication that CPU resources are not the limiting factor in this benchmark. You are basically benchmarking the I/O capabilities of your operating system, of your installed apache with your configuration, and of your installed GHC. Therefore, increasing the available CPU resources does not necessarily lead to increased performance.
2. Sometimes there is no activity in any thread of a running program (get.N4qg_withgaps.eventlog.png; does this mean that the process is in an OS queue for scheduling, or something else?)
3. Without the RTS's -c or -qg, a multithreaded run suffers from excessive GC activity. 4. Even with -c/-qg, the threads' runs look like they are interrupted too frequently. 5. Given that 10000 requests in a row can be completed in ~3.4s, I would expect 4 threads to come close to or even under 1s, but 1.9s was the best result.
A last point to consider: is getRequest strict? Does it internally use some kind of lazy IO? Is it possible that some resources aren't properly freed? Perhaps because the library relies on the garbage collector to reclaim sockets? Or because the requests aren't completely read? I simply don't know the internals of Network.HTTP, but if it uses lazy IO it is, IMHO, not suitable for such a benchmark. Just my two euro cents. -- Jean
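P.S. To make the lazy-IO question concrete, here is a minimal sketch of a stricter variant of get that forces the whole response body before returning; it assumes rspBody from Network.HTTP and evaluate from Control.Exception, both already imported by the original program:

import Control.Exception (evaluate)
import Network.HTTP

-- Force the entire response body before returning, so that no lazily
-- deferred reads can keep the socket or buffers alive.
getStrict :: Int -> IO String
getStrict _ = do
    res <- simpleHTTP $ getRequest "http://127.0.0.1"
    case res of
        Left err  -> return (show err)
        Right rsp -> do
            _ <- evaluate (length (rspBody rsp))  -- demand the full body now
            return (show (rspCode rsp))

If swapping this in changes the timings, lazy IO (or GC-driven socket reclamation) is a likely suspect.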

Hi.
On Mon, 9 Aug 2010 09:44:00 +0200, Jean-Marie Gaillourdet ("JG") wrote:

JG> I am no expert in web server tuning, but I will share my thoughts
JG> about your approach and expectations nevertheless.
I would rather think about GHC than about the web server. I believe the numbers I already provided (especially their deviation) illustrate that sometimes the GHC runtime performs quite badly. I also found out that it can do better than it does by default. I can accept that the runtime might not be capable of adjusting itself to every task in the best way, but then it would be nice to know, for instance, why for this kind of I/O a change of GC settings changes performance severalfold.

To illustrate further that httpd has nothing to do with the phenomenon, I wrote a small C application which does pretty much the same. The numbers tell that apache can serve much faster than the Haskell version ever required of it (I will remind you that it was 3.4s for single-threaded Haskell, going as low as 1.9s with 4 threads). I attached the code just in case.

10:27 sacha@loft4633:~/work/cube-server/tests 99> for i in 1 2 3 4; do for j in `seq 1 5`; do ./getc $i 10000; done; done
1 1.352978
1 1.348888
1 1.344545
1 1.345116
1 1.189060
2 0.668219
2 0.625113
2 0.698073
2 0.732621
2 0.722310
3 0.569121
3 0.581570
3 0.563512
3 0.566186
3 0.564232
4 0.510132
4 0.496181
4 0.529212
4 0.504506
4 0.511847

sacha:
Hi.
On Mon, 9 Aug 2010 09:44:00 +0200, Jean-Marie Gaillourdet ("JG") wrote:

JG> I am no expert in web server tuning, but I will share my thoughts
JG> about your approach and expectations nevertheless.

I would rather think about GHC than about the web server. I believe the numbers I already provided (especially their deviation) illustrate that sometimes the GHC runtime performs quite badly.
I also found out that it can do better than it does by default. I can accept that the runtime might not be capable of adjusting itself to every task in the best way, but then it would be nice to know, for instance, why for this kind of I/O a change of GC settings changes performance severalfold.
I'd consider boiling this down into a small test case and asking Simon Marlow -- runtime hacker uber-guru -- to help. -- Don
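P.S. A minimal sketch of such a reduced test case, with HTTP removed entirely so that only forkIO, MVar synchronization, and the parallel GC are exercised (the allocation loop is a rough, hypothetical stand-in for the per-request work of building Response values):

import Control.Concurrent
import Control.Monad
import System.Environment

-- Each worker just allocates a small transient list per iteration.
worker :: Int -> IO ()
worker n = forM_ [1 .. n] $ \i ->
    length (replicate (i `mod` 100) ()) `seq` return ()

main :: IO ()
main = do
    [c', n'] <- getArgs
    let c = read c' :: Int
        n = read n' :: Int
    dones <- replicateM c newEmptyMVar
    mapM_ (\m -> forkIO (worker (n `div` c) >> putMVar m ())) dones
    mapM_ takeMVar dones

If this shows the same -N slowdown, the HTTP library is off the hook; if not, the problem lies somewhere in the I/O path.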

GHC seems to have a few bottlenecks once you start to really stress-test its I/O performance. Using a newer HEAD GHC actually gives less awful performance:

scpmw@cslin209 ~/test $ ghc --make -O2 -threaded -rtsopts get.hs
[1 of 1] Compiling Main             ( get.hs, get.o )
Linking get ...
scpmw@cslin209 ~/test $ ./get 1 10000
1 6.25696s
scpmw@cslin209 ~/test $ ./get 1 10000
1 5.409605s
scpmw@cslin209 ~/test $ ./get 2 10000
2 3.827393s
scpmw@cslin209 ~/test $ ./get 2 10000 +RTS -N2
2 4.274985s
scpmw@cslin209 ~/test $ ./get 3 10000
3 3.692725s
scpmw@cslin209 ~/test $ ./get 3 10000 +RTS -N2
3 4.186283s
scpmw@cslin209 ~/test $ ./get 3 10000 +RTS -N3
3 4.303649s

That it still does not speed up with more capabilities might be a consequence of Haskell's internal implementation of non-blocking I/O. If I understand the situation correctly, all events currently pass through one OS thread (the I/O manager). That would explain nicely why you can't get more than single-core performance out of your program. In case you are interested in the details, search for the new "Scalable Event Handling for GHC" paper by Bryan O'Sullivan and Johan Tibell.

Greetings,
Peter Wortmann
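P.S. One way to probe the single-I/O-manager theory: run c copies of the single-threaded ./get as separate OS processes, each with its own RTS and I/O manager, and compare against one process with +RTS -Nc. A minimal sketch using System.Process (c and n hardcoded for brevity; ./get is the program from the original post):

import Control.Monad (replicateM)
import System.Process (runCommand, waitForProcess)

main :: IO ()
main = do
    let c = 4        -- number of OS processes
        n = 10000    -- total requests, split evenly between them
    -- each child is a single-threaded ./get with its own I/O manager
    hs <- replicateM c (runCommand ("./get 1 " ++ show (n `div` c)))
    mapM_ waitForProcess hs

If the separate processes scale the way the C version does, the shared I/O manager (rather than apache) is the likely bottleneck.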
participants (4):
- Alexander Kotelnikov
- Don Stewart
- Jean-Marie Gaillourdet
- Peter Wortmann