Slow MVar when compiled with -threaded

I have a test network client, something like the Apache bench tool. It uses MVars to synchronize, and everything is fine when compiled without -threaded:

real 0m2.995s
user 0m0.601s
sys 0m2.391s

With the -threaded compile option the result is the following:

real 0m18.196s
user 0m2.054s
sys 0m3.313s

It seems that the program is sleeping most of the time for some reason. I can't explain this behavior, as the program looks fine to me: it starts `concurrency` threads, each of which waits on an MVar to process the next task. The program follows:

{-# Language OverloadedStrings #-}
import System.CPUTime
import System.IO
--import System.IO.Error
import Network.Socket hiding (recv)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent
import Control.Exception

main = do
    n <- getArgs
    let (host, port, conc, reqs) = parse n
    putStrLn $ "Connecting to " ++ host ++ " " ++ port
    s <- getAddrInfo Nothing (Just host) (Just port)
    let servAddr = head s
    begin <- getCPUTime
    process servAddr conc reqs
    end <- getCPUTime
    let diff = (fromIntegral (end - begin)) / (10^12) :: Double
    putStrLn $ show (round (fromIntegral reqs / diff)) ++ " r/s"

parse [h,p,conc,reqs] = (h, p, read conc :: Int, read reqs :: Int)
parse _ = error "usage client host port concurrency requests"

process servAddr conc reqs = do
    let niter = if reqs >= conc then conc else reqs
    putStrLn $ "loop " ++ show niter
    mvars <- initThreads niter []
    putStrLn $ "Initialized " ++ show niter
    -- dispatch loop: hand out the remaining requests by filling whichever
    -- worker MVar is currently empty
    let loop n (m:mvs) f
            | n > 0 = do
                flag <- isEmptyMVar m
                if f > length mvars then putStrLn "busy" else return ()
                if flag || f > length mvars
                    then do
                        putMVar m ()
                        loop (n-1) mvs 0
                    else loop n mvs (f+1)
            | otherwise = return ()
        loop n [] f = if n > 0 then loop n mvars f else return ()
    putStrLn $ "length " ++ show (length mvars)
    loop (reqs - niter) mvars 0
  where
    initThreads niter vars
        | niter > 0 = do
            mvar <- newMVar ()
            forkIO $ process mvar
            initThreads (niter-1) (mvar:vars)
        | otherwise = return vars
    -- worker: perform one request, then block on the MVar until the
    -- dispatcher hands out more work
    process mvar = do
        sock <- socket (addrFamily servAddr) Stream defaultProtocol
        connect sock (addrAddress servAddr)
        sendAll sock "Hello World!\n"
        buf <- recv sock 1024
        close sock
        takeMVar mvar
        process mvar

Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X, since it doesn't include the time that the process spent waiting on IO:

{-# Language OverloadedStrings #-}
import System.CPUTime
import Network.Socket hiding (recv)
import Network.Socket.ByteString
import Control.Exception (handle, IOException)
import System.Environment
import Control.Concurrent
import Control.Monad
import Data.Either (partitionEithers)
import Data.List (intercalate)
import Criterion.Measurement (time, time_, secs)

main :: IO ()
main = do
    (host, port, conc, reqs) <- fmap parse getArgs
    putStrLn $ "Connecting to " ++ host ++ " " ++ port
    (servAddr:_) <- getAddrInfo Nothing (Just host) (Just port)
    (diff, results) <- time $ process servAddr conc reqs
    let (errs, succs) = partitionEithers results
        numSuccs = length succs
        numErrs  = length errs
        succTime = sum succs
        succAvg  = succTime / fromIntegral numSuccs
    putStrLn $ unwords [show numSuccs, "successes,", show numErrs, "errors in", secs diff]
    when (numSuccs > 0) $ do
        putStrLn $ "min/max/avg request time: " ++
            intercalate " / " (map secs [minimum succs, maximum succs, succAvg])
        putStrLn $ show (round (fromIntegral reqs / diff) :: Int) ++ " r/s"

parse :: [String] -> (String, String, Int, Int)
parse [h,p,conc,reqs] = (h, p, read conc, read reqs)
parse _ = error "usage client host port concurrency requests"

process :: AddrInfo -> Int -> Int -> IO [Either IOException Double]
process servAddr conc reqs = do
    reqChan <- newChan
    ackChan <- newChan
    -- each worker pulls a unit of work from reqChan and reports its timing
    -- (or the IOException) on ackChan
    let processThread = forever $ do
            _ <- readChan reqChan
            handle (return . Left) (fmap Right socketAction) >>= writeChan ackChan
        socketAction = time_ $ do
            sock <- socket (addrFamily servAddr) Stream defaultProtocol
            connect sock (addrAddress servAddr)
            sendAll sock "GET /\r\n\r\n"
            void $ recv sock 1024
            close sock
    replicateM_ reqs (writeChan reqChan ())
    replicateM_ (min conc reqs) (forkIO processThread)
    replicateM reqs (readChan ackChan)

On 01/07/2014 09:39 PM, Bob Ippolito wrote:
Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:

Great, thank you very much. You gave me material for learning ;) However, my version is significantly faster when compiled without -threaded. With the -threaded option your version is much faster than mine, but both are significantly slower than when compiled without -threaded. Thanks!

On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
Great, thank you very much. You gave me material for learning ;) However, my version is significantly faster when compiled without -threaded. With the -threaded option your version is much faster than mine, but both are significantly slower than when compiled without -threaded.
Happy to look into it. I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language; its network stack is inconsistent at best). I need to know more:

How exactly are you compiling?
Which OS?
What version of GHC and Haskell Platform?
What is the exact command line you execute it with?
What timings do you get?
What's the code for the server you are connecting to?
Loopback, local network, or internet?

-bob

On 01/08/2014 04:15 AM, Bob Ippolito wrote:
Happy to look into it. I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language; its network stack is inconsistent at best). I need to know more:
How exactly are you compiling?
ghc-7.6.3 --make -O2 client.hs
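(For reference, the exact -threaded build command isn't shown in the thread; presumably it was something along the lines of

ghc-7.6.3 --make -O2 -threaded -rtsopts client.hs

with -rtsopts added so that RTS options such as +RTS -N can be passed at run time.)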
Which OS?
Ubuntu 13.10 with 3.13-rc7 kernel.
What version of GHC and Haskell Platform?
bmaxa@maxa:~$ apt-cache policy haskell-platform
haskell-platform:
  Installed: 2013.2.0.0
  Candidate: 2013.2.0.0
  Version table:
 *** 2013.2.0.0 0
        500 http://archive.ubuntu.com/ubuntu/ saucy/universe amd64 Packages
        100 /var/lib/dpkg/status
What is the exact command line you execute it with?
time ./client maxa 5055 1000 100000
What timings do you get?

With your version:

real 0m4.235s
user 0m1.589s
sys 0m2.642s

With my version:

real 0m3.010s
user 0m0.590s
sys 0m2.417s

That is, of course, without -threaded.
What's the code for the server you are connecting to?

import Network (listenOn, PortID(..))
import Network.Socket (accept, close)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent (forkIO)

main = do
    n <- getArgs
    let nn = (read . head) n :: Int
    putStrLn $ "Listening on " ++ show nn
    sock <- listenOn $ PortNumber $ fromIntegral nn
    serve sock

serve sock = do
    (s, _) <- accept sock
    forkIO $ process s      -- handle each connection in its own thread
    serve sock

process sock = do
    buf <- recv sock 1024   -- echo back whatever the client sent
    sendAll sock buf
    close sock
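(The thread doesn't show how the server was built or started; presumably something along the lines of:

ghc-7.6.3 --make -O2 server.hs
./server 5055

where server.hs is just a hypothetical filename for the code above.)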
Loopback, local network, or internet?
Loopback. Testing is on the same computer. Thanks!

On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
What is the exact command line you execute it with?
time ./client maxa 5055 1000 100000
What happens if you add +RTS -N to the end of that command line?
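For concreteness, assuming the binary was built with -threaded and -rtsopts so the RTS accepts the option, that would look something like:

time ./client maxa 5055 1000 100000 +RTS -N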
Great, I'll look closer later tonight or tomorrow morning.

I haven't yet been able to sort out the performance difference, and probably won't have time to dig in deeper with profiling tools today. I've put together a gist with all of the files and a cabal file so it's easy to build with the right options.
https://gist.github.com/etrepum/8312165
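The gist contents aren't reproduced in the thread. Purely as an illustration (not necessarily what the gist actually contains), the relevant part of such a cabal file would be an executable stanza that enables the threaded RTS, e.g.:

executable client
  main-is:        client.hs
  build-depends:  base, bytestring, network, criterion
  ghc-options:    -O2 -threaded -rtsopts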

So I dug a bit deeper (trying N different ways to write the code, ThreadScope, etc.) and I have some good news and some bad news.

The good news is that I figured out how to fix it, and it's an "easy" fix. The bad news is that the fix is to compile it with GHC HEAD (~7.8, I built ec4af3f), which has the new Mio high-performance multicore IO manager [1]. Apparently the old IO manager wasn't well suited to this use case.

With the latest GHC, the -threaded version outperforms the single-threaded version, and the numbers for both are better than with GHC 7.6.3.
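As an aside, when comparing -threaded runs it can help to confirm how many capabilities the runtime is actually using (for example after +RTS -N). A minimal sketch, not part of either program above:

import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
    caps <- getNumCapabilities
    putStrLn $ "running on " ++ show caps ++ " capabilities"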
I've updated the gist with a README that has some more details:
https://gist.github.com/etrepum/8312165
[1]
http://haskell.cs.yale.edu/wp-content/uploads/2013/08/hask035-voellmy.pdf
-bob

On 01/10/2014 01:03 AM, Bob Ippolito wrote:
The bad news is that the fix is to compile it with GHC HEAD (~7.8, I built ec4af3f), which has the new Mio high-performance multicore IO manager [1]. Apparently the old IO manager wasn't well suited to this use case.
With the latest GHC, the -threaded version outperforms the single threaded version, and the numbers for both are better than with GHC 7.6.3.
Thanks. Glad to hear that!
participants (2):
- Bob Ippolito
- Branimir Maksimovic