
Joel Reymont wrote:
I don't think CPU usage is the issue. An individual thread will take a fraction of a second to deserialize a large packet.
It's a combination of CPU usage by the pickler and GC load. Those 50k packets take 0.03 seconds to unpickle (version of unstuff.hs to measure that is attached). With 100 threads running, even with a completely fair scheduler the time taken for one thread to unpickle that packet is going to be 3 seconds. If optimisation is turned on, the time to unpickle that packet goes down to 0.007 seconds (on my machine). So, it should be more like 400 threads before a fair scheduler would run into problems, but that doesn't take into account GC load, which increases with more threads running, so in fact you still run into trouble with 50 threads. You can reduce the GC load using RTS options: eg. +RTS -H256m and that will reduce the number of timeouts you get. GHC's scheduler may not be completely fair, but I haven't found anything gratuitous in my investigations. Sometimes the time between context switches is more like 0.04 seconds instead of 0.02 seconds, and I still don't understand exactly why, but that's not a serious issue. Somtimes a thread is unlucky enough to have to do a major GC during its timeslice, so it doesn't get its fair share of CPU, but the effect is random and therefore amortized (this isn't a realtime system, after all). What can you do about this? (a) improve performance of the unpickler. As you say, a 2x boost here will double the number of threads you can run in parallel in the time limits you have. (b) try to reduce your heap residency, which will reduce GC load which again means you can run more threads in parallel without hitting your limits. (b) try to manage your latency better, by limiting the number of threads that try to unpickle in parallel. You may reduce the GC load this way, too. Or failing that, just get faster hardware. Or more CPUs, and use GHC's new SMP support. I'm surprised if your real application is this CPU-bound, though. The network communication latency should mean you can run a lot more threads, provided you can improve that pickler so it isn't the bottleneck. I don't have a lot of time to investigate the unpickling code in detail, but I have worked on similar problems in the past and I know that the unpickler in GHC is very fast, for example. It is derived from the original nhc98 interface, with tweaks by me to improve performance, and later NewBinary was derived from it. I haven't measured NewBinary's performance relative to GHC's Binary library, but I don't expect there to be much difference. Cheers, Simon module Main where import System.IO import System.Time import System.Environment import Control.Monad import Control.Concurrent import Control.Exception import Foreign import Pickle import Endian import Util import ZLib import Records import Prelude hiding (read) import Text.Printf main = do args <- getArgs process (head args) 100 waitToFinish {-# NOINLINE lock #-} lock :: MVar () lock = unsafePerformIO $ newMVar () trace s = withMVar lock $ const $ putStrLn s process _ 0 = return () process file n = do h <- openBinaryFile file ReadMode forkChild $ read_ h process file (n - 1) read_ :: Handle -> IO () read_ h = do cmd <- read h (\_ -> return ()) -- lots of ALERTs -- you should not get any alerts if you pass in trace -- below and comment the line above. the lock synch seems -- to have a magical effect -- cmd <- read h trace eof <- hIsEOF h unless eof $ read_ h read :: Handle -> (String -> IO ()) -> IO Command read h trace = do TOD time1 _ <- getClockTime allocaBytes 4 $ \p1 -> do hGetBuf h p1 4 TOD time2 _ <- getClockTime (size', _) <- unpickle endian32 p1 0 TOD time3 _ <- getClockTime let size = fromIntegral $ size' - 4 allocaBytes size $ \packet -> do TOD time4 _ <- getClockTime hGetBuf h packet size TOD time5 _ <- getClockTime cmd <- unstuff packet 0 size TOD time6 _ <- getClockTime trace $ "read: " ++ cmdDesc cmd ++ ": " ++ show (time6 - time1) ++ "s: " ++ show (time2 - time1) ++ "s, " ++ show (time3 - time2) ++ "s, " ++ show (time4 - time3) ++ "s, " ++ show (time5 - time4) ++ "s, " ++ show (time6 - time5) ++ "s" when (time6 - time5 > 3) $ fail $ "RED ALERT: time: " ++ show (time6 - time5) ++ "s, size: " ++ show size' ++ ", cmd: " ++ cmdDesc cmd return $! cmd psecdiff :: ClockTime -> ClockTime -> Integer psecdiff (TOD secs1 psecs1) (TOD secs2 psecs2) = psecs2 - psecs1 + (secs2*10^12 - secs1*10^12) unstuff :: Ptr Word8 -> Int -> Int -> IO Command unstuff ptr ix size = do t1@(TOD time1 _) <- getClockTime (kind, ix1) <- unpickle puCmdType ptr ix t2@(TOD time2 _) <- getClockTime -- when (size > 40000) $ hPutStrLn stderr "unpickle start" (cmd', _) <- unpickle (puCommand kind) ptr ix1 t3@(TOD time3 _) <- getClockTime -- let d = psecdiff t1 t3 -- (secs,psecs) = d `quotRem` (10^12) -- hPrintf stdout "size: %5d, time: %3d.%06d\n" size secs (psecs `quot` 10^6) when (time3 - time1 > 3) $ fail $ "ORANGE ALERT: " ++ show (time2 - time1) ++ "s, " ++ show (time3 - time2) ++ "s, " ++ cmdDesc cmd' ++ ", ix1: " ++ show ix1 ++ ", size: " ++ show size case cmd' of SrvCompressedCommands sz (bytes, ix, src_sz) -> do TOD time1 _ <- getClockTime let sz' = fromIntegral sz allocaBytes sz' $ \dest -> do n <- uncompress (plusPtr bytes ix) src_sz dest sz' TOD time2 _ <- getClockTime when (time2 - time1 > 3) $ fail $ "YELLOW ALERT: time: " ++ show (time2 - time1) ++ "s, size: " ++ show sz ++ ", array: " ++ show bytes cmd'' <- unstuff dest 4 n return $! cmd'' _ -> return cmd'