[Haskell-cafe] Re: Joels Time Leak

4 Jan 2006

      Joel Reymont wrote:
...
I don't think CPU usage is the issue. An individual thread will take  a 
fraction of a second to deserialize a large packet.
It's a combination of CPU usage by the pickler and GC load.

Those 50k packets take 0.03 seconds to unpickle (version of unstuff.hs
to measure that is attached).  With 100 threads running, even with a
completely fair scheduler the time taken for one thread to unpickle that
packet is going to be 3 seconds.

If optimisation is turned on, the time to unpickle that packet goes down
to 0.007 seconds (on my machine).  So, it should be more like 400
threads before a fair scheduler would run into problems, but that
doesn't take into account GC load, which increases with more threads
running, so in fact you still run into trouble with 50 threads.  You can
reduce the GC load using RTS options: eg. +RTS -H256m and that will
reduce the number of timeouts you get.

GHC's scheduler may not be completely fair, but I haven't found anything
gratuitous in my investigations.  Sometimes the time between context
switches is more like 0.04 seconds instead of 0.02 seconds, and I still
don't understand exactly why, but that's not a serious issue.  Somtimes
a thread is unlucky enough to have to do a major GC during its
timeslice, so it doesn't get its fair share of CPU, but the effect is
random and therefore amortized (this isn't a realtime system, after all).

What can you do about this?

    (a) improve performance of the unpickler.  As you say, a 2x boost
        here will double the number of threads you can run in parallel
        in the time limits you have.

    (b) try to reduce your heap residency, which will reduce GC load
        which again means you can run more threads in parallel without
        hitting your limits.

    (b) try to manage your latency better, by limiting the number of
        threads that try to unpickle in parallel.  You may reduce
        the GC load this way, too.

Or failing that, just get faster hardware.  Or more CPUs, and use GHC's
new SMP support.

I'm surprised if your real application is this CPU-bound, though.  The
network communication latency should mean you can run a lot more
threads, provided you can improve that pickler so it isn't the bottleneck.

I don't have a lot of time to investigate the unpickling code in detail,
but I have worked on similar problems in the past and I know that the
unpickler in GHC is very fast, for example.  It is derived from the
original nhc98 interface, with tweaks by me to improve performance, and
later NewBinary was derived from it.  I haven't measured NewBinary's
performance relative to GHC's Binary library, but I don't expect there
to be much difference.

Cheers,
	Simon

module Main where

import System.IO
import System.Time
import System.Environment
import Control.Monad
import Control.Concurrent
import Control.Exception
import Foreign
import Pickle
import Endian
import Util
import ZLib
import Records
import Prelude hiding (read)
import Text.Printf

main = 
    do args <- getArgs
       process (head args) 100
       waitToFinish

{-# NOINLINE lock #-}
lock :: MVar ()
lock = unsafePerformIO $ newMVar ()

trace s = withMVar lock $ const $ putStrLn s

process _ 0 = return ()
process file n = 
    do h <- openBinaryFile file ReadMode
       forkChild $ read_ h 
       process file (n - 1)

read_ :: Handle -> IO ()
read_ h = 
    do cmd <- read h (\_ -> return ()) -- lots of ALERTs
       -- you should not get any alerts if you pass in trace
       -- below and comment the line above. the lock synch seems 
       -- to have a magical effect
       -- cmd <- read h trace
       eof <- hIsEOF h
       unless eof $ read_ h 

read :: Handle -> (String -> IO ()) -> IO Command
read h trace =
    do TOD time1 _ <- getClockTime 
       allocaBytes 4 $ \p1 ->
           do hGetBuf h p1 4
              TOD time2 _ <- getClockTime 
              (size', _) <- unpickle endian32 p1 0
              TOD time3 _ <- getClockTime 
              let size = fromIntegral $ size' - 4
              allocaBytes size $ \packet -> 
                  do TOD time4 _ <- getClockTime 
                     hGetBuf h packet size
                     TOD time5 _ <- getClockTime 
                     cmd <- unstuff packet 0 size
                     TOD time6 _ <- getClockTime 
                     trace $ "read: " ++ cmdDesc cmd ++ ": " 
                               ++ show (time6 - time1) ++ "s: "
                               ++ show (time2 - time1) ++ "s, "
                               ++ show (time3 - time2) ++ "s, "
                               ++ show (time4 - time3) ++ "s, "
                               ++ show (time5 - time4) ++ "s, "
                               ++ show (time6 - time5) ++ "s"
                     when (time6 - time5 > 3) $
                          fail $ "RED ALERT: time: " ++ show (time6 - time5) 
                                   ++ "s, size: " ++ show size' 
                                   ++ ", cmd: " ++ cmdDesc cmd
                     return $! cmd

psecdiff :: ClockTime -> ClockTime -> Integer
psecdiff (TOD secs1 psecs1) (TOD secs2 psecs2)
  = psecs2 - psecs1 + (secs2*10^12 - secs1*10^12)

unstuff :: Ptr Word8 -> Int -> Int -> IO Command
unstuff ptr ix size = 
    do t1@(TOD time1 _) <- getClockTime 
       (kind, ix1) <- unpickle puCmdType ptr ix
       t2@(TOD time2 _) <- getClockTime 
--       when (size > 40000) $ hPutStrLn stderr "unpickle start" 
       (cmd', _) <- unpickle (puCommand kind) ptr ix1
       t3@(TOD time3 _) <- getClockTime 
--       let d = psecdiff t1 t3
--      	   (secs,psecs) = d `quotRem` (10^12)
--       hPrintf stdout "size: %5d, time: %3d.%06d\n" size secs (psecs `quot` 10^6)
       when (time3 - time1 > 3) $
            fail $ "ORANGE ALERT: " ++ show (time2 - time1) 
                     ++ "s, " ++ show (time3 - time2) ++ "s, "
                     ++ cmdDesc cmd' ++ ", ix1: " ++ show ix1
                     ++ ", size: " ++ show size 
       case cmd' of 
         SrvCompressedCommands sz (bytes, ix, src_sz) -> 
             do TOD time1 _ <- getClockTime 
                let sz' = fromIntegral sz
                allocaBytes sz' $ \dest -> 
                    do n <- uncompress (plusPtr bytes ix) src_sz dest sz'
                       TOD time2 _ <- getClockTime 
                       when (time2 - time1 > 3) $
                            fail $ "YELLOW ALERT: time: " 
                                     ++ show (time2 - time1) 
                                     ++ "s, size: " ++ show sz ++ ", array: "
                                     ++ show bytes
                       cmd'' <- unstuff dest 4 n
                       return $! cmd''
         _ -> return cmd'

[Haskell-cafe] Re: Joels Time Leak

Simon Marlow