RE: [Haskell-cafe] Re: Joels Time Leak

On 03 January 2006 15:37, Sebastian Sylvan wrote:
On 1/3/06, Simon Marlow wrote:
Tomasz Zielonka wrote:
On Thu, Dec 29, 2005 at 01:20:41PM +0000, Joel Reymont wrote:
Why does it take a fraction of a second for 1 thread to unpickle, and several seconds per thread when several threads do it at the same time? I think this is where the mystery lies.
Have you considered any of this:
- too much memory pressure: more memory means more frequent and more expensive GCs, and 1000 threads using so much memory means bad cache performance
- a deficiency of GHC's thread scheduler - giving too much time to one thread steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
I don't think there's anything really strange going on here.
The default context switch interval in GHC is 0.02 seconds, measured in CPU time by default. GHC's scheduler is strictly round-robin, so with 100 threads in the system it can be 2 seconds between a thread being descheduled and scheduled again.
According to this: http://www.haskell.org/ghc/docs/latest/html/users_guide/sec-using-parallel.h...
The minimum time between context switches is 20 milliseconds.
Is there any good reason why 0.02 seconds is the best that you can get here? Couldn't GHC's internal timer tick at a _much_ faster rate (like 50-100µs or so)?
Sure, there's no reason why we couldn't do this. Of course, even idle Haskell processes will be ticking away in the background, so there's a reason not to make the interval too short. What do you think is reasonable?
Apart from meaning big trouble for applications with a large number of threads (such as Joel's), it'll also make life difficult for any sort of real-time application. For instance, if you want to use HOpenGL to render a simulation engine and you split it up into tons of concurrent processes (say one for each dynamic entity in the engine), the 20ms granularity would make it quite hard to achieve 60 frames per second in that case (a 60 fps frame is roughly 16ms, i.e. less than a single tick)...
The reason things are the way they are is that a large number of *running* threads is not a workload we've optimised for. In fact, Joel's program is the first one I've seen with a lot of running threads, apart from our testsuite. And I suspect that when Joel uses a better binary I/O implementation a lot of that CPU usage will disappear.

Cheers,
Simon
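(An illustrative sketch, not from the original thread: a small program to observe the round-robin behaviour Simon describes. The thread count, the busy loop and the timings are arbitrary.)

module Main where

import Control.Concurrent
import Control.Monad
import System.Time

-- A busy loop that allocates, so the RTS can switch it out at a heap-block
-- allocation once its timeslice expires.
spin :: Integer -> IO ()
spin n = spin $! n + 1

main :: IO ()
main = do
    replicateM_ 100 $ forkIO $ spin 0      -- 100 running competitors
    replicateM_ 5 $ do
        TOD s1 p1 <- getClockTime
        yield                              -- go to the back of the run queue
        TOD s2 p2 <- getClockTime
        let ms = ((s2 - s1) * 10^12 + (p2 - p1)) `div` 10^9
        putStrLn $ "waited about " ++ show ms ++ " ms for the next timeslice"

With the default 20ms timeslice and 100 running threads, the reported delay should be in the region of 2 seconds, which is the arithmetic behind the figure quoted above.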

On 1/3/06, Simon Marlow wrote:
On 03 January 2006 15:37, Sebastian Sylvan wrote:
On 1/3/06, Simon Marlow wrote:
Tomasz Zielonka wrote:
On Thu, Dec 29, 2005 at 01:20:41PM +0000, Joel Reymont wrote:
Why does it take a fraction of a second for 1 thread to unpickle, and several seconds per thread when several threads do it at the same time? I think this is where the mystery lies.
Have you considered any of this:
- too much memory pressure: more memory means more frequent and more expensive GCs, and 1000 threads using so much memory means bad cache performance
- a deficiency of GHC's thread scheduler - giving too much time to one thread steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
I don't think there's anything really strange going on here.
The default context switch interval in GHC is 0.02 seconds, measured in CPU time by default. GHC's scheduler is strictly round-robin, so with 100 threads in the system it can be 2 seconds between a thread being descheduled and scheduled again.
According to this: http://www.haskell.org/ghc/docs/latest/html/users_guide/sec-using-parallel.h...
The minimum time between context switches is 20 milliseconds.
Is there any good reason why 0.02 seconds is the best that you can get here? Couldn't GHC's internal timer tick at a _much_ faster rate (like 50-100µs or so)?
Sure, there's no reason why we couldn't do this. Of course, even idle Haskell processes will be ticking away in the background, so there's a reason not to make the interval too short. What do you think is reasonable?
Not sure. Could it be configurable via a command-line flag? If the profiler could report the % of time spent doing context switches (or maybe it already does?), the user could fine-tune this to his liking.

For the (hypothetical) real-time simulation app I would *guess* that something along the lines of 500µs would be more than enough not to introduce any unnecessary lag in rendering (seeing as the target frame time would be around 15ms, and you'd want a good number of context switches to allow some of the next frame to be computed in parallel with all the render-surface optimizations etc. for the current frame). But then again, there may be other apps which need it to be even lower, so a command-line flag sure would be nice.

/S

--
Sebastian Sylvan
+46(0)736-818655
UIN: 44640862

Simon,

I don't think CPU usage is the issue. An individual thread will take a fraction of a second to deserialize a large packet. The issue is that, as you pointed out, you can get alerts even with 50 threads. Those fractions of a second add up in a certain way that's detrimental to the performance of the app.

The timeleak code uses Ptr Word8 to pickle which should be very efficient. I believe the delay comes from the way 'sequ' is compiled by GHC. I'll take the liberty of quoting Andrew Kennedy (your colleague from MS Research) who wrote the picklers:

-- My original pickler implementation was for SML. It was used in the MLj compiler, and is still used in the SML.NET compiler, and has acceptable performance (few ms pickling/unpickling for typical intermediate language object files). I must admit that I've not used the Haskell variant in anger. Apart from the inherent slowdown associated with laziness, is there a particular reason for poor performance? --

'sequ' by itself does not seem like a big deal, but when used to model records it builds a large nested lambda-list and I don't think that list is being compiled efficiently. I would appreciate it if you could look at that and issue a verdict, now that Andrew confirms using the picklers in a real-life environment and w/o major problems.

Suppose I chose a different implementation of binary IO and disposed of the pickler combinators. Suppose I gained a 2x speed-up by doing so. I would now be getting alerts with 100 threads instead of 50, no? That's still far from ideal.

Joel

On Jan 3, 2006, at 4:43 PM, Simon Marlow wrote:
The reason things are the way they are is that a large number of *running* threads is not a workload we've optimised for. In fact, Joel's program is the first one I've seen with a lot of running threads, apart from our testsuite. And I suspect that when Joel uses a better binary I/O implementation a lot of that CPU usage will disappear.

Joel,

In most cases it just doesn't make sense to run 1000 threads simultaneously that are all bottlenecked on the same resource (e.g. CPU/memory). See e.g. http://www.eecs.harvard.edu/~mdw/proj/seda/

You should be grouping incoming events into queues by expected workload/event. Then you can give the client fairly reliable information about how long it will have to wait, based on the size of the queue on which its event is waiting. And if you have no way to differentiate between event workloads a priori, then you really can't be giving clients response guarantees and need to rethink your business logic.

FYI: I actually created a Haskell application server based on this logic called HAppS (see http://happs.org) and am in the process of getting binaryIO added to it.

-Alex-

______________________________________________________________
S. Alexander Jacobson                  tel:917-770-6565
                                       http://alexjacobson.com

On Tue, 3 Jan 2006, Joel Reymont wrote:
Simon,
I don't think CPU usage is the issue. An individual thread will take a fraction of a second to deserialize a large packet. The issue is that, as you pointed out, you can get alerts even with 50 threads. Those fractions of a second add up in a certain way that's detrimental to the performance of the app.
The timeleak code uses Ptr Word8 to pickle which should be very efficient. I believe the delay comes from the way 'sequ' is compiled by GHC. I'll take the liberty of quoting Andrew Kennedy (your colleague from MS Research) who wrote the picklers:
-- My original pickler implementation was for SML. It was used in the MLj compiler, and is still used in the SML.NET compiler, and has acceptable performance (few ms pickling/unpickling for typical intermediate language object files). I must admit that I've not used the Haskell variant in anger. Apart from the inherent slowdown associated with laziness, is there a particular reason for poor performance? --
'sequ' by itself does not seem like a big deal, but when used to model records it builds a large nested lambda-list and I don't think that list is being compiled efficiently. I would appreciate it if you could look at that and issue a verdict, now that Andrew confirms using the picklers in a real-life environment and w/o major problems.
Suppose I chose a different implementation of binary IO and disposed of pickler combinators. Suppose I gained a 2x speed-up by doing so. I would now be getting alerts with 100 threads instead of 50, no? That's still far from ideal.
Joel
On Jan 3, 2006, at 4:43 PM, Simon Marlow wrote:
The reason things are the way they are is that a large number of *running* threads is not a workload we've optimised for. In fact, Joel's program is the first one I've seen with a lot of running threads, apart from our testsuite. And I suspect that when Joel uses a better binary I/O implementation a lot of that CPU usage will disappear.
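(A minimal sketch, not part of the original thread, of the queue-per-workload idea Alex describes above: the event types, the two-queue split and the handlers are all hypothetical.)

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forever)

-- Stand-ins for real protocol events: a bet request is latency-sensitive,
-- a table update can wait.
data Event = BetRequest | TableUpdate
    deriving Show

-- Route each incoming event to the queue for its expected workload class.
dispatch :: Chan Event -> Chan Event -> Event -> IO ()
dispatch fastQ slowQ ev =
    case ev of
        BetRequest  -> writeChan fastQ ev
        TableUpdate -> writeChan slowQ ev

-- Each queue is drained by its own worker, so slow events cannot delay
-- fast ones.
worker :: String -> Chan Event -> IO ()
worker name q = forever $ do
    ev <- readChan q
    putStrLn (name ++ ": handling " ++ show ev)

main :: IO ()
main = do
    fastQ <- newChan
    slowQ <- newChan
    _ <- forkIO (worker "fast" fastQ)
    _ <- forkIO (worker "slow" slowQ)
    mapM_ (dispatch fastQ slowQ) [BetRequest, TableUpdate, BetRequest]
    threadDelay 100000  -- give the workers a moment before main exits

A real server would additionally track each queue's length (for example with a counter in an MVar) in order to report expected wait times to clients, as Alex suggests.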

The timeleak code is just a repro case. In real life I'm reading from sockets as opposed to a file. All I'm trying to do is run poker bots. They talk to the server and play poker. Of course some events are more important than others: a request to make a bet is more important than, say, a table update.

I do need to run as many poker bots as I can. I think that my customer's goal of 4,000 bots is unattainable in a single app, though it's probably possible per machine.

Overall, I find this too complex to manage with Haskell, as there are many factors that can contribute to my delays and timeouts. There are also quite a few unanswered questions at the moment (why is 'sequ' slow? does the scheduler need to be tuned?) that leave me scratching my head.

On Jan 3, 2006, at 9:17 PM, S. Alexander Jacobson wrote:
You should be grouping incoming events into queues by expected workload/event. Then you can give the client fairly reliable information about how long it will have to wait, based on the size of the queue on which its event is waiting.
And if you have no way to differentiate between event workloads a priori then you really can't be giving clients response guarantees and need to rethink your business logic.

Hello Joel,

Wednesday, January 04, 2006, 12:42:24 AM, you wrote:

JR> contribute to my delays and timeouts. There are also quite a few
JR> unanswered questions at the moment (why is 'sequ' slow?

Have you tried inlining it, and all the other pickling combinators? The problem is that when you write

put (Cmd a b) = do putWord16 a; putWord32 b

and inline putWord16/putWord32, you can be sure that you get the sequencing for free. But what is a pickling combinator? It is a higher-order function which combines drivers for simple types like Word16 into a final driver that can read an entire Command. The crucial question is whether this final driver will be interpreted, i.e. executed as a large number of nested calls to the pickler-combination functions, or compiled, i.e. executed as a simple sequence of getByte calls which then builds the final value. In the first case your program will spend all its time in these combinator calls, so you will not gain much from using Ptrs in the elementary picklers.

Because of this issue I told you that pickler combinators can't guarantee performance, in contrast to the Binary package, which uses an artless approach to combining individual "picklers" (separate get and put functions), so that each one becomes a straightforward imperative program rather than a tuple carrying several functional values.

JR> does the scheduler need to be tuned?)

Try something like this:

forever
    recvPacket
    withMVar global
        unzip
        unpickle
        runScript
        sendAnswer
    yield

It will ensure that commands are processed sequentially; I think that's the best you can do in this program: all the tasks inside the lock are CPU-bound, so it is better to finish them in one thread before going on to another. In production code you will also need to guard the whole withMVar block with a small timeout (say, 0.02s).

--
Best regards,
Bulat                          mailto:bulatz@HotPOP.com
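(A minimal sketch, not part of the original thread, of the "artless" direct style Bulat describes: Cmd, its field layout and the host byte order are invented for illustration, and the real Binary/NewBinary APIs differ.)

import Foreign (Ptr, Word8, Word16, Word32, pokeByteOff, peekByteOff)

data Cmd = Cmd !Word16 !Word32

-- Primitive picklers: write a field at the given byte offset and return the
-- offset just past it.
putWord16 :: Ptr Word8 -> Int -> Word16 -> IO Int
putWord16 p ix w = pokeByteOff p ix w >> return (ix + 2)

putWord32 :: Ptr Word8 -> Int -> Word32 -> IO Int
putWord32 p ix w = pokeByteOff p ix w >> return (ix + 4)

-- Straight-line put: once the primitives are inlined there is no combinator
-- structure left to interpret at run time, just a sequence of pokes.
putCmd :: Ptr Word8 -> Int -> Cmd -> IO Int
putCmd p ix (Cmd a b) = do
    ix' <- putWord16 p ix a
    putWord32 p ix' b

-- The matching get, again just a sequence of peeks.
getCmd :: Ptr Word8 -> Int -> IO (Cmd, Int)
getCmd p ix = do
    a <- peekByteOff p ix
    b <- peekByteOff p (ix + 2)
    return (Cmd a b, ix + 6)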

This is my latest version. Based on Don's tweaks.

{-# INLINE sequ #-}
sequ :: (b -> a) -> PU a -> (a -> PU b) -> PU b
sequ a b c | a `seq` b `seq` c `seq` False = undefined
sequ f pa k = PU fn1 fn2 fn3
    where
      {-# INLINE fn1 #-}
      fn1 ptr b =
          case f b of
            a -> case k a of
                   pb -> do
                     ptr' <- appP pa ptr a
                     appP pb ptr' b
      {-# INLINE fn2 #-}
      fn2 ptr = do
          (a, ptr') <- appU pa ptr
          case k a of
            pb -> appU pb ptr'
      {-# INLINE fn3 #-}
      fn3 b =
          case f b of
            a -> case k a of
                   pb -> do
                     sz1 <- appS pa a
                     sz2 <- appS pb b
                     return $! sz1 + sz2

On Jan 4, 2006, at 4:18 PM, Bulat Ziganshin wrote:
Have you tried inlining it, and all the other pickling combinators?
The problem is that when you write
put (Cmd a b) = do putWord16 a; putWord32 b
and inline putWord16/putWord32, you can be sure that you get the sequencing for free. But what is a pickling combinator? It is a higher-order function which combines drivers for simple types like Word16 into a final driver that can read an entire Command. The crucial question is whether this final driver will be interpreted, i.e. executed as a large number of nested calls to the pickler-combination functions, or compiled, i.e. executed as a simple sequence of getByte calls which then builds the final value. In the first case your program will spend all its time in these combinator calls, so you will not gain much from using Ptrs in the elementary picklers.
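(For contrast with the direct style sketched earlier, a cut-down illustration, not part of the original thread, of how sequ-style combinators model a record. This PU carries only pickle and unpickle functions and uses host byte order; the real timeleak PU also carries a size function and differs in detail.)

import Foreign (Ptr, Word8, Word16, Word32, pokeByteOff, peekByteOff)

-- A cut-down pickler: pickle a value at an offset returning the new offset,
-- or unpickle at an offset returning the value and the new offset.
data PU a = PU
    { appP :: Ptr Word8 -> Int -> a -> IO Int
    , appU :: Ptr Word8 -> Int -> IO (a, Int)
    }

word16 :: PU Word16
word16 = PU (\p ix w -> pokeByteOff p ix w >> return (ix + 2))
            (\p ix -> do { w <- peekByteOff p ix; return (w, ix + 2) })

word32 :: PU Word32
word32 = PU (\p ix w -> pokeByteOff p ix w >> return (ix + 4))
            (\p ix -> do { w <- peekByteOff p ix; return (w, ix + 4) })

lift :: a -> PU a
lift x = PU (\_ ix _ -> return ix) (\_ ix -> return (x, ix))

-- sequ in the style of Kennedy's combinators: project one field, pickle it,
-- then continue with a pickler that may depend on the field's value.
sequ :: (b -> a) -> PU a -> (a -> PU b) -> PU b
sequ f pa k = PU pick unpick
  where
    pick p ix b =
        let a = f b
        in do { ix' <- appP pa p ix a; appP (k a) p ix' b }
    unpick p ix =
        do { (a, ix') <- appU pa p ix; appU (k a) p ix' }

-- A record becomes nested sequ calls; each field adds another layer of
-- closures, which is what Joel suspects GHC is not compiling away.
data Cmd = Cmd { cmdA :: Word16, cmdB :: Word32 }

puCmd :: PU Cmd
puCmd = sequ cmdA word16 $ \a ->
        sequ cmdB word32 $ \b ->
        lift (Cmd a b)

Each extra field adds another sequ layer, so unless GHC inlines the whole chain (which is what the INLINE pragmas above try to force), unpickling a record walks a chain of closures instead of running a flat sequence of peeks.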

Joel Reymont wrote:
I don't think CPU usage is the issue. An individual thread will take a fraction of a second to deserialize a large packet.
It's a combination of CPU usage by the pickler and GC load. Those 50k packets take 0.03 seconds to unpickle (a version of unstuff.hs to measure that is attached). With 100 threads running, even with a completely fair scheduler the time taken for one thread to unpickle that packet is going to be 3 seconds. If optimisation is turned on, the time to unpickle that packet goes down to 0.007 seconds (on my machine). So, it should be more like 400 threads before a fair scheduler would run into problems, but that doesn't take into account GC load, which increases with more threads running, so in fact you still run into trouble with 50 threads. You can reduce the GC load using RTS options, e.g. +RTS -H256m, and that will reduce the number of timeouts you get.

GHC's scheduler may not be completely fair, but I haven't found anything gratuitous in my investigations. Sometimes the time between context switches is more like 0.04 seconds instead of 0.02 seconds, and I still don't understand exactly why, but that's not a serious issue. Sometimes a thread is unlucky enough to have to do a major GC during its timeslice, so it doesn't get its fair share of CPU, but the effect is random and therefore amortized (this isn't a realtime system, after all).

What can you do about this?

(a) improve performance of the unpickler. As you say, a 2x boost here will double the number of threads you can run in parallel in the time limits you have.

(b) try to reduce your heap residency, which will reduce GC load, which again means you can run more threads in parallel without hitting your limits.

(c) try to manage your latency better, by limiting the number of threads that try to unpickle in parallel. You may reduce the GC load this way, too.

Or failing that, just get faster hardware. Or more CPUs, and use GHC's new SMP support.

I'm surprised if your real application is this CPU-bound, though. The network communication latency should mean you can run a lot more threads, provided you can improve that pickler so it isn't the bottleneck.

I don't have a lot of time to investigate the unpickling code in detail, but I have worked on similar problems in the past and I know that the unpickler in GHC is very fast, for example. It is derived from the original nhc98 interface, with tweaks by me to improve performance, and later NewBinary was derived from it. I haven't measured NewBinary's performance relative to GHC's Binary library, but I don't expect there to be much difference.

Cheers,
Simon

module Main where

import System.IO
import System.Time
import System.Environment
import Control.Monad
import Control.Concurrent
import Control.Exception
import Foreign
import Pickle
import Endian
import Util
import ZLib
import Records
import Prelude hiding (read)
import Text.Printf

main = do
    args <- getArgs
    process (head args) 100
    waitToFinish

{-# NOINLINE lock #-}
lock :: MVar ()
lock = unsafePerformIO $ newMVar ()

trace s = withMVar lock $ const $ putStrLn s

process _ 0 = return ()
process file n = do
    h <- openBinaryFile file ReadMode
    forkChild $ read_ h
    process file (n - 1)

read_ :: Handle -> IO ()
read_ h = do
    cmd <- read h (\_ -> return ()) -- lots of ALERTs
    -- you should not get any alerts if you pass in trace
    -- below and comment the line above. the lock synch seems
    -- to have a magical effect
    -- cmd <- read h trace
    eof <- hIsEOF h
    unless eof $ read_ h

read :: Handle -> (String -> IO ()) -> IO Command
read h trace = do
    TOD time1 _ <- getClockTime
    allocaBytes 4 $ \p1 -> do
        hGetBuf h p1 4
        TOD time2 _ <- getClockTime
        (size', _) <- unpickle endian32 p1 0
        TOD time3 _ <- getClockTime
        let size = fromIntegral $ size' - 4
        allocaBytes size $ \packet -> do
            TOD time4 _ <- getClockTime
            hGetBuf h packet size
            TOD time5 _ <- getClockTime
            cmd <- unstuff packet 0 size
            TOD time6 _ <- getClockTime
            trace $ "read: " ++ cmdDesc cmd ++ ": "
                ++ show (time6 - time1) ++ "s: "
                ++ show (time2 - time1) ++ "s, "
                ++ show (time3 - time2) ++ "s, "
                ++ show (time4 - time3) ++ "s, "
                ++ show (time5 - time4) ++ "s, "
                ++ show (time6 - time5) ++ "s"
            when (time6 - time5 > 3) $
                fail $ "RED ALERT: time: " ++ show (time6 - time5)
                    ++ "s, size: " ++ show size'
                    ++ ", cmd: " ++ cmdDesc cmd
            return $! cmd

psecdiff :: ClockTime -> ClockTime -> Integer
psecdiff (TOD secs1 psecs1) (TOD secs2 psecs2) =
    psecs2 - psecs1 + (secs2*10^12 - secs1*10^12)

unstuff :: Ptr Word8 -> Int -> Int -> IO Command
unstuff ptr ix size = do
    t1@(TOD time1 _) <- getClockTime
    (kind, ix1) <- unpickle puCmdType ptr ix
    t2@(TOD time2 _) <- getClockTime
    -- when (size > 40000) $ hPutStrLn stderr "unpickle start"
    (cmd', _) <- unpickle (puCommand kind) ptr ix1
    t3@(TOD time3 _) <- getClockTime
    -- let d = psecdiff t1 t3
    --     (secs,psecs) = d `quotRem` (10^12)
    -- hPrintf stdout "size: %5d, time: %3d.%06d\n" size secs (psecs `quot` 10^6)
    when (time3 - time1 > 3) $
        fail $ "ORANGE ALERT: " ++ show (time2 - time1)
            ++ "s, " ++ show (time3 - time2)
            ++ "s, " ++ cmdDesc cmd'
            ++ ", ix1: " ++ show ix1
            ++ ", size: " ++ show size
    case cmd' of
        SrvCompressedCommands sz (bytes, ix, src_sz) -> do
            TOD time1 _ <- getClockTime
            let sz' = fromIntegral sz
            allocaBytes sz' $ \dest -> do
                n <- uncompress (plusPtr bytes ix) src_sz dest sz'
                TOD time2 _ <- getClockTime
                when (time2 - time1 > 3) $
                    fail $ "YELLOW ALERT: time: " ++ show (time2 - time1)
                        ++ "s, size: " ++ show sz
                        ++ ", array: " ++ show bytes
                cmd'' <- unstuff dest 4 n
                return $! cmd''
        _ -> return cmd'
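(A possible way to build and run the attached program with the RTS suggestion above; the file names are hypothetical, and the imported modules such as Pickle, Endian, Util, ZLib and Records come from Joel's own code.)

ghc -O --make unstuff.hs -o unstuff
./unstuff packets.dat +RTS -H256m -RTS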

Hello Simon, Tuesday, January 03, 2006, 7:43:21 PM, you wrote:
The minimum time between context switches is 20 milliseconds.
Is there any good reason why 0.02 seconds is the best that you can get here? Couldn't GHC's internal timer tick at a _much_ faster rate (like 50-100µs or so)?
SM> Sure, there's no reason why we couldn't do this. Of course, even
SM> idle Haskell processes will be ticking away in the background, so
SM> there's a reason not to make the interval too short. What do
SM> you think is reasonable?

Simon, the talk is about changing the GHC _tick_, which is the _minimal_ possible context switch interval. So we want to decrease this tick while retaining the current 20 ms _default_ switch interval. This would make it possible to decrease the switch interval for programs that really need it, which is currently entirely impossible.

-C[<s>]: Sets the context switch interval to <s> seconds. A context switch will occur at the next heap block allocation after the timer expires (a heap block allocation occurs every 4k of allocation). With -C0 or -C, context switches will occur as often as possible (at every heap block allocation). By default, context switches occur every 20 milliseconds. Note that GHC's internal timer ticks every 20ms, and the context switch timer is always a multiple of this timer, so 20ms is the maximum granularity available for timed context switches.

--
Best regards,
Bulat                          mailto:bulatz@HotPOP.com
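(An illustration of the -C flag quoted above; the program and file names are hypothetical. As the quoted text says, 20ms is currently the finest granularity for timed switches, which is exactly the limitation Bulat wants lifted.)

./unstuff packets.dat +RTS -C0 -RTS      # context switch at every heap block allocation
./unstuff packets.dat +RTS -C0.02 -RTS   # an explicit 20 ms interval, the current effective minimum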

Bulat Ziganshin wrote:
Tuesday, January 03, 2006, 7:43:21 PM, you wrote:
The minimum time between context switches is 20 milliseconds.
Is there any good reason why 0.02 seconds is the best that you can get here? Couldn't GHC's internal timer tick at a _much_ faster rate (like 50-100µs or so)?
SM> Sure, there's no reason why we couldn't do this. Of course, even SM> idle Haskell processes will be ticking away in the background, so SM> there's a reason not to make the interval too short. What do SM> you think is reasonable?
Simon, the talk is about changing the GHC _tick_, which is the _minimal_ possible context switch interval. So we want to decrease this tick while retaining the current 20 ms _default_ switch interval. This would make it possible to decrease the switch interval for programs that really need it, which is currently entirely impossible.
Yes, I know. Cheers, Simon

participants (5)
- Bulat Ziganshin
- Joel Reymont
- S. Alexander Jacobson
- Sebastian Sylvan
- Simon Marlow