
Hello, I haven't followed everything that's happened on the Binary IO thread, but has anybody else actually tried Joel's code? See http://wagerlabs.com/timeleak.tgz I can reproduce the problem (ghc/Linux), but can't explain it. It seems very strange that frigging about with an otherwise irrelevant (AFAICT) MVar fixes the problem. Regards -- Adrian Hey

On Thu, Dec 29, 2005 at 01:02:45PM +0000, Adrian Hey wrote:
I haven't followed everything that's happened on the Binary IO thread, but has anybody else actually tried Joel's code? ..
I have and I am puzzled too.
I can reproduce the problem (ghc/Linux), but can't explain it. It seems very strange that frigging about with an otherwise irrelevant (AFAICT) MVar fixes the problem.
Could be something with GHC's thread scheduler (is it fair?). I have some ideas for workarounds, but I didn't get around to trying them. For example, I think that unpickling too many messages simultaneously puts too much pressure on memory, cache and GC. I don't really have the time to test my serialization libraries with Joel's code - mostly it's the Unicode that scares me off (should it?). Maybe I'll try to simplify things on my own. Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

Adrian, There's no mystery here. Threads take a while to unpickle the large server info packet when they gang up on it all together. By adding the MVar you are basically installing a door in front of the packet and only letting one thread come through at a time. The end result is that you are pushing the timeout somewhere else, as threads spend a lot of time queued up and waiting to get through the door. It takes a fraction of a second for a thread to unpickle the server info packet. It needs to make an FFI call to ZLib's uncompress, and that's a blocking call since it's marked unsafe. I think that part happens rather quickly, otherwise alerts would be showing up here. What takes time is unpickling the packet after it's uncompressed. Why does it take a fraction of a second for 1 thread to unpickle and several seconds per thread for several threads to do it at the same time? I think this is where the mystery lies. Joel On Dec 29, 2005, at 1:02 PM, Adrian Hey wrote:
I can reproduce the problem (ghc/Linux), but can't explain it. It seems very strange that frigging about with an otherwise irrelevant (AFAICT) MVar fixes the problem.
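For readers following the fix being discussed: the "door" amounts to serialising the unpickling step behind an MVar. A minimal sketch, assuming a hypothetical unpickle action (this is not the actual timeleak code):

    import Control.Concurrent.MVar

    -- At most one thread holds the door at a time; the others block in
    -- withMVar. The queueing time is spent waiting on the MVar rather
    -- than inside the timed unpickling section, which is why the
    -- timeout moves elsewhere instead of disappearing.
    serialised :: MVar () -> IO a -> IO a
    serialised door act = withMVar door (\_ -> act)

    -- Usage:
    --   door <- newMVar ()
    --   serialised door (unpickle packet)   -- unpickle is hypothetical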

On Thu, Dec 29, 2005 at 01:20:41PM +0000, Joel Reymont wrote:
Why does it take a fraction of a second for 1 thread to unpickle and several seconds per thread for several threads to do it at the same time? I think this is where the mystery lies.
Have you considered any of this:
- too much memory pressure: more memory means more frequent and more expensive GCs, and 1000 threads using so much memory means bad cache performance
- a deficiency of GHC's thread scheduler: giving one thread too much time steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

On Dec 29, 2005, at 1:23 PM, Tomasz Zielonka wrote:
- a deficiency of GHC's thread scheduler: giving one thread too much time steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
I would finger the scheduler, at least partially. There's no magic in this world, and my Erlang version does not fare much better. Erlang too uses a GC, and when a lot of threads load all that data into memory, things are bound to get nasty. The results for Erlang are more uniform, though: all the delays are within 3-4s if I have 3s as the cut-off time, whereas with GHC the delays are all over the place. Joel -- http://wagerlabs.com/

Tomasz Zielonka wrote:
On Thu, Dec 29, 2005 at 01:20:41PM +0000, Joel Reymont wrote:
Why does it take a fraction of a second for 1 thread to unpickle and several seconds per thread for several threads to do it at the same time? I think this is where the mystery lies.
Have you considered any of this:
- too much memory pressure: more memory means more frequent and more expensive GCs, and 1000 threads using so much memory means bad cache performance
- a deficiency of GHC's thread scheduler: giving one thread too much time steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
I don't think there's anything really strange going on here. The default context switch interval in GHC is 0.02 seconds, measured in CPU time by default. GHC's scheduler is strictly round-robin, so with 100 threads in the system it can be 2 seconds between a thread being descheduled and scheduled again. I measured the time taken to unpickle those large 50k packets as 0.3 seconds on my amd64 box (program compiled *without* optimisation), so the thread can get descheduled twice while unpickling a large packet, giving a >4s delay with 100 threads running. The actual context switch interval often seems to be larger than 0.2 seconds; I'm not sure exactly why - it might be due to delays in the OS delivering the signal. This does mean that the timeleak program reports alerts for as few as 50 threads, though. Cheers, Simon
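Spelling out the arithmetic in the figures above (illustrative only):

    -- With 100 threads and a 0.02s timeslice, a descheduled thread may
    -- wait for every other thread to run before it is scheduled again.
    worstWait :: Double
    worstWait = 100 * 0.02            -- 2.0 seconds

    -- Unpickling costs ~0.3s of CPU, so a thread can be descheduled
    -- twice mid-unpickle, paying the worst-case wait each time.
    unpickleDelay :: Double
    unpickleDelay = 2 * worstWait + 0.3   -- > 4 seconds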

On Jan 3, 2006, at 2:30 PM, Simon Marlow wrote:
The default context switch interval in GHC is 0.02 seconds, measured in CPU time by default. GHC's scheduler is strictly round-robin, so with 100 threads in the system it can be 2 seconds between a thread being descheduled and scheduled again.
I measured the time taken to unpickle those large 50k packets as 0.3 seconds on my amd64 box (program compiled *without* optimisation), so the thread can get descheduled twice while unpickling a large packet, giving a >4s delay with 100 threads running.
Is it impractical then to implement this type of app in Haskell? Based on the nature of Haskell scheduling I would be inclined to say yes. I'm including information on the Erlang scheduler below.

I think it's possible to emulate the workings of the Erlang scheduler in Haskell by using delimited continuations a la the Zipper File Server/OS. A single delimited continuation (a request in Zipper FS parlance?) would be a scheduling unit, and a programmer could then tune the "scheduler" to their heart's content. Apart from putting a lot of burden on the programmer, this becomes quite troublesome when multiple sockets or file descriptors are concerned. There's no easy way to plug into the select facility of the Haskell runtime to receive notifications of available input. You will notice the Zipper FS spending quite a few lines of code to roll its own select facility.

The Erlang scheduler is based on reduction counts, where one reduction is roughly equivalent to a function call. See http://www.erlang.org/ml-archive/erlang-questions/200104/msg00072.html for more detail. There's also this helpful bit of information:

--
erlang:bump_reductions(Reductions) -> void()

Types: Reductions = int()

This implementation-dependent function increments the reduction counter for the calling process. In the Beam emulator, the reduction counter is normally incremented by one for each function and BIF call, and a context switch is forced when the counter reaches 1000.
--

Regarding the issue of why a logger process in Erlang does not get overwhelmed, this is the reply I got from Raimo Niskanen (Erlang team at Ericsson): There is a small fix in the scheduler for the standard producer/consumer problem: a process that sends to a receiver with a large receive queue gets punished with a large reduction (number of function calls) count for the send operation, and will therefore get smaller scheduling slots.

Thanks, Joel -- http://wagerlabs.com/
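Short of delimited continuations, a crude way to approximate Erlang's reduction counting in GHC is to have each thread yield voluntarily once it has done a budgeted amount of work. A sketch under that assumption (the helper name and the budget of 1000 are made up to mirror the Beam description above):

    import Control.Concurrent (yield)
    import Control.Monad (when)
    import Data.IORef

    -- Each thread keeps its own counter and bumps it as it works,
    -- yielding once the budget is spent - roughly what Beam does
    -- every 1000 reductions.
    bumpReductions :: IORef Int -> Int -> IO ()
    bumpReductions counter cost = do
      n <- atomicModifyIORef counter (\m -> (m + cost, m + cost))
      when (n >= 1000) $ do
        writeIORef counter 0
        yield                       -- give other threads a turn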

On 1/3/06, Simon Marlow wrote:
Tomasz Zielonka wrote:
On Thu, Dec 29, 2005 at 01:20:41PM +0000, Joel Reymont wrote:
Why does it take a fraction of a second for 1 thread to unpickle and several seconds per thread for several threads to do it at the same time? I think this is where the mystery lies.
Have you considered any of this:
- too much memory pressure: more memory means more frequent and more expensive GCs, and 1000 threads using so much memory means bad cache performance
- a deficiency of GHC's thread scheduler: giving one thread too much time steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
I don't think there's anything really strange going on here.
The default context switch interval in GHC is 0.02 seconds, measured in CPU time by default. GHC's scheduler is strictly round-robin, so with 100 threads in the system it can be 2 seconds between a thread being descheduled and scheduled again.
According to this: http://www.haskell.org/ghc/docs/latest/html/users_guide/sec-using-parallel.h... the minimum time between context switches is 20 milliseconds. Is there any good reason why 0.02 seconds is the best that you can get here? Couldn't GHC's internal timer tick at a _much_ faster rate (like 50-100µs or so)? Apart from meaning big trouble for applications with a large number of threads (such as Joel's), it'll also make life difficult for any sort of real-time application. For instance, if you want to use HOpenGL to render a simulation engine and you split it up into tons of concurrent processes (say, one for each dynamic entity in the engine), the 20ms granularity would make it quite hard to achieve 60 frames per second. /S -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862
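For what it's worth, 0.02s is the default rather than a hard floor: the RTS accepts a -C option to set the context switch interval. An illustrative invocation (file names made up; how small an interval the timer actually honours is a separate question):

    $ ghc --make timeleak.hs
    $ ./timeleak +RTS -C0.002 -RTS    # ask for a 2ms timeslice instead of 20ms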

General follow-up questions: Would adding Control.Concurrent.yield calls cause a context switch more often than every 0.02 seconds? Is there any command in GHC that allows a thread to prevent itself from being rescheduled while computing something? Another comment: between thousands of threads and writing a custom continuation-based scheduler, what about using a thread pool? Does anyone have a library with a "fork-IO-Pool" command? (A sketch of the idea follows the quoted text below.) -- Chris

Simon Marlow wrote:
Tomasz Zielonka wrote:
On Thu, Dec 29, 2005 at 01:20:41PM +0000, Joel Reymont wrote:
Why does it take a fraction of a second for 1 thread to unpickle and several seconds per thread for several threads to do it at the same time? I think this is where the mystery lies.
Have you considered any of this:
- too much memory pressure: more memory means more frequent and more expensive GCs, and 1000 threads using so much memory means bad cache performance
- a deficiency of GHC's thread scheduler: giving one thread too much time steals it from the others (Simons, don't get angry at me - I am probably wrong here ;-)
I don't think there's anything really strange going on here.
The default context switch interval in GHC is 0.02 seconds, measured in CPU time by default. GHC's scheduler is strictly round-robin, so with 100 threads in the system it can be 2 seconds between a thread being descheduled and scheduled again.
I measured the time taken to unpickle those large 50k packets as 0.3 seconds on my amd64 box (program compiled *without* optimisation), so the thread can get descheduled twice while unpickling a large packet, giving a >4s delay with 100 threads running.
The actual context switch interval often seems to be larger than 0.2 seconds; I'm not sure exactly why - it might be due to delays in the OS delivering the signal. This does mean that the timeleak program reports alerts for as few as 50 threads, though.
Cheers, Simon
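As for the "fork-IO-Pool" question above, a minimal sketch of such a pool (hypothetical API, not an existing library): n worker threads pull jobs from a shared Chan, so at most n unpicklings run at once however many connections are live.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan
    import Control.Monad (forever, join, replicateM_)

    -- Returns a submit function; jobs queue up in the Chan until one
    -- of the n workers is free to run them.
    newPool :: Int -> IO (IO () -> IO ())
    newPool n = do
      jobs <- newChan
      replicateM_ n $ forkIO $ forever $ join (readChan jobs)
      return (writeChan jobs)

    -- Usage:
    --   submit <- newPool 8
    --   submit (handlePacket packet)   -- handlePacket is hypothetical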

On Tue, Jan 03, 2006 at 02:30:53PM +0000, Simon Marlow wrote:
I measured the time taken to unpickle those large 50k packets as 0.3 seconds on my amd64 box (program compiled *without* optimisation), so the thread can get descheduled twice while unpickling a large packet, giving a >4s delay with 100 threads running.
I might have made an error when counting the packets. I simply placed a putStrLn in "read", but some of the packets are nested (SrvCompressedCommands), so "read" is called more than once for a top-level packet. Best regards Tomasz -- I am searching for programmers who are good at least in (Haskell || ML) && (Linux || FreeBSD || math) for work in Warsaw, Poland
participants (6)
- Adrian Hey
- Chris Kuklewicz
- Joel Reymont
- Sebastian Sylvan
- Simon Marlow
- Tomasz Zielonka