Killer pickler combinators (was Time leak)

The original paper is at http://research.microsoft.com/~akenn/fun/picklercombinators.pdf

My adaptation is at http://wagerlabs.com/timeleak.tgz. This is a full repro case, data included.

The issue is that with 1000 threads my serialization is taking a few seconds. Inserting a delay or passing in trace (see README and code) gets serialization time down to 0-1s, the way it should be.

What gives? Try it for yourself and please let me know!

Thanks, Joel

-- http://wagerlabs.com/

Using forkOS with -threaded seems to work, at least on linux and openbsd.
- Cale
On 20/12/05, Joel Reymont
The original paper is at http://research.microsoft.com/~akenn/fun/picklercombinators.pdf
My adaptation is at http://wagerlabs.com/timeleak.tgz. This is a full repro case, data included.
The issue is that with 1000 threads my serialization is taking a few seconds.
Inserting a delay or passing in trace (see README and code) gets serialization time down to 0-1s, the way it should be.
What gives? Try it for yourself and please let me know!
Thanks, Joel

Does not help me on Mac OS X Tiger.

    ./a.out trace.dat
    a.out: user error (ORANGE ALERT: 0s, 4s, SrvServerInfo, ix1: 6, size: 49722)
    a.out: internal error: scavenge_stack: weird activation record found on stack: 9
    Please report this as a bug to glasgow-haskell-bugs@haskell.org,
    or http://www.sourceforge.net/projects/ghc/

On Dec 21, 2005, at 2:41 AM, Cale Gibbard wrote:
Using forkOS with -threaded seems to work, at least on linux and openbsd.

About the only universal solution seems to be pacing the threads by passing trace to read. Even then I got one alert. Now, can someone explain why the lock eliminates the time leak?

On Dec 21, 2005, at 2:41 AM, Cale Gibbard wrote:
Using forkOS with -threaded seems to work, at least on linux and openbsd.

joelr1:
About the only universal solution seems to be pacing the threads by passing trace to read. Even then I got one alert. Now, can someone explain why the lock eliminates the time leak?
It looks like with the 1000s of threads that get run, the problem is just getting enough cpu time for each thread. All the solutions that appear to work involve some kind of blocking (on Handles, MVars, ffi calls), allowing the scheduler to switch.

So why don't we just increase the scheduler switch rate ourselves? Using +RTS -C -RTS we can convince the rts to switch on every alloc, and with 4000 threads running this seems to remove the timeout alerts, and give all threads enough time to get things done.

This can be baked into the binary using an rts hook, a small C file containing a C string that is added to the cmd line:

    char *ghc_rts_opts = "-C0";

Cheers, Don
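The same switch rate can also be requested at run time for a quick experiment, without recompiling with the hook. This assumes the binary is invoked as ./a.out trace.dat, as in the transcript above, and that it accepts RTS options on the command line:

    ./a.out trace.dat +RTS -C0 -RTS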

I still get timeouts with 5k threads. Not as often as with 1k before, though. On Dec 21, 2005, at 3:35 AM, Donald Bruce Stewart wrote:
It looks like with the 1000s of threads that get run, the problem is just getting enough cpu time for each thread. All the solutions that appear to work involve some kind of blocking (on Handles, MVars, ffi calls), allowing the scheduler to switch.
So why don't we just increase the scheduler switch rate ourselves?
Using +RTS -C -RTS we can convince the rts to switch on every alloc, and with 4000 threads running this seems to remove the timeout alerts, and give all threads enough time to get things done.

The other thing worth noting is that by inserting a lock with a thread delay we are fooling ourselves. While the individual pickling time goes down, the threads are slowed down overall. Assuming that an external source was waiting for the unpickled packet, _that_ source would get a timeout! -- http://wagerlabs.com/

Well, yes. I think that part of the problem you're having is that the
threads compete heavily, and some eventually get ignored for
longer than the timeouts. The reason why the (withMVar lock $ const $
putStrLn s) helped in the first place was not so much because of any
evaluation of s (we tested that assumption), but rather because during
the time that withMVar is active in one thread, any other threads
can't pass that same point. So a bunch of the threads which were
stealing too much of the time before, now end up blocking, which
reduces the competition so the other threads which were being ignored
by the scheduler can get some work done.
Your problem is partly in the fact that the pickler/unpickler is a bit
slow for the workload/machine you have it on, and partly in the fact
that it's hard to get the processes to advance smoothly. (The
pickling/unpickling doesn't really take very long in any one thread,
but if the thread isn't scheduled for a long time in the first
place...) A bit of locking is sort of what you want for that.
Unfortunately, the way we have it here still affords no guarantees, it
just serves as a mechanism to help keep things from getting too bad.
It's unclear to me that you really don't want any kind of locking. I
think that you might do quite well to decide which threads get to run
yourself at least to some extent. After all, you know more about their
priorities and how long they have to complete their tasks than the GHC
scheduler does.
- Cale
On 21/12/05, Joel Reymont
The other thing worth noting is that by inserting a lock with a thread delay we are fooling ourselves. While the individual pickling time goes down, the threads are slowed down overall. Assuming that an external source was waiting for the unpickled packet _that_ source would get a timeout!
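A hypothetical sketch of Cale's last suggestion above: deciding yourself which job runs next rather than leaving the choice to the GHC scheduler. All the names here (Deadline, Job, submit, worker) are made up for illustration and do not come from the timeleak code; a small, fixed pool of worker threads would replace the thousands of competing threads.

    import Control.Concurrent (threadDelay)
    import Control.Concurrent.MVar
    import Data.List (sortBy)
    import Data.Ord (comparing)

    type Deadline = Integer     -- e.g. seconds until the server sits the bot out
    type Job      = IO ()       -- one pickling/unpickling + response action

    -- A shared work queue; jobs carry their deadlines.
    newQueue :: IO (MVar [(Deadline, Job)])
    newQueue = newMVar []

    submit :: MVar [(Deadline, Job)] -> Deadline -> Job -> IO ()
    submit q d j = modifyMVar_ q (return . ((d, j) :))

    -- A worker thread repeatedly takes the most urgent job and runs it.
    worker :: MVar [(Deadline, Job)] -> IO ()
    worker q = do
        next <- modifyMVar q $ \jobs ->
            case sortBy (comparing fst) jobs of
                []              -> return ([], Nothing)
                ((_, j) : rest) -> return (rest, Just j)
        case next of
            Nothing -> threadDelay 1000 >> worker q
            Just j  -> j >> worker q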

I don't want any kind of locking, true. I need all bots to respond in time otherwise the poker server will sit them out. Eliminating the timeout on pickling does not eliminate the timeout overall, it just passes it to a different place. One thread will go through serialization quickly but it will be too late by the time it sends a response to the server since it waited a few seconds for the chance to have a go at serialization.

I'm trying to build 6.5 right now (having trouble, you might have guessed ;)) and will positively study the scheduler. I do not believe thread priorities are supported in GHC, though.

I had a thought about using continuations but then I would also have to do selects on a few thousand file descriptors myself. Then I would have to decide which continuation to run based on priorities. I might as well patch the GHC scheduler to do what I need. Alternatively, I can just throw in the towel and rewrite the app in Erlang.

I haven't made a firm decision yet; I think I will make my records storable first as you can't get "any closer to the metal". If that does not work then I will just give up. I do have other apps to write in Haskell but it will be a pity if this one does not work out.

Thanks, Joel

On Dec 21, 2005, at 9:12 PM, Cale Gibbard wrote:
Unfortunately, the way we have it here still affords no guarantees, it just serves as a mechanism to help keep things from getting too bad. It's unclear to me that you really don't want any kind of locking. I think that you might do quite well to decide which threads get to run yourself at least to some extent. After all, you know more about their priorities and how long they have to complete their tasks than the GHC scheduler does.

On 12/21/05, Joel Reymont
I don't want any kind of locking, true. I need all bots to respond in time otherwise the poker server will sit them out. Eliminating the timeout on pickling does not eliminate the timeout overall, it just passes it to a different place.
One thread will go through serialization quickly but it will be too late by the time it sends a response to the server since it waited a few seconds for the chance to have a go at serialization. I'm trying to build 6.5 right now (having trouble, you might have guessed ;)) and will positively study the scheduler. I do not believe thread priorities are supported in GHC, though.
I had a thought about using continuations but then I would also have to do selects on a few thousand file descriptors myself. Then I would have to decide which continuation to run based on priorities. I might as well patch the GHC scheduler to do what I need. Alternatively, I can just throw in the towel and rewrite the app in Erlang.
I haven't made a firm decision yet, I think I will make my records storable first as you can't get "any closer to the metal". If that does not work then I will just give up. I do have other apps to write in Haskell but it will be a pity if this one does not work out.
I've only skimmed through this, so I may miss the point, but it sounds like a latency vs bandwidth discussion.

Let's say you push through 5000 requests in one second (i.e. you start 5000 processes and then exactly one second later all 5000 complete simultaneously). Now 50000 in ten seconds is actually the exact same throughput, but if your timeout is three seconds, then you'll get problems.

So your problem is that you only do a tiny bit of work for each process over and over; think of the scheduler just looping through the processes, giving each a tiny slice of time, over and over and over. It may take ten seconds for any individual process to complete, but the full throughput is still the same. When you increase the number of processes you won't see just the additional processes time out, but ALL processes (or at least many of them).

My suggestion is this: find out how many processes can be serviced at one time without getting timed out (i.e. find a good compromise between latency and bandwidth), then wrap up the computations in a semaphore containing exactly that many resources. I think you probably want this amount to be somewhere around the number of actual CPU cores you have. Having a process' computation wait for 99% of the timeout to start and then complete in the final 1% is no worse than having it slowly compute its result for the duration of the timeout. The difference is that if you run out of CPU juice, only some of the processes get hurt (they time out before they start), instead of all of them (the time it takes to compute each of them is more than the timeout because the CPU is spread too thin).

/S

-- Sebastian Sylvan +46(0)736-818655 UIN: 44640862
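A minimal sketch of the semaphore idea Sebastian describes, bounding how many threads run the expensive pickling section at once. The names maxInFlight and unpickleAndRespond are placeholders for illustration, not functions from the timeleak code:

    import Control.Concurrent.QSem
    import Control.Exception (bracket_)

    maxInFlight :: Int
    maxInFlight = 2   -- tune to roughly the number of CPU cores available

    -- Acquire a slot, run the action, release the slot even on exceptions.
    withSlot :: QSem -> IO a -> IO a
    withSlot sem = bracket_ (waitQSem sem) (signalQSem sem)

    main :: IO ()
    main = do
        sem <- newQSem maxInFlight
        -- each bot thread would wrap its pickling/unpickling like:
        --   withSlot sem (unpickleAndRespond handle)
        return ()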

Hello Joel,

Wednesday, December 21, 2005, 5:39:43 AM, you wrote:

JR> The issue is that with 1000 threads my serialization is taking a few
JR> seconds.

1) I think that this method of deserialization may be just inefficient. What is the plain deserialization time for this 50k?

2) Try to insert a lock around the thread startup code, something like this:

    withMVar x $ \_ -> do
        send "Hello"
        data <- get
        cmd <- unpickle data
        ...

-- Best regards, Bulat mailto:bulatz@HotPOP.com

On Dec 21, 2005, at 7:45 AM, Bulat Ziganshin wrote:
1) i think that this method of deserialization may be just inefficient. what is plain deserialization time for this 50k?
No idea. I know it's inefficient but this is not the issue. The issue is that with some strange tweaks it runs fast.
2) try to insert lock around thread startup code, something like this:
withMVar x $ \_ -> do
    send "Hello"
    data <- get
    cmd <- unpickle data
I have no "data <- get". Did you try this on your system to see if your fix works? We spent a lot of time tweaking this on #haskell yesterday and you need to be quite precise at what you tweak to get it right. I would appreciate if you could actually run the code. Also, what will this lock do and how far do you want it to extend? I do need the threads to run in parallel. Thanks, Joel -- http://wagerlabs.com/

On Wed, Dec 21, 2005 at 10:22:11AM +0000, Joel Reymont wrote:
On Dec 21, 2005, at 7:45 AM, Bulat Ziganshin wrote:
1) i think that this method of deserialization may be just inefficient. what is plain deserialization time for this 50k?
No idea. I know it's inefficient but this is not the issue. The issue is that with some strange tweaks it runs fast.
My guess is that without tweaks the threads allocate more, which increases the pressure on GC. Try to analyze the GC stats with and without tweaks.

Best regards
Tomasz

-- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland
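One way to get those numbers is the RTS's summary-statistics flag. This assumes the binary is the a.out from the earlier transcript and that it accepts RTS options on the command line:

    ./a.out trace.dat +RTS -sstderr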

This should not be the case. The amount of work is the same regardless and the issues seem to be with _timing_. Passing in trace that writes to the screen with a lock sort of slows things down. I encourage you to actually build the code and see it for yourself. Thanks, Joel On Dec 21, 2005, at 10:26 AM, Tomasz Zielonka wrote:
My guess is that without tweaks the threads allocate more, which increases the pressure on GC. Try to analyze the GC stats with and without tweaks.

On Wed, Dec 21, 2005 at 10:33:12AM +0000, Joel Reymont wrote:
My guess is that without tweaks the threads allocate more, which increases the pressure on GC. Try to analyze the GC stats with and without tweaks.
This should not be the case. The amount of work is the same regardless and the issues seem to be with _timing_. Passing in trace that writes to the screen with a lock sort of slows things down.
I was unclear. Of course, every thread does the same amount of work, but "without tweaks" the threads working together can manage to increase the amount of memory that is required at a given point in time.

Take an example - you have a computation that requires 1MB of memory. If you run it sequentially 1000 times, you still need only 1MB of memory. But if you spawn 1000 threads simultaneously and run the computation in each of them, you may need as much as 1000 * 1MB of memory. This may cause problems with cache, swapping, etc.

The other question is how GHC's garbage collection performance scales in such situations?

Best regards
Tomasz

-- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

From: Joel Reymont
To: Haskell-Cafe
Subject: [Haskell-cafe] Killer pickler combinators (was Time leak)
Date: Wed, 21 Dec 2005 02:39:43 +0000

The original paper is at http://research.microsoft.com/~akenn/fun/picklercombinators.pdf
My adaptation is at http://wagerlabs.com/timeleak.tgz. This is a full repro case, data included.
The issue is that with 1000 threads my serialization is taking a few seconds.
Inserting a delay or passing in trace (see README and code) gets serialization time down to 0-1s, the way it should be.
What gives? Try it for yourself and please let me know!
This has an easy explanation. I am learning Haskell and your programs are great for that. In this code:

    do TOD time1 _ <- getClockTime
       (kind, ix1) <- unpickle puCmdType ptr ix
       TOD time2 _ <- getClockTime
       (cmd', _) <- unpickle (puCommand kind) ptr ix1
       TOD time3 _ <- getClockTime

you get the time multiple times. So if you lock the whole operation, nothing else will be executed while you are in this code and therefore you don't have timeouts. But without the lock, the more threads you have, the longer each has to wait for the operation to finish, and therefore timeouts.

Since I see that Haskell executes all this in a single thread, the lock just ensures that your operation will not be interrupted, therefore no timeouts. But if you measure, say, the cumulative time of 5000 reads, you'll get the same problem again.

Let's say you have more than one worker thread and multiple CPUs. Only then will the situation be better. Perhaps you'll get somewhat better performance with an OS context switch, but to no avail, it is hampered by the same problem. You need more CPUs and worker threads in order to service such a large number of tasks. Just measure how many requests can be serviced in reasonable time and limit yourself to that. For a single CPU the lock will be OK, but for multiple CPUs you have to somehow tell Haskell to spawn multiple workers (more than one). I would be glad to know how to tell the runtime to use more than one worker thread.

Greetings, Bane.
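A rough sketch of the cumulative measurement Bane suggests, timing a whole batch of reads instead of each unpickle on its own. The readAndUnpickleOnce argument is a placeholder for one read+unpickle step, not a function from the timeleak code:

    import System.Time

    timeBatch :: Int -> IO () -> IO Integer
    timeBatch n readAndUnpickleOnce = do
        TOD t1 _ <- getClockTime
        sequence_ (replicate n readAndUnpickleOnce)
        TOD t2 _ <- getClockTime
        return (t2 - t1)   -- elapsed wall-clock seconds for n reads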

I'm not sure I buy this. Again, this helps:

    {-# NOINLINE lock #-}
    lock :: MVar ()
    lock = unsafePerformIO $ newMVar ()

    trace s = withMVar lock $ const $ putStrLn s

and then in read_

    cmd <- read h trace

trace is called _after_ all the timings in read so it should not affect the timings. You could basically say that the lock is at the end of read, after all the unpickling has been done.

The other interesting thing is that replacing trace with

    delay _ = threadDelay 1

does not solve the issue.

On Dec 21, 2005, at 2:33 PM, Branimir Maksimovic wrote:
In this code:

    do TOD time1 _ <- getClockTime
       (kind, ix1) <- unpickle puCmdType ptr ix
       TOD time2 _ <- getClockTime
       (cmd', _) <- unpickle (puCommand kind) ptr ix1
       TOD time3 _ <- getClockTime

you get the time multiple times. So if you lock the whole operation, nothing else will be executed while you are in this code and therefore you don't have timeouts. But without the lock, the more threads you have, the longer each has to wait for the operation to finish, and therefore timeouts.

From: Joel Reymont
To: "Branimir Maksimovic" CC: haskell-cafe@haskell.org Subject: Re: [Haskell-cafe] Killer pickler combinators (was Time leak) Date: Wed, 21 Dec 2005 14:51:42 +0000 I'm not sure I buy this. Again, this helps:
{-# NOINLINE lock #-}
lock :: MVar ()
lock = unsafePerformIO $ newMVar ()
trace s = withMVar lock $ const $ putStrLn s
and then in read_
cmd <- read h trace
trace is called _after_ all the timings in read so it should not affect the timings.
It does not affect the timings directly, but indirectly. You have putStrLn, which performs some work, and that is the point of serialization. Try a no-op instead of putStrLn and you should get timeouts again. I'm sure you'll get even better operation timings if you lock around the whole timing operation instead. This is valid for a single CPU only. In this way you actually get the same throughput, with or without the lock. It's just that you have to measure cumulative reads/sec, which will be the same with or without the lock for a single CPU/worker thread. Again, the only way to improve performance is to use more than one CPU/worker thread.

Greetings, Bane.
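A self-contained sketch of the no-op experiment Bane describes, reusing the lock Joel posted; if the timeouts come back with this version, it is the work done by putStrLn (not the lock itself) that was pacing the threads:

    import Control.Concurrent.MVar
    import System.IO.Unsafe (unsafePerformIO)

    {-# NOINLINE lock #-}
    lock :: MVar ()
    lock = unsafePerformIO $ newMVar ()

    -- same serialization point as trace, but no I/O inside the lock
    traceNoOp :: String -> IO ()
    traceNoOp _ = withMVar lock $ const $ return ()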
You could basically say that the lock is at the end of read, after all the unpickling has been done. The other interesting thing is that replacing trace with
delay _ = threadDelay 1
does not solve the issue.
On Dec 21, 2005, at 2:33 PM, Branimir Maksimovic wrote:
In this code:

    do TOD time1 _ <- getClockTime
       (kind, ix1) <- unpickle puCmdType ptr ix
       TOD time2 _ <- getClockTime
       (cmd', _) <- unpickle (puCommand kind) ptr ix1
       TOD time3 _ <- getClockTime

you get the time multiple times. So if you lock the whole operation, nothing else will be executed while you are in this code and therefore you don't have timeouts. But without the lock, the more threads you have, the longer each has to wait for the operation to finish, and therefore timeouts.
participants (7)

- Branimir Maksimovic
- Bulat Ziganshin
- Cale Gibbard
- dons@cse.unsw.edu.au
- Joel Reymont
- Sebastian Sylvan
- Tomasz Zielonka