[GHC] #7606: Stride scheduling for Haskell threads with priorities

#7606: Stride scheduling for Haskell threads with priorities -----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Component: Runtime System Version: 7.7 | Keywords: Os: Unknown/Multiple | Architecture: Unknown/Multiple Failure: None/Unknown | Blockedby: Blocking: | Related: -----------------------------+----------------------------------------------

Currently, GHC uses a round-robin scheduler for Haskell threads, with some heuristics for when threads should be bumped to the front of the queue. This patch set replaces this scheduler with 'stride scheduling', as described by Waldspurger and Weihl '95, which is an efficient, deterministic method for scheduling processes with differing priorities. Priorities are assigned by giving 'tickets' to threads; a thread with twice as many tickets as another will run twice as often. I'd like to replace the round-robin scheduler completely with this scheduler. Here are nofib benchmarks comparing the old scheduler to the new:

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -0.0%    -52.2%    -18.8%    -18.6%    -83.1%
            Max          +1.0%     +4.2%     +4.9%     +5.1%     +7.7%
 Geometric Mean          +0.1%     -2.8%     -0.9%     -0.9%     -2.9%
}}}

Here are some technical details:

 * Without any changes to ticket values, scheduling behavior should be functionally identical to round-robin. (By default, new threads, including the IO thread, get allocated the maximum number of tickets possible.) This is not quite identical, since our heap does not have the FIFO property (see below).
 * The current patch set uses a very simple (e.g. undergrad level) resizable-array-backed heap to implement the priority queue; we can play some tricks to reduce the memory footprint of the priority queue (e.g. using the container_of macro to eliminate the extra keys store), and a fancier data structure would make it easier for us to surgically remove entries or reweight them, but I wanted to keep overhead low. If anyone has a pet implementation of priority queues in C, feel free to swap it in. Right now, this only affects the uses of promoteInRunQueue() in Messages.c; I still need to check if #3838 has regressed.
 * We get three new primops: `setTickets#`, `getTickets#` and `modifyTickets#`. We don't support creating threads with specific numbers of tickets (mostly because that would have added an annoyingly large set of extra primops); instead, you're expected to spawn a thread which gets the max-ticket allocation, and then weight it down.
 * `_link` is no longer used for linking TSOs in the run queue. I have tried my best to stamp out any code which operated on this assumption, but I may have missed some.
 * Modifying a TSO's tickets takes out the scheduler lock; the hope is that this operation is quick and rare enough that a global lock here is not too bad.
 * We are unfortunately stuck with some magic constants w.r.t. ticket values: 1 << 20 is the maximum number of tickets our implementation is hard-coded to support.
 * Sleeping and blocked tasks do not get any 'bonus' for being blocked.
 * In an ideal world, when a thread hits a black hole, it should temporarily give its tickets to the thread evaluating the black hole, so it will unblock more quickly.
   Unfortunately, implementing this is pretty complicated (the black-hole-evaluating thread could die; or it could get stuck on a black hole itself and need to gift its tickets; and it shouldn't be able to give away the tickets it was gifted), so this implementation leaves that out. Similar semantics for MVars are probably possible, but will require userland assistance too.

I haven't prepared a patch to base yet with a user-level API, but here is a proposed draft:

{{{
type Tickets = Int

-- | Maximum number of tickets we support a thread having. (Currently 2 ^ 20.)
-- Note that this doesn't bound the *global* maximum tickets.
maxTickets :: Tickets

-- | Changes the number of tickets allocated to a thread. The ticket count must
-- not be less than or equal to zero, or greater than maxTickets. (Corresponds
-- to the setTickets# primop.)
setTickets :: ThreadId -> Tickets -> IO ()

-- | Returns the number of tickets currently allocated to a thread.
-- (Corresponds to the getTickets# primop.)
getTickets :: ThreadId -> IO Tickets

-- | Atomically performs a linear transformation on the number of tickets of a
-- thread; e.g. scaling the number of tickets by a rational number, adding
-- another fixed set of tickets, and then returning the number of 'leftover'
-- tickets from the operation; e.g. if the net amount of tickets is reduced,
-- then the returned result is positive; if the net amount of tickets is
-- increased, the returned result is negative. In the absence of concurrent
-- threads, the following property holds for all t, m and x:
--
--      do r  <- getTickets t
--         d  <- modifyTickets t m x
--         r' <- getTickets t
--         return (r == r' + d)
--
-- If the scaling would reduce the number of tickets below zero or increase the
-- number of tickets beyond maxTickets, this function will instead fail with an
-- exception. This function will be subject to some rounding error; powers of
-- two are, however, likely to be exact. (Corresponds to the modifyTickets#
-- primop; note that the sentinel value for failure is maxTickets + 1, since it
-- is impossible for a thread's ticket value to change by that much.)
modifyTickets :: ThreadId -> Ratio Int -> Tickets -> IO Tickets

-- | Forks a new thread, transferring some percentage of tickets from the
-- current thread to it (so the net number of tickets stays constant.) Fails if
-- the rational is greater than 1 or less than or equal to zero, or if there
-- are not enough tickets in the current thread.
forkIOWith :: IO a -> Ratio Int -> IO ThreadId
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
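For illustration, here is how the draft API above might be used. None of these functions exist in base today; the names and signatures are taken purely from the proposal:

{{{
import Control.Concurrent (forkIO)
import Data.Ratio ((%))

-- Hypothetical usage of the proposed API: setTickets, maxTickets and
-- forkIOWith are the draft names from above, not real library functions.
demo :: IO ()
demo = do
  -- Spawn untrusted work at the default (max) ticket allocation, then
  -- weight it down to roughly 1/8 of an ordinary thread's share.
  worker <- forkIO untrustedWork
  setTickets worker (maxTickets `div` 8)

  -- Or carve tickets out of the current thread, keeping the global ticket
  -- count constant: the child below receives 25% of our tickets.
  _logger <- forkIOWith logLoop (1 % 4)
  return ()
  where
    untrustedWork = return ()   -- stand-ins for real work
    logLoop       = return ()
}}}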

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Changes (by simonmar):

 * difficulty:  => Unknown
 * milestone:  => 7.8.1

Comment:

Awesome! Ok, to get started on the reviewing process, I'll jot down some concerns.

 * First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
 * Run all the tests in `testsuite/concurrent`. They aren't run by default in a validate.
 * The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.
 * Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?
 * Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.
 * `promoteInRunQueue`: it would be a shame to lose this, but as long as we're sure that #3838 hasn't regressed it should be ok.
 * could we have some more docs on the scheduler functions, `capPassUpdate` etc. please?
 * Looks like we should do `annulTSO` in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
 * I don't see why `sched_mutex` is needed in `setTickets` and co. Could you explain?

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by simonpj):

Can you also say a bit more about

 * What the goal of this change is.
 * Roughly how the new scheduling algorithm achieves the goal.
 * There seems to be something about priorities involved. Who sets those priorities? The programmer? The runtime system?
 * What about priority inversion?

Thanks

Simon

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): Summary: SMP benchmarks show massive degradation of performance, so probably one of the performance bugs Simon Marlow pointed out is really killing us and needs to be fixed. Replying to [comment:1 simonmar]:
* First, nofib has zero concurrency benchmarks in it, so the results aren't very meaningful. Still, you seem to have some outliers in there - any idea why? nofib/smp has some concurrency benchmarks that I normally use when making changes to the scheduler.
Oops, I forgot that normal nofib doesn't do smp. SMP doesn't look so good, and it seems likely that some of the performance problems you've noted below might be contributing. I'll investigate more thoroughly...

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.1%     +0.0%     +7.7%     +7.6%     +0.0%
    callback002          +0.1%     +0.0%    +10.7%    +10.7%     +0.0%
           chan          +0.1%     +0.0%    -29.0%    -29.1%    -71.0%
          sieve          +0.1%     +0.0%   +141.6%   +142.0%    +89.3%
     threads001          +0.1%     +0.0%    +12.7%    +12.6%     +0.0%
     threads003          +0.1%     +9.1%    +44.7%    +44.6%    +25.8%
     threads006          +0.1%     +0.0%    +30.3%    +30.7%    +46.6%
     threads007          +0.1%     -0.2%    +17.0%    +17.0%     -6.8%
--------------------------------------------------------------------------------
            Min          +0.1%     -0.2%    -29.0%    -29.1%    -71.0%
            Max          +0.1%     +9.1%   +141.6%   +142.0%    +89.3%
 Geometric Mean          +0.1%     +1.1%    +22.5%    +22.5%     -0.7%
}}}
 * Run all the tests in `testsuite/concurrent`. They aren't run by default in a validate.

Yup, I've been running a full testsuite run and will be checking those results today.

 * The patch adds no less than 4 fields to the TSO struct, which is quite a lot. We should benchmark carefully to make sure this isn't a problem.

I can reduce this by one by paying the cost of a division every time we're scheduled (since stride = STRIDE1/tickets, always); we can reduce it by one more by observing that pass and remain are never both set at the same time and turning it into a union. I'm not sure what benchmarks in particular would start failing due to the increase in size, other than just general badness.
 * Why stride scheduling rather than something else, e.g. Linux's CFS? Wouldn't CFS address the issue of giving more time to blocked threads? What about inaccuracies due to the time it takes for a thread to respond to the timer signal?

Here's the secret: Linux's CFS is stride scheduling! The primary differences here are the backing data structure (they use a red-black tree while we're using a simple heap) and how timeslices are measured (Linux counts nanoseconds; we stupidly just consider a jump to the scheduler one quantum). It is possible to implement CFS-style sleeper fairness by modifying what happens to a process's pass when it sleeps (a rough sketch follows below); I'll post a version with those changes.

 * Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.

Could be a problem. If we switched to a tree data structure, we could probably recover this behavior (the GC will only have to traverse log n entries).
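To make the sleeper-fairness idea above a bit more concrete, here is roughly how a pass can be saved and restored across a block/wake-up in Waldspurger and Weihl's formulation (an illustrative sketch only; the `remain` name mirrors the TSO field mentioned earlier, and the actual patch may handle sleepers differently):

{{{
-- On block: remember how far ahead of the queue's global pass we were.
saveRemain :: Int -> Int -> Int
saveRemain globalPass myPass = myPass - globalPass

-- On wake-up: rejoin relative to the *current* global pass, so time spent
-- asleep neither penalises nor rewards the thread.
restorePass :: Int -> Int -> Int
restorePass globalPass remain = globalPass + remain

-- A CFS-style sleeper bonus would clamp or shrink 'remain' here, letting a
-- freshly woken thread run a little sooner than strict stride order dictates.
}}}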
* `promoteInRunQueue`: it would be a shame to lose this, but as long as we're sure that #3838 hasn't regressed it should be ok.
There is no particular reason why we can't implement this, and if we switch to a red-black tree it gets easier.
* could we have some more docs on the scheduler functions, `capPassUpdate` etc. please?
OK.
* Looks like we should do `annulTSO` in a better way. These fields should be empty normally, set when a thread is blocked, and unset again when the thread is unblocked.
OK.
* I don't see why `sched_mutex` is needed in `setTickets` and co. Could you explain?
setTickets access can be concurrent, and since both a read and a write occur, it needs to be protected by a lock. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): Replying to [comment:2 simonpj]:
* What the goal of this change is.
The goal of the change is to allow Haskell programmers to control how frequently the threads they create are run: for example, some threads may be important and should be run as frequently as possible, while other threads may be created by an untrusted user who should not be allowed to thread-bomb the system.
 * Roughly how the new scheduling algorithm achieves the goal.

Stride scheduling operates by replacing our old doubly-linked-list run queue with a priority queue. The priorities are 'passes': every time a thread runs, its pass increments by its stride (computed from its ticket allocation), so a thread with a low stride (many tickets) will run several times before another thread's pass catches up. (A small sketch follows below.)

 * There seems to be something about priorities involved. Who sets those priorities? The programmer? The runtime system?

The priorities are set by the programmer using setTickets and related functions.
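To make that concrete, here is a toy model of the pass/stride bookkeeping (illustrative Haskell only; the real code lives in the RTS run queue, and the names here are invented):

{{{
import Data.List (minimumBy)
import Data.Ord  (comparing)

stride1 :: Int
stride1 = 2 ^ 20   -- matches the hard-coded maximum ticket count

data Client = Client { name :: String, tickets :: Int, pass :: Int }

stride :: Client -> Int
stride c = stride1 `div` tickets c

-- One scheduling decision: pick the runnable client with the smallest pass,
-- run it for a quantum, and advance its pass by its stride.  A client with
-- twice as many tickets has half the stride, so it gets picked twice as often.
schedule :: [Client] -> (String, [Client])
schedule cs = (name chosen, map advance cs)
  where
    chosen = minimumBy (comparing pass) cs
    advance c
      | name c == name chosen = c { pass = pass c + stride c }
      | otherwise             = c
}}}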
* What about priority inversion?
Without any change to priorities, scheduling behavior is intended to be identical to what it was previously. Priority inversion can occur when a thread blocks on another; this patch doesn't address that yet (though I've described above how it could be done for BLACKHOLEs). -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

 * Currently, if there's a very long run queue, a minor GC won't have to traverse a lot of it because it will be in the old generation, but here the GC has to look at every element of the priority queue. We should make a benchmark with a long run queue and see if that's an issue.

Based on some investigation, this appears to be one of the primary causes of the slowdown. I checked whether this was an issue by comparing how often markCapability calls occurred versus actual evacuations performed on TSO objects. Although the number of markCapability calls is the same in both cases, TSO objects get evacuated a lot more frequently in the new world.

One of the things that I'm not really clear about is how to tell whether I can skip evacuating an object. This part of the GC code is pretty opaque to me, and while I feel like there should be something like a log(n) solution, I'm not seeing an implementation path.

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): I would expect to see lots more calls to evacuate on TSOs, but in most cases the TSO should be in the old generation, so evacuate will just return (in the case of a minor GC, that is). Are you counting actual copies, or just calls to evacuate? It would probably be a good idea to add some diagnostics to track the size of the run queue too. I'm not completely convinced that this is your problem yet. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

You're right, it's not (or, at least, the situation is more complicated). With the following gratuitous change:

{{{
diff --git a/rts/Capability.c b/rts/Capability.c
index 811df58..b4a1ec0 100644
--- a/rts/Capability.c
+++ b/rts/Capability.c
@@ -1007,6 +1007,10 @@ markCapability (evac_fn evac, void *user, Capability *cap,
     // or fewer Capabilities as GC threads, but just in case there
     // are more, we mark every Capability whose number is the GC
     // thread's index plus a multiple of the number of GC threads.
+    StgTSO *tso;
+    for (tso = cap->run_queue_hd; tso != END_TSO_QUEUE; tso = tso->_link) {
+        evac(user, (StgClosure **)(void *)&tso);
+    }
     evac(user, (StgClosure **)(void *)&cap->run_queue_hd);
     evac(user, (StgClosure **)(void *)&cap->run_queue_tl);
 #if defined(THREADED_RTS)
}}}

We only see a very minor bump in runtime:

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
          sieve          +0.0%     +0.0%     +0.7%     +0.7%     +0.0%
}}}

I've been focusing on sieve, since it had the most slowdown. What I do notice when I add some diagnostics is that the run queue starts with 100 threads in it, and that the program runs a lot more slowly when the run queue is large than when it is small. When I count actual copies, TSOs are copied a lot more frequently in the new code. If you have some ideas on what to look at, that would be great!

I think what I am going to try to do is implement another version which uses sorted doubly linked lists instead, and see if the same problems crop up. (I also realized that we're leaking memory for the run queues, since I never free them. Oops!)

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

Measuring the effect of adding five 64-bit words to the TSO (and no other changes):

{{{
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
          sieve          +0.0%     +0.0%     +0.7%     +0.8%     +0.0%
}}}

I.e. it hurts, but is not the root cause of the major problems.

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): Maybe a good idea is to replace the malloc'd array with a MUT_ARR_PTR; then we only evacuate one thing (and all of the card marking happens automatically). Unfortunately, there don't seem to be any amenities for manipulating these from RTS-land. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Changes (by jwlato): * cc: jwlato@… (added) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Changes (by WrenThornton): * cc: wren@… (added) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): Here is an alternate patch, which uses sorted linked lists. It is mostly interesting because it also exhibits the same slowdown as the original patch, even though it should hypothetically act the same way as the original scheduler (except for some constant overhead.) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): Perhaps you should check that the scheduling behaviour is the same. e.g. in sieve, if we manage to run a thread to completion in one time slice, then it won't need to be copied by the GC, but if it gets preempted then it will get copied. A glance at the ThreadScope profile, or even just the eventlog, might be illuminating. Also check the `+RTS -s` stats, a good sanity check is whether the amount of allocation is the same (if not, something strange is going on). And the GC stats will give you more clues. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): BTW, I like the idea of using a `MUT_ARR`, but the card marking is probably too coarse to help much (128 entries per card). -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

OK, I have a much better sense for where the performance problems are coming from.

----

Adding extra words to the TSO has no impact on most smp benchmarks, but really kills threads006 (up to 100% slowdown); this is not too surprising since this benchmark involves creating 200,000 threads. A representative stat with a small TSO is

{{{
224198416 bytes, 425 GCs (416 + 9), 0/0 avg/max bytes residency (0 samples), 699092656 bytes GC work, 279M in use, 0.00 INIT (0.00 elapsed), 0.02 MUT (0.02 elapsed), 0.33 GC (0.33 elapsed), 0.12 GC(0) (0.12 elapsed), 0.21 GC(1) (0.21 elapsed)
}}}

A representative stat with a big TSO is

{{{
224198416 bytes, 425 GCs (415 + 10), 0/0 avg/max bytes residency (0 samples), 853296496 bytes GC work, 426M in use, 0.00 INIT (0.00 elapsed), 0.00 MUT (0.00 elapsed), 0.47 GC (0.47 elapsed), 0.12 GC(0) (0.12 elapsed), 0.34 GC(1) (0.35 elapsed)
}}}

What I don't understand is why the memory in use blows up by 150MB; by my count, the extra words should only be adding something like 10MB of overhead; my best guess is that I am actually nudging TSO size over some invisible threshold (maybe the big object threshold).

----

From there, the next two bugs have to do with subtle scheduler changes. The first bug I have a clean fix for (it was a plain old bug): 'sieve' slows down a lot if threads which HeapOverflow don't get put back in front of the run queue. Fixing that dramatically improves runtime for all of the benchmarks except 'threads003': we can make that performance problem go away if we force threads to get appended to the end of the run queue (of course, stride scheduling won't work in that case!) I'm still investigating these.

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): Well, the magic threshold (on x86_64) is apparently one StgWord64 and one StgWord32 (which I believe ); add one more StgWord32 and the memory usage blows up. (That is, increasing the struct's size to 16 64-bit words. 15 and 14 work fine.) I’m a bit surprised we were so close to the limit; I guess I’ll have to figure out how to do this with only two variables, or see if I can condense other parts of the TSO struct. (Actually, we probably have even less to work with since Windows adds another StgWord32 and PROFILING adds another pointer field... so we literally have zero extra fields! Ouch!) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

I wonder if we're hitting this:

{{{
/* The size argument we are given includes all the per-thread
 * overheads:
 *
 *    - The TSO structure
 *    - The STACK header
 *
 * This is so that we can use a nice round power of 2 for the
 * default stack size (e.g. 1k), and if we're allocating lots of
 * threads back-to-back they'll fit nicely in a block.  It's a bit
 * of a benchmark hack, but it doesn't do any harm.
 */
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

OK, I've fixed all of the performance problems, except for the TSO one.

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.0%     +0.0%     -0.6%     +0.0%     +0.0%
    callback002          +0.0%     +0.0%     +0.2%     +0.0%     +0.0%
           chan          +0.0%     +0.0%     +0.2%     +0.2%     +3.2%
          sieve          +0.0%     +0.0%     -0.4%     -0.1%     +0.0%
     threads001          +0.0%     +0.0%     +0.8%     +0.8%     +0.0%
     threads003          +0.0%     -0.0%     +0.1%     +0.3%     +0.0%
     threads006          +0.0%     +0.0%    +35.1%    +34.9%    +52.7%
     threads007          +0.0%     +0.0%     -0.1%     +0.0%     +0.0%
}}}

I don't think the TSO problem (in threads006) has to do with influencing stack size; when I run with GC stats, I don't see any thread overflows (the stack is only one word smaller), and the overall allocation remains the same (TSO ends up being 0x10 aligned, but I don't see why that would cause performance problems...)

{{{
big (slows down)
  size = 0x80
  sizeOf(StgTSO) = 0x80 (0x10 * 8)
  free = 0x7ffff6b0a800
  allocate(0x70) = 0x7ffff6b0a800 (stack)
  free = 0x7ffff6b0ab80 (+0x380 == 0x70 * 8)
  allocate(0x10) = 0x7ffff6b0ab80 (tso)
  free = 0x7ffff6b0ac00 (+0x80)
  # overall, free += 0x400

medium (is fine)
  size = 0x80
  sizeOf(StgTSO) = 0x78 (0xf * 8)
  free = 0x7ffff6b0a800
  allocate(0x71) = 0x7ffff6b0a800 (stack)
  free = 0x7ffff6b0ab88 (+0x388 == 0x71 * 8)
  allocate(0xf) = 0x7ffff6b0ab88 (tso)
  free = 0x7ffff6b0ac00 (+0x78)
  # overall, free += 0x400
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Changes (by liyang): * cc: hackage.haskell.org@… (added) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): Great progress! If we used actual time rather than scheduler quanta, wouldn't that fix the heap-check case too? Calling `gettimeofday` is very cheap these days, cheap enough that we can call it every time we emit an event in the event log, so I'm sure it won't be a problem to call it in the scheduler. My guess is that with `threads003` we're just pushing the size up enough that the GC has to touch another cache line per TSO. If that's the case, then I'm not ''too'' worried, since this is a microbenchmark and won't affect real world performance much. But we should verify that that's what's going on. It can't be that we're allocating another stack chunk, because that would show up in the allocations. Check whether the amount of memory copied by the GC goes up in proportion to the amount you added to the TSO struct. If it is a lot more, then something else is going on. Also try some measurements with `perf`? Once this is squashed (or accounted for), we can move on to checking that priorities work right, and looking at the primops and the user API. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
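As an aside, the time-based accounting suggested above would presumably boil down to a pass update along these lines (a sketch of CFS-style charging; the constant and parameter names are invented, not taken from the patch):

{{{
import Data.Word (Word64)

stride1 :: Word64
stride1 = 2 ^ 20   -- the same hard-coded ticket ceiling as elsewhere

-- Charge a thread for the time it actually used rather than one fixed
-- quantum per trip through the scheduler; a thread that returns quickly
-- (e.g. from a heap check) is then charged proportionally less.
chargeElapsed :: Word64   -- current pass
              -> Word64   -- nanoseconds used in this slice
              -> Word64   -- tickets held by the thread
              -> Word64   -- new pass
chargeElapsed p elapsedNs tickets = p + (elapsedNs * stride1) `div` tickets
}}}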

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

Copied memory is affected disproportionately:

{{{
[ezyang@hs01 threads006]$ ./Main-big 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     853,296,496 bytes copied during GC
     197,453,072 bytes maximum residency (10 sample(s))
      25,701,000 bytes maximum slop
             426 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-medium 200000 +RTS -s
     224,198,416 bytes allocated in the heap
     690,756,096 bytes copied during GC
     125,998,552 bytes maximum residency (9 sample(s))
      24,121,032 bytes maximum slop
             270 MB total memory in use (0 MB lost due to fragmentation)
}}}

That's a jump of over 150 MB. I think I have an important clue, though; consider these three runs on the big TSO executable:

{{{
[ezyang@hs01 threads006]$ ./Main-big 182260 +RTS -s
     204,318,544 bytes allocated in the heap
     622,311,352 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,499,512 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182261 +RTS -s
     204,319,648 bytes allocated in the heap
     622,313,432 bytes copied during GC
     110,659,024 bytes maximum residency (9 sample(s))
      23,453,024 bytes maximum slop
             239 MB total memory in use (0 MB lost due to fragmentation)

[ezyang@hs01 threads006]$ ./Main-big 182262 +RTS -s
     204,320,752 bytes allocated in the heap
     791,958,776 bytes copied during GC
     169,695,072 bytes maximum residency (10 sample(s))
      23,503,128 bytes maximum slop
             395 MB total memory in use (0 MB lost due to fragmentation)
}}}

Wow! In the case of the 182262nd thread, it's literally the straw that broke the camel's back. I went ahead and logged GC events.
The breakpoint occurs here:

{{{
----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large     Live     Slop
       Blocks     Bytes            Objects
----------------------------------------------------------
    0   53762         0       2        0      1184      7008
    1   53762         3   53766       94 196721432  23504104
----------------------------------------------------------
                                         196722616  23511112
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 53766 blocks ( 210.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         :  6332 blocks (  24.7 MB)
  total        : 60228 blocks ( 235.3 MB)

alloc new todo block 0x7f13c7aff000 for gen 0
alloc new todo block 0x7f13c7ca1000 for gen 1
[snip]

----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large     Live     Slop
       Blocks     Bytes            Objects
----------------------------------------------------------
    0   82858         0       2        0       104      8088
    1   82858         1   45569        3 169695176  16955448
----------------------------------------------------------
                                         169695280  16963536
----------------------------------------------------------

Memory inventory:
  gen 0 blocks :     1 blocks (   0.0 MB)
  gen 1 blocks : 45569 blocks ( 178.0 MB)
  nursery      :   129 blocks (   0.5 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  free         : 53841 blocks ( 210.3 MB)
  total        : 99540 blocks ( 388.8 MB)
}}}

Inside the snip, approximately a hundred megablocks are allocated as TODO space for scavenge, a la

{{{
allocated 1 megablock(s) at 0x7f13b5a00000
alloc new todo block 0x7f13b5a04000 for gen 1
increasing limit for 0x7f13b5a04000 to 0x7f13b5a04800
}}}

It seems we run out of todo blocks while evacuating, and then essentially need to copy our entire heap. Ouch! Unfortunately, I don't know enough to say if this is a GC bug or not.

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): Correct me if I'm wrong, but in the "slow" case you're just doing one more major GC than in the fast case, right? The extra thread just pushed allocation over the threshold for the next GC. If that's the case, then this is not a problem at all. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): Oh yes, you're right! (I was looking at the Gen 0 counts; but for Gen 1 it went from 9 to 10.) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

OK, here's where we stand:

 * The sorted list implementation performs comparably to the original when no priority changes are involved. However, it does not apply sleeper fairness (give threads that were sleeping extra timeslices), and when sleeper fairness is applied, we suffer from performance degradation (since operations on the sorted list are no longer constant time). So if this implementation wants to go anywhere, it needs to be upgraded into something like a skip list, which can efficiently search and remove things inside the structure.
 * I've stamped out some of the performance bugs from the array implementation (mostly by adding an extra queue which we use to handle "must run now!" threads). However, we suffer from a huge bottleneck in deleteMinPQueue(), which, according to perf, takes up 22% of the runtime. Apparently logarithmically many comparisons/memory writes cost a lot when a lot of threads are involved. I am going to look for something higher-performance.

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

Some other interesting results about sleeper fairness: for most benchmarks, it doesn't have much effect, but threads003 displays some interesting behavior (this comparison is before and after sleeper fairness is applied):

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
     threads003          -0.0%     +9.0%     -8.8%     -8.9%    +20.2%
}}}

So, changing how we schedule blocked threads (i.e. giving them a priority boost) results in a 10% speedup, but we end up using 20% more memory! A very interesting tradeoff, if you ask me...

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

Erm, sorry, that's not quite right. Here is a better set of numbers comparing without and with sleeper fairness.

{{{
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
    callback001          -0.0%     +0.0%     -8.8%     -8.6%     +0.0%
    callback002          -0.0%     +0.0%     -2.9%     -3.0%     +0.0%
           chan          -0.0%     +0.0%     -0.3%     -0.1%     -0.6%
          sieve          -0.0%     +0.0%     -0.9%     -0.7%     +0.0%
     threads001          -0.0%     +0.0%     -1.7%     -1.8%     +0.0%
     threads003          -0.0%     +0.0%     -0.2%     -0.2%     +3.4%
     threads006          -0.0%     +0.0%     +0.6%     +0.0%     +0.0%
     threads007          -0.0%     +7.1%     -3.1%     -2.8%    -47.0%
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

OK, here's a new progress report.

The red-black tree implementation is now behaving very consistently and with only 1-3% overhead. (Right now, it uses a malloc'd free list of red-black tree nodes; this ended up being quicker than using heap-allocated stuff.) Unfortunately, sleeper fairness (with the objective of giving sleepy threads a bit of a boost) makes all of the benchmarks perform worse, in part due to the fact that sleepy threads increase the overhead of managing processes, since they tend to have different priorities. I don't at the moment have a good way of measuring latency change; see #7659 for a bug related to that. Additionally, the impact of scheduling priorities is fairly hard to demonstrate on some test programs I've constructed unless you insert explicit yields into your programs. One further downside is that the rb-trees are synthetically constructed to get a performance boost when all threads have the same priority; when threads have different priorities you will immediately see a slowdown because more rb-tree manipulations will become necessary.

The array heap implementation has lower overhead (0-2%) but is far more inconsistent; some benchmarks do much worse. Right now it's still pretty unclear where all of this overhead is coming from.

The end conclusion is as follows: if red-black trees end up being the dominant implementation, then there is still a chance that we can replace the current scheduler wholesale, since its runtime characteristics are very similar to the current scheduler. (Sleeper fairness will need to continue to be a configurable option, assuming we can prove that it improves latency.) However, if array heaps end up having much better properties on prioritized workloads (test cases of which I have not constructed yet), I find it highly implausible that we'll be able to finesse the implementation to work identically to the current scheduler, and we will need to support both.

On the plus side, because I have made so many implementations, I have a good sense of how to refactor the API properly. I'll attach some patches, but the comments still need some cleaning up.
rbtrees:

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.2%     +0.0%     +2.2%     +1.9%     -2.0%
    callback002          +0.2%     +0.0%     +1.7%     +1.6%     +0.0%
           chan          +0.2%     +0.0%     +2.1%     +2.1%     +0.6%
          sieve          +0.2%     +0.0%     +2.5%     +2.6%     +0.0%
     threads001          +0.2%     +0.0%     +1.7%     +1.5%     +0.0%
     threads003          +0.2%     +0.0%     -8.9%     -8.8%    +12.7%
     threads006          +0.2%     +0.0%     +1.6%     +0.8%     -3.2%
     threads007          +0.2%     -0.0%     +2.9%     +3.0%     -0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -0.0%     -8.9%     -8.8%     -3.2%
            Max          +0.2%     +0.0%     +2.9%     +3.0%    +12.7%
 Geometric Mean          +0.2%     +0.0%     +0.7%     +0.5%     +0.9%
}}}

heap arrays (using emulated ordering in order to recreate thread conditions):

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.1%     +0.0%     +2.1%     +2.0%     -2.0%
    callback002          +0.1%     +0.0%     +1.3%     +1.3%     +0.0%
           chan          +0.1%     +0.0%     +0.5%     +0.4%     +0.0%
          sieve          +0.1%     +0.0%     -0.7%     -0.7%     +0.0%
     threads001          +0.1%     +0.0%     -0.6%     -0.7%     +0.0%
     threads003          +0.1%     +0.0%    +62.0%    +62.4%     -3.2%
     threads006          +0.1%     +0.0%     +6.0%     +5.8%     -3.2%
     threads007          +0.1%     +0.0%     +6.7%     +6.8%     -8.3%
--------------------------------------------------------------------------------
            Min          +0.1%     +0.0%     -0.7%     -0.7%     -8.3%
            Max          +0.1%     +0.0%    +62.0%    +62.4%     +0.0%
 Geometric Mean          +0.1%     +0.0%     +8.2%     +8.2%     -2.1%
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): Looking good, I'm still excited about this work. I really hope you'll put some of your benchmarks into `nofib/smp` - the current set may well be giving us a skewed view of what's important. I definitely agree that we want latency benchmarks too, but I haven't thought about how to construct them. Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by ezyang):

Yep, I think any new benchmarks will probably be pretty important, maybe even more so than the actual stride scheduling :)

> Have you thought about keeping the red-black tree nodes in the heap? That's what I would do. Heap allocation is much faster than malloc, and reclamation is free. You would get some advantages for free, such as when parts of the tree are in the old generation they won't get touched during a minor GC.

I've tried it, jerry-rigging StgMutArrPtrs. I concluded we'll need to make a custom closure (ugh!) or make some improvements to how GHC garbage collects mutable objects (actually, I was planning on filing a bug; here we go: #7662). In the case of mutable arrays, there are some funny characteristics: the write barriers are relatively cheap, but they all get scanned because mutable arrays are permanently on the mutable list, so we don't end up getting any benefits. (There aren't any other existing mutable objects which can be jerry-rigged here.) Adding things to the mutable list may end up being expensive in the long run.

The current malloc'd red-black implementation is pretty efficient; we maintain a free list and allocate the nodes in blocks. I haven't implemented reclamation yet, but even if we leak all of the free nodes it will only consume space up to the high watermark of the number of simultaneous threads.

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by simonmar): Do you have to use mutable objects? Why not do an Okasaki-style immutable red-black tree? In the heap, allocation is very cheap, mutation is not. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
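For reference, the Okasaki-style insertion being suggested is quite short; this is the standard textbook version (not code from this patch), though a scheduler would additionally need an efficient delete-min, which takes a little more work:

{{{
data Color = R | B
data RB a  = E | T Color (RB a) a (RB a)

insert :: Ord a => a -> RB a -> RB a
insert x t = blacken (ins t)
  where
    ins E = T R E x E
    ins s@(T c l y r)
      | x < y     = balance c (ins l) y r
      | x > y     = balance c l y (ins r)
      | otherwise = s
    blacken (T _ l y r) = T B l y r
    blacken E           = E

-- Okasaki's balance: any black node with a red child and red grandchild is
-- rearranged into a red node with two black children.
balance :: Color -> RB a -> a -> RB a -> RB a
balance B (T R (T R a x b) y c) z d = T R (T B a x b) y (T B c z d)
balance B (T R a x (T R b y c)) z d = T R (T B a x b) y (T B c z d)
balance B a x (T R (T R b y c) z d) = T R (T B a x b) y (T B c z d)
balance B a x (T R b y (T R c z d)) = T R (T B a x b) y (T B c z d)
balance c l x r = T c l x r
}}}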

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by ezyang): No we don’t, but all of the C implementations that I can easily crib off of are mutable :) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7606: Stride scheduling for Haskell threads with priorities ---------------------------------+------------------------------------------ Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: None/Unknown Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------

Comment(by njd):

I'd like to add a word of caution to the inclusion of any form of priority/weighted scheduling into GHC at this time.

Scheduling (unless it discards work) does not, a priori, change the total amount of work that has to be done, just its order. Such scheduling strategies don't create computational power, they just distribute the "disappointment" of not being first... This would imply that all of the timing changes reported in this thread are the result of interactions with other resources - different patterns of memory access, garbage collection etc. The community already recognises these sorts of performance couplings and there are several mechanisms (at a higher level) that can be used to manage the inevitable performance trades.

I strongly believe that adding differential scheduling at this time will be, overall, counterproductive to GHC's RTS development. It will create a new class of performance artefacts that will require resources to investigate. Time will be spent pushing trades around, chasing the proverbial bubble under the wallpaper around and around.

Please don't get me wrong, I am highly appreciative of Edward's efforts. The more overhead we can reduce in the scheduling, the better. If we want to start looking at more complex differential treatment of the allocation of supply (cpu cycles) to demand (task steps), then we need to consider the issues of how that trade interacts with the others, and how the overall trading space is going to manifest itself in terms of end-user goals. We also need to come up with a sensible approach of how to discard computational demand (load shedding) when the "system" becomes "excessively" infeasible to schedule to meet those goals.

Fortunately there are approaches to this (stuff I've been involved with in data networks for many years) - and Haskell's operational has all the right hooks to do this. Interested PhD students, do get in touch....

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Comment (by ezyang):

OK, after a brief break, I've started working on latency benchmarks to help evaluate how much alternative scheduling algorithms help. Of course, the first order of business was to establish some baselines, so I ran some tests on GHC 7.4 first, as well as a bare-bones C version of the application. The tests are here: https://github.com/ezyang/latency ; eventually I'll integrate them with nofib (although it's not obvious how to fit them into the current nofib reporting framework). The basic method is to write a byte onto a Unix domain socket; I'm sure I've botched some instrumentation somewhere, so some eyeballs here would be useful. I can also make pretty graphs if people are interested.

System info:
{{{
[ezyang@hs01 latency]$ uname -a
Linux hs01.scs.stanford.edu 3.5.3-1-ARCH #1 SMP PREEMPT Sun Aug 26 09:14:51 CEST 2012 x86_64 GNU/Linux
}}}

GHC 7.4.2 (as installed by Arch Linux)
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000016 0.000021 0.000064 0.000096 0.000157 0.000175
}}}

GHC 7.4.2 -threaded -N1
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000048 0.000096 0.000191 0.000221 0.000260 0.000301
}}}

GHC 7.4.2 -threaded -N2
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000054 0.000208 0.000264 0.000307 0.000338 0.000370
}}}

GHC 7.7.20130122 (prior to the latest IO manager work)
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000015 0.000045 0.000093 0.000117 0.000153 0.000169
}}}

GHC 7.7.20130122 -threaded
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000044 0.000058 0.000177 0.000219 0.000252 0.000281
}}}

GHC 7.7.20130122 -threaded -N2
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000053 0.000210 0.000258 0.000298 0.000335 0.000376
}}}

I'm building a version of GHC with the latest IO manager work, and we'll see how that changes things...

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:33
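For readers who don't want to dig through the repository, here is a rough, self-contained approximation of the kind of measurement described above: it times single-byte round trips over a Unix domain socketpair and prints latency percentiles in seconds. The real harness at https://github.com/ezyang/latency may differ in its details; the use of a socketpair, the iteration count, and the output format here are all arbitrary choices for the sketch.

{{{
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Control.Concurrent (forkIO)
import Control.Monad (forM, forever, void)
import Data.List (sort)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import Network.Socket (Family (AF_UNIX), SocketType (Stream), socketPair)
import Network.Socket.ByteString (recv, sendAll)
import Text.Printf (printf)

main :: IO ()
main = do
  (a, b) <- socketPair AF_UNIX Stream 0
  -- Echo thread: woken via the scheduler/IO manager whenever a byte arrives.
  void . forkIO . forever $ recv b 1 >>= sendAll b
  samples <- forM [1 .. 100000 :: Int] $ \_ -> do
    t0 <- getCurrentTime
    sendAll a "x"
    _  <- recv a 1
    t1 <- getCurrentTime
    return (realToFrac (diffUTCTime t1 t0) :: Double)
  let sorted = sort samples
      pct p  = sorted !! min (length sorted - 1)
                             (floor (p * fromIntegral (length sorted)))
  mapM_ (\p -> printf "%7.2f%% %9.6f\n" (p * 100) (pct p))
        [0, 0.5, 0.9, 0.99, 0.999, 0.9999 :: Double]
}}}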

Comment (by ezyang):

Aaand the latest numbers! It looks like the new IO manager does a lot better!

GHC 7.7.20130221
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000009 0.000019 0.000059 0.000103 0.000130 0.000926
}}}

GHC 7.7.20130221 -threaded
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000026 0.000046 0.000090 0.000167 0.000220 0.001286
}}}

GHC 7.7.20130221 -threaded -N2
{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000026 0.000055 0.000101 0.000179 0.000253 0.000927
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:34

Comment (by simonmar):

Something is a bit suspicious about those numbers. Without `-threaded` the IO manager is not used at all, yet your second set of numbers is quite different from the first even without `-threaded`. Furthermore, the second set of numbers seems to have a lot more outliers (the 99.99% slot is much higher).

(BTW, I think this is great stuff nonetheless.)

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:35

Comment (by ezyang):

Hmm, it looks like the GHC 7.7.20130122 numbers are nonsense: I used what I thought was a perf build, but it was actually just a validate build, so the libraries and RTS may not have been as optimized as they could be. (I guess I should go re-run the nofib numbers too, since some of those might be suspect.) The 7.4.2 numbers, being distro-built, should still be valid.

Oh, and here are the C numbers, on the same machine. These should correspond roughly to the raw domain socket overhead.

{{{
      0%      50%      90%      99%    99.9%   99.99%
0.000006 0.000014 0.000017 0.000028 0.000062 0.000259
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:36

Comment (by ezyang):

When I attempt to replicate the conditions of #3838 (by removing the thread bump), I do not see any appreciable increase in runtime. This holds even if I explicitly yield inside the atomicModifyIORef (using unsafePerformIO), or if I increase the number of threads being run. My conclusion is that a small pileup does occur, but since there is no longer a global list that we keep traversing, it's not expensive. However, I can't rule out (1) the test case having accumulated enough incidental changes that it no longer tickles the bug, or (2) some other RTS change masking this particular problem. A conundrum!

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:37
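As a point of reference, the sketch below is a rough reconstruction of the kind of stress test described above, not the actual #3838 test case: several threads hammer a shared `IORef` with `atomicModifyIORef`, and the update function forces a `yield` via `unsafePerformIO` so that whichever thread evaluates the pending thunk is descheduled mid-evaluation, letting the other threads pile up on the resulting black hole. The thread and iteration counts are arbitrary; the idea is to compare wall-clock time with and without the promoteInRunQueue bump.

{{{
module Main (main) where

import Control.Concurrent (forkIO, yield)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, replicateM_)
import Data.IORef (IORef, atomicModifyIORef, newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- | An increment that yields in the middle of being forced, widening the
-- window in which the thunk is a black hole.
slowBump :: Int -> Int
slowBump n = unsafePerformIO yield `seq` n + 1
{-# NOINLINE slowBump #-}

worker :: IORef Int -> Int -> IO ()
worker ref iters = replicateM_ iters $ do
  atomicModifyIORef ref (\n -> (slowBump n, ()))
  -- Forcing the current contents is what evaluates the thunk chain, and
  -- hence where threads can block on each other's black holes.
  readIORef ref >>= \n -> n `seq` return ()

main :: IO ()
main = do
  ref   <- newIORef 0
  dones <- forM [1 .. 8 :: Int] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO (worker ref 100000 >> putMVar done ())
    return done
  mapM_ takeMVar dones
  readIORef ref >>= print
}}}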

Comment (by simonmar):

Edward: if I recall correctly, the thread bump was only a minor optimisation; the real fix for #3838 was the redesign of black hole blocking so that we didn't have a global list. So that would agree with your findings. I don't have a lot of evidence that bumping the blockee is a good idea. I was hoping you might find some :-)

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:38

Comment (by ezyang):

Well, bumping the blocker does seem to help a little bit. Here's what happens when I remove the bump:

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          -0.0%     +0.0%     +5.6%     +5.0%     +0.0%
    callback002          -0.0%     +0.0%     +0.3%     +0.9%     +0.0%
           chan          -0.0%     +0.0%     +1.4%     +1.3%     +0.0%
          sieve          -0.0%     +0.0%     +4.1%     +5.1%     +4.2%
     threads001          -0.0%     +0.0%     -1.8%     -1.3%     +0.0%
     threads003          -0.0%     -0.0%     +0.4%     +0.7%     +0.0%
     threads006          -0.0%     +0.0%     +0.0%     +0.6%     +0.0%
     threads007          -0.0%     +0.0%     -1.3%     -0.7%     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -0.0%     -1.8%     -1.3%     +0.0%
            Max          -0.0%     +0.0%     +5.6%     +5.1%     +4.2%
 Geometric Mean          -0.0%     -0.0%     +1.0%     +1.4%     +0.5%
}}}

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:39

Changes (by nicolast):

 * cc: ikke+ghc@… (added)

-- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7606#comment:40