
#7606: Stride scheduling for Haskell threads with priorities
---------------------------------+------------------------------------------
    Reporter:  ezyang            |          Owner:  ezyang
        Type:  feature request   |         Status:  new
    Priority:  normal            |      Milestone:  7.8.1
   Component:  Runtime System    |        Version:  7.7
    Keywords:                    |             Os:  Unknown/Multiple
Architecture:  Unknown/Multiple  |        Failure:  None/Unknown
  Difficulty:  Unknown           |       Testcase:
   Blockedby:                    |       Blocking:
     Related:                    |
---------------------------------+------------------------------------------

Comment(by ezyang):

Summary: SMP benchmarks show massive degradation of performance, so
probably one of the performance bugs Simon Marlow pointed out is really
killing us and needs to be fixed.

Replying to [comment:1 simonmar]:
> * First, nofib has zero concurrency benchmarks in it, so the results
>   aren't very meaningful. Still, you seem to have some outliers in
>   there - any idea why? nofib/smp has some concurrency benchmarks that
>   I normally use when making changes to the scheduler.
Oops, I forgot that normal nofib doesn't do smp. The SMP results don't
look so good, and it seems likely that some of the performance problems
you've noted below are contributing. I'll investigate more
thoroughly...

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.1%     +0.0%     +7.7%     +7.6%     +0.0%
    callback002          +0.1%     +0.0%    +10.7%    +10.7%     +0.0%
           chan          +0.1%     +0.0%    -29.0%    -29.1%    -71.0%
          sieve          +0.1%     +0.0%   +141.6%   +142.0%    +89.3%
     threads001          +0.1%     +0.0%    +12.7%    +12.6%     +0.0%
     threads003          +0.1%     +9.1%    +44.7%    +44.6%    +25.8%
     threads006          +0.1%     +0.0%    +30.3%    +30.7%    +46.6%
     threads007          +0.1%     -0.2%    +17.0%    +17.0%     -6.8%
--------------------------------------------------------------------------------
            Min          +0.1%     -0.2%    -29.0%    -29.1%    -71.0%
            Max          +0.1%     +9.1%   +141.6%   +142.0%    +89.3%
 Geometric Mean          +0.1%     +1.1%    +22.5%    +22.5%     -0.7%
}}}
> * Run all the tests in `testsuite/concurrent`. They aren't run by
>   default in a validate.

Yup, I've been running a full testsuite run and will be checking those
results today.

> * The patch adds no less than 4 fields to the TSO struct, which is
>   quite a lot. We should benchmark carefully to make sure this isn't a
>   problem.

I can reduce this by one by paying the cost of a division every time
we're scheduled (since stride = STRIDE1/tickets, always); we can reduce
it by one more by observing that pass and remain are never both set at
the same time and turning them into a union. I'm not sure which
benchmarks in particular would start failing due to the increase in
size, other than just general badness.
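To make those two savings concrete, here is a minimal sketch of the
reduced layout. The field and type names are assumptions for
illustration, not the actual patch; STRIDE1 is the usual fixed-point
constant from the stride-scheduling scheme.

{{{
#include <stdint.h>

#define STRIDE1 (1u << 20)  /* fixed-point scale: stride = STRIDE1/tickets */

struct tso_sched {
    uint32_t tickets;       /* this thread's share of the CPU (>= 1) */
    union {                 /* pass and remain are never both set,   */
        uint64_t pass;      /* so they can share storage: pass is    */
        uint64_t remain;    /* live on the run queue, remain while   */
    } u;                    /* the thread is blocked                 */
    /* stride itself is no longer stored; recompute it on demand: */
};

/* One integer division per scheduling decision replaces the field. */
static inline uint64_t stride_of(const struct tso_sched *t)
{
    return STRIDE1 / t->tickets;
}
}}}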
> * Why stride scheduling rather than something else, e.g. Linux's CFS?
>   Wouldn't CFS address the issue of giving more time to blocked
>   threads? What about inaccuracies due to the time it takes for a
>   thread to respond to the timer signal?
Here's the secret: Linux's CFS is stride scheduling! The primary
differences are the backing data structure (they use a red-black tree,
while we're using a simple heap) and how timeslices are measured (Linux
counts nanoseconds; we stupidly just count each jump into the scheduler
as one quantum). It is possible to implement CFS-style sleeper fairness
by modifying what happens to a thread's pass when it sleeps. I'll post a
version with those changes.
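For reference, a minimal sketch of the shared core and the
sleeper-fairness tweak just described. Names like charge_quantum,
on_wakeup, and global_pass are assumptions for illustration; global_pass
plays the role of CFS's min_vruntime.

{{{
#include <stdint.h>

struct thread {
    uint64_t pass;    /* virtual time; the min-pass thread runs next   */
    uint64_t stride;  /* STRIDE1/tickets: more tickets, smaller stride */
};

static uint64_t global_pass;  /* pass of the most recently run thread */

/* Charge one quantum: a thread with many tickets has a small stride,
 * falls behind the others slowly, and is therefore picked more often. */
static void charge_quantum(struct thread *t)
{
    t->pass += t->stride;
    global_pass = t->pass;
}

/* Sleeper fairness: a thread that slept for a long time would otherwise
 * keep its stale (tiny) pass and monopolize the CPU on wakeup, so clamp
 * its pass forward to the current global pass when it unblocks. */
static void on_wakeup(struct thread *t)
{
    if (t->pass < global_pass)
        t->pass = global_pass;
}
}}}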
> * Currently, if there's a very long run queue, a minor GC won't have
>   to traverse a lot of it because it will be in the old generation,
>   but here the GC has to look at every element of the priority queue.
>   We should make a benchmark with a long run queue and see if that's
>   an issue.

Could be a problem. If we switched to a tree data structure, we could
probably recover this behavior (the GC would only have to traverse
O(log n) nodes).
> * `promoteInRunQueue`: it would be a shame to lose this, but as long
>   as we're sure that #3838 hasn't regressed it should be ok.
There is no particular reason why we can't implement this, and if we switch to a red-black tree it gets easier.
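For illustration, promotion can be expressed directly in terms of
passes; the queue_* helpers below are assumed, not real RTS functions.

{{{
#include <stdint.h>

struct thread { uint64_t pass; };
struct run_queue;                                          /* opaque  */

uint64_t queue_min_pass(struct run_queue *q);              /* assumed */
void queue_remove(struct run_queue *q, struct thread *t);  /* assumed */
void queue_insert(struct run_queue *q, struct thread *t);  /* assumed */

/* Promote t to run next by giving it the smallest pass in the queue.
 * Delete + reinsert is O(log n) for both a binary heap and a red-black
 * tree, but the tree also supports ordered traversal and in-place
 * neighbour queries, which makes this kind of surgery easier. */
static void promote_in_run_queue(struct run_queue *q, struct thread *t)
{
    uint64_t min_pass = queue_min_pass(q);
    queue_remove(q, t);
    if (t->pass > min_pass)
        t->pass = min_pass;
    queue_insert(q, t);
}
}}}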
> * could we have some more docs on the scheduler functions,
>   `capPassUpdate` etc. please?
OK.
> * Looks like we should do `annulTSO` in a better way. These fields
>   should be empty normally, set when a thread is blocked, and unset
>   again when the thread is unblocked.
OK.
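For concreteness, a sketch of the suggested discipline, with assumed
names: the saved offset is set on block and cleared again on unblock, so
no wholesale `annulTSO` reset is needed.

{{{
#include <stdint.h>

struct thread {
    uint64_t pass;    /* valid while the thread is runnable        */
    uint64_t remain;  /* nonzero only while the thread is blocked  */
};

static void on_block(struct thread *t, uint64_t global_pass)
{
    /* set when blocked: how far ahead of the queue the thread was */
    t->remain = t->pass - global_pass;
}

static void on_unblock(struct thread *t, uint64_t global_pass)
{
    /* unset when unblocked: restore pass, clear the field again */
    t->pass   = global_pass + t->remain;
    t->remain = 0;
}
}}}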
> * I don't see why `sched_mutex` is needed in `setTickets` and co.
>   Could you explain?
setTickets and co. can be called concurrently from several OS threads,
and each call involves both a read and a write of scheduler state, so
the access needs to be protected by a lock.
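For illustration, the locking discipline being described, with assumed
names (a pthread mutex stands in for the RTS's sched_mutex):

{{{
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t sched_mutex = PTHREAD_MUTEX_INITIALIZER;

struct thread { uint32_t tickets; };

/* Readers and writers can race from different OS threads, so both
 * sides of the read/write pair take the scheduler lock. */
static void set_tickets(struct thread *t, uint32_t n)
{
    pthread_mutex_lock(&sched_mutex);
    t->tickets = n;   /* the scheduler may be reading this right now */
    pthread_mutex_unlock(&sched_mutex);
}

static uint32_t get_tickets(struct thread *t)
{
    pthread_mutex_lock(&sched_mutex);
    uint32_t n = t->tickets;
    pthread_mutex_unlock(&sched_mutex);
    return n;
}
}}}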