
#7606: Stride scheduling for Haskell threads with priorities
---------------------------------+------------------------------------------
    Reporter:  ezyang            |          Owner:  ezyang
        Type:  feature request   |         Status:  new
    Priority:  normal            |      Milestone:  7.8.1
   Component:  Runtime System    |        Version:  7.7
    Keywords:                    |             Os:  Unknown/Multiple
Architecture:  Unknown/Multiple  |        Failure:  None/Unknown
  Difficulty:  Unknown           |       Testcase:
   Blockedby:                    |       Blocking:
     Related:                    |
---------------------------------+------------------------------------------

Comment(by ezyang):

Summary: SMP benchmarks show massive degradation of performance, so
probably one of the performance bugs Simon Marlow pointed out is really
killing us and needs to be fixed.

Replying to [comment:1 simonmar]:
> * First, nofib has zero concurrency benchmarks in it, so the results
>   aren't very meaningful. Still, you seem to have some outliers in
>   there - any idea why? nofib/smp has some concurrency benchmarks that
>   I normally use when making changes to the scheduler.
Oops, I forgot that normal nofib doesn't do smp. The SMP results don't
look so good, and it seems likely that some of the performance problems
you've noted below are contributing. I'll investigate more
thoroughly...

{{{
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
    callback001          +0.1%     +0.0%     +7.7%     +7.6%     +0.0%
    callback002          +0.1%     +0.0%    +10.7%    +10.7%     +0.0%
           chan          +0.1%     +0.0%    -29.0%    -29.1%    -71.0%
          sieve          +0.1%     +0.0%   +141.6%   +142.0%    +89.3%
     threads001          +0.1%     +0.0%    +12.7%    +12.6%     +0.0%
     threads003          +0.1%     +9.1%    +44.7%    +44.6%    +25.8%
     threads006          +0.1%     +0.0%    +30.3%    +30.7%    +46.6%
     threads007          +0.1%     -0.2%    +17.0%    +17.0%     -6.8%
--------------------------------------------------------------------------------
            Min          +0.1%     -0.2%    -29.0%    -29.1%    -71.0%
            Max          +0.1%     +9.1%   +141.6%   +142.0%    +89.3%
 Geometric Mean          +0.1%     +1.1%    +22.5%    +22.5%     -0.7%
}}}
> * Run all the tests in `testsuite/concurrent`. They aren't run by
>   default in a validate.

Yup, I've been running a full testsuite run and will be checking those
results today.

> * The patch adds no less than 4 fields to the TSO struct, which is
>   quite a lot. We should benchmark carefully to make sure this isn't a
>   problem.

I can reduce this by one by paying the cost of a division every time
we're scheduled (since stride = STRIDE1/tickets, always); we can reduce
it by one more by observing that pass and remain are never both set at
the same time and turning them into a union. I'm not sure which
benchmarks in particular would start failing due to the increase in
size, other than just general badness.
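To make those two savings concrete, here is a minimal sketch of the
reduced layout. The field and type names are assumptions for
illustration, not the actual patch; STRIDE1 is the usual fixed-point
constant from the stride-scheduling scheme.

{{{
#include <stdint.h>

#define STRIDE1 (1u << 20)  /* fixed-point scale: stride = STRIDE1/tickets */

struct tso_sched {
    uint32_t tickets;       /* this thread's share of the CPU (>= 1) */
    union {                 /* pass and remain are never both set,   */
        uint64_t pass;      /* so they can share storage: pass is    */
        uint64_t remain;    /* live on the run queue, remain while   */
    } u;                    /* the thread is blocked                 */
    /* stride itself is no longer stored; recompute it on demand: */
};

/* One integer division per scheduling decision replaces the field. */
static inline uint64_t stride_of(const struct tso_sched *t)
{
    return STRIDE1 / t->tickets;
}
}}}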
> * Why stride scheduling rather than something else, e.g. Linux's CFS?
>   Wouldn't CFS address the issue of giving more time to blocked
>   threads? What about inaccuracies due to the time it takes for a
>   thread to respond to the timer signal?
Here's the secret: Linux's CFS is stride scheduling! The primary
differences are the backing data structure (they use a red-black tree,
while we're using a simple heap) and how timeslices are measured (Linux
counts nanoseconds; we stupidly just count each jump into the scheduler
as one quantum). It is possible to implement CFS-style sleeper fairness
by modifying what happens to a thread's pass when it sleeps. I'll post a
version with those changes.
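For reference, a minimal sketch of the shared core and the
sleeper-fairness tweak just described. Names like charge_quantum,
on_wakeup, and global_pass are assumptions for illustration; global_pass
plays the role of CFS's min_vruntime.

{{{
#include <stdint.h>

struct thread {
    uint64_t pass;    /* virtual time; the min-pass thread runs next   */
    uint64_t stride;  /* STRIDE1/tickets: more tickets, smaller stride */
};

static uint64_t global_pass;  /* pass of the most recently run thread */

/* Charge one quantum: a thread with many tickets has a small stride,
 * falls behind the others slowly, and is therefore picked more often. */
static void charge_quantum(struct thread *t)
{
    t->pass += t->stride;
    global_pass = t->pass;
}

/* Sleeper fairness: a thread that slept for a long time would otherwise
 * keep its stale (tiny) pass and monopolize the CPU on wakeup, so clamp
 * its pass forward to the current global pass when it unblocks. */
static void on_wakeup(struct thread *t)
{
    if (t->pass < global_pass)
        t->pass = global_pass;
}
}}}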
> * Currently, if there's a very long run queue, a minor GC won't have
>   to traverse a lot of it because it will be in the old generation,
>   but here the GC has to look at every element of the priority queue.
>   We should make a benchmark with a long run queue and see if that's
>   an issue.

Could be a problem. If we switched to a tree data structure, we could
probably recover this behavior (the GC would only have to traverse
O(log n) nodes).
> * `promoteInRunQueue`: it would be a shame to lose this, but as long
>   as we're sure that #3838 hasn't regressed it should be ok.
There is no particular reason why we can't implement this, and if we switch to a red-black tree it gets easier.
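For illustration, promotion can be expressed directly in terms of
passes; the queue_* helpers below are assumed, not real RTS functions.

{{{
#include <stdint.h>

struct thread { uint64_t pass; };
struct run_queue;                                          /* opaque  */

uint64_t queue_min_pass(struct run_queue *q);              /* assumed */
void queue_remove(struct run_queue *q, struct thread *t);  /* assumed */
void queue_insert(struct run_queue *q, struct thread *t);  /* assumed */

/* Promote t to run next by giving it the smallest pass in the queue.
 * Delete + reinsert is O(log n) for both a binary heap and a red-black
 * tree, but the tree also supports ordered traversal and in-place
 * neighbour queries, which makes this kind of surgery easier. */
static void promote_in_run_queue(struct run_queue *q, struct thread *t)
{
    uint64_t min_pass = queue_min_pass(q);
    queue_remove(q, t);
    if (t->pass > min_pass)
        t->pass = min_pass;
    queue_insert(q, t);
}
}}}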
> * could we have some more docs on the scheduler functions,
>   `capPassUpdate` etc. please?
OK.
> * Looks like we should do `annulTSO` in a better way. These fields
>   should be empty normally, set when a thread is blocked, and unset
>   again when the thread is unblocked.
OK.
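For concreteness, a sketch of the suggested discipline, with assumed
names: the saved offset is set on block and cleared again on unblock, so
no wholesale `annulTSO` reset is needed.

{{{
#include <stdint.h>

struct thread {
    uint64_t pass;    /* valid while the thread is runnable        */
    uint64_t remain;  /* nonzero only while the thread is blocked  */
};

static void on_block(struct thread *t, uint64_t global_pass)
{
    /* set when blocked: how far ahead of the queue the thread was */
    t->remain = t->pass - global_pass;
}

static void on_unblock(struct thread *t, uint64_t global_pass)
{
    /* unset when unblocked: restore pass, clear the field again */
    t->pass   = global_pass + t->remain;
    t->remain = 0;
}
}}}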
> * I don't see why `sched_mutex` is needed in `setTickets` and co.
>   Could you explain?
setTickets and co. can be called concurrently from several OS threads,
and each call involves both a read and a write of scheduler state, so
the access needs to be protected by a lock.
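For illustration, the locking discipline being described, with assumed
names (a pthread mutex stands in for the RTS's sched_mutex):

{{{
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t sched_mutex = PTHREAD_MUTEX_INITIALIZER;

struct thread { uint32_t tickets; };

/* Readers and writers can race from different OS threads, so both
 * sides of the read/write pair take the scheduler lock. */
static void set_tickets(struct thread *t, uint32_t n)
{
    pthread_mutex_lock(&sched_mutex);
    t->tickets = n;   /* the scheduler may be reading this right now */
    pthread_mutex_unlock(&sched_mutex);
}

static uint32_t get_tickets(struct thread *t)
{
    pthread_mutex_lock(&sched_mutex);
    uint32_t n = t->tickets;
    pthread_mutex_unlock(&sched_mutex);
    return n;
}
}}}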