
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |   Unknown/Multiple
 Type of failure:  Compile-time      |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:  #910, #8224       |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

I've experimented a bit more with trying to pin down where the slowdown
comes from. Some observations:

Observation 1. -j <K> not only allows <K> modules to be compiled at the
same time, but also enables:

- <K> Capabilities
- <K> garbage collection threads

I've locally removed the Capability adjustment from -j handling and used
-j <K> +RTS -N instead. With that change, performance does not degrade as
badly with increasing K. That makes sense: the GC OS threads no longer
fight over the same cache. It would be nice if '''+RTS -N''' took
precedence over the -j option.

Observation 2. [Warning: I have no idea how parallel GC works.] The more
GC threads we have, the higher the chance that one of them will finish
scanning its part of the heap and sit in a sched_yield() loop on a free
core while the main GC thread waits for the other threads to complete
their useful work.

I found this out by changing yieldThread() to print its caller. The vast
majority of calls come from any_work():

{{{
static rtsBool any_work (void)
{
    int g;
    gen_workspace *ws;

    gct->any_work++;

    write_barrier();

    // scavenge objects in compacted generation
    if (mark_stack_bd != NULL && !mark_stack_empty()) {
        return rtsTrue;
    }

    // Check for global work in any gen.  We don't need to check for
    // local work, because we have already exited scavenge_loop(),
    // which means there is no local work for this thread.
    for (g = 0; g < (int)RtsFlags.GcFlags.generations; g++) {
        ws = &gct->gens[g];
        if (ws->todo_large_objects) return rtsTrue;
        if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
        if (ws->todo_overflow) return rtsTrue;
    }

#if defined(THREADED_RTS)
    if (work_stealing) {
        uint32_t n;
        // look for work to steal
        for (n = 0; n < n_gc_threads; n++) {
            if (n == gct->thread_index) continue;
            for (g = RtsFlags.GcFlags.generations-1; g >= 0; g--) {
                ws = &gc_threads[n]->gens[g];
                if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
            }
        }
    }
#endif

    gct->no_work++;
#if defined(THREADED_RTS)
    yieldThread("any_work");
#endif

    return rtsFalse;
}
}}}

I need to dig more into how the parallel GC traverses the heap to
understand how much of a problem this is.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9221#comment:53>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler