
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |   Unknown/Multiple
 Type of failure:  Compile-time      |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:  #910, #8224       |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

I've experimented a bit more with trying to pin down where the slowdown
comes from. Some observations:

Observation 1. -j <K> not only allows <K> modules to be compiled at the
same time, but also enables:

- <K> Capabilities
- <K> garbage collection threads

I've locally removed the Capability adjustment from -j handling and used
-j <K> +RTS -N instead. With that change, performance does not degrade as
badly with increasing K. That makes sense: the GC OS threads no longer
fight over the same cache. It would be nice if '''+RTS -N''' took
precedence over the -j option.

Observation 2. [Warning: I have no idea how parallel GC works.] The more
GC threads we have, the higher the chance that one of them will finish
scanning its part of the heap and sit in a sched_yield() loop on a free
core while the main GC thread waits for the other threads to complete
their useful work.

I found this out by changing yieldThread() to print its caller. The vast
majority of calls come from any_work():

{{{
static rtsBool any_work (void)
{
    int g;
    gen_workspace *ws;

    gct->any_work++;

    write_barrier();

    // scavenge objects in compacted generation
    if (mark_stack_bd != NULL && !mark_stack_empty()) {
        return rtsTrue;
    }

    // Check for global work in any gen.  We don't need to check for
    // local work, because we have already exited scavenge_loop(),
    // which means there is no local work for this thread.
    for (g = 0; g < (int)RtsFlags.GcFlags.generations; g++) {
        ws = &gct->gens[g];
        if (ws->todo_large_objects) return rtsTrue;
        if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
        if (ws->todo_overflow) return rtsTrue;
    }

#if defined(THREADED_RTS)
    if (work_stealing) {
        uint32_t n;
        // look for work to steal
        for (n = 0; n < n_gc_threads; n++) {
            if (n == gct->thread_index) continue;
            for (g = RtsFlags.GcFlags.generations-1; g >= 0; g--) {
                ws = &gc_threads[n]->gens[g];
                if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
            }
        }
    }
#endif

    gct->no_work++;
#if defined(THREADED_RTS)
    yieldThread("any_work");
#endif

    return rtsFalse;
}
}}}

I need to dig more into how the parallel GC traverses the heap to
understand how much of a problem this is.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9221#comment:53>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler