
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): Replying to [comment:54 simonmar]:
@slyfox the GC threads all spin in this `any_work()` loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads.
Aha, thanks! I've read through http://community.haskell.org/~simonmar/papers/parallel-gc.pdf The paper does not look outdated. The crucial detail: work stealing is enabled by default for gen=1 and upper. As default nursery is tiny we don't do stealing from it. That's why I see poor GC parallelism on large nurseries for this compilation workload: [ The numbers below are for 4-CPU system, not 8-CPU i used to post before. I'll redo tests in a few days once I get access to it. ] -qb1/-qb0 comparison: -qb1 (default): {{{ $ time ./mk.bash -j4 +RTS -N4 -RTS +RTS -sstderr -A768M -RTS -O0 -j4 +RTS -l 55,162,983,632 bytes allocated in the heap 639,999,688 bytes copied during GC 88,807,552 bytes maximum residency (2 sample(s)) 4,100,072 bytes maximum slop 3379 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 20 colls, 20 par 5.617s 1.582s 0.0791s 0.1410s Gen 1 2 colls, 1 par 0.227s 0.081s 0.0405s 0.0500s Parallel GC work balance: 22.34% (serial 0%, perfect 100%) TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.027s ( 0.026s elapsed) MUT time 59.245s ( 18.963s elapsed) GC time 5.843s ( 1.663s elapsed) EXIT time 0.032s ( 0.034s elapsed) Total time 65.174s ( 20.686s elapsed) Alloc rate 931,102,797 bytes per MUT second Productivity 91.0% of total user, 91.8% of total elapsed gc_alloc_block_sync: spin=10830; yield=5 whitehole_spin: 0 gen[0].sync: spin=6297; yield=5 gen[1].sync: spin=4240; yield=3 real 0m20.863s user 1m5.752s sys 0m4.724s }}} -qb0: {{{ 55,154,416,472 bytes allocated in the heap 840,077,056 bytes copied during GC 135,018,976 bytes maximum residency (3 sample(s)) 5,739,552 bytes maximum slop 2375 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 31 colls, 31 par 3.078s 0.859s 0.0277s 0.0682s Gen 1 3 colls, 2 par 0.607s 0.177s 0.0591s 0.0947s Parallel GC work balance: 74.48% (serial 0%, perfect 100%) TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.023s ( 0.023s elapsed) MUT time 59.476s ( 18.325s elapsed) GC time 3.685s ( 1.036s elapsed) EXIT time 0.027s ( 0.028s elapsed) Total time 63.244s ( 19.412s elapsed) Alloc rate 927,345,481 bytes per MUT second Productivity 94.1% of total user, 94.5% of total elapsed gc_alloc_block_sync: spin=58847; yield=38 whitehole_spin: 0 gen[0].sync: spin=126024; yield=104 gen[1].sync: spin=34648; yield=31 real 0m19.575s user 1m6.804s sys 0m1.760s }}} Note how GC parallelism increases from ~20% to ~75% for this huge nursery of 768M. Something hinders mutator's parallelism still. threadscope says threads are usually blocked on MVars for long periods of time (order of 0.1-0.2 seconds). Perhaps those wait for external gcc/gas to assemble modules. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:55 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler