Re: [GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine

23 Aug 2016

      #9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Compile-time      |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #910, #8224       |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

 Replying to [comment:54 simonmar]:
...
@slyfox the GC threads all spin in this `any_work()` loop looking for
 work they can steal from other threads.  It's how the GC work gets
 distributed amongst the available worker threads.
Aha, thanks! I've read through
 http://community.haskell.org/~simonmar/papers/parallel-gc.pdf
 The paper does not look outdated.

 The crucial detail: work stealing is enabled by default for gen=1 and
 upper.
 As default nursery is tiny we don't do stealing from it. That's why I see
 poor
 GC parallelism on large nurseries for this compilation workload:

 [ The numbers below are for 4-CPU system, not 8-CPU i used to post before.
 I'll redo tests in a few days
 once I get access to it. ]

 -qb1/-qb0 comparison:

 -qb1 (default):
 {{{
 $ time ./mk.bash -j4 +RTS -N4 -RTS +RTS -sstderr -A768M -RTS -O0 -j4 +RTS
 -l

   55,162,983,632 bytes allocated in the heap
      639,999,688 bytes copied during GC
       88,807,552 bytes maximum residency (2 sample(s))
        4,100,072 bytes maximum slop
             3379 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        20 colls,    20 par    5.617s   1.582s     0.0791s
 0.1410s
   Gen  1         2 colls,     1 par    0.227s   0.081s     0.0405s
 0.0500s

   Parallel GC work balance: 22.34% (serial 0%, perfect 100%)

   TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

   SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

   INIT    time    0.027s  (  0.026s elapsed)
   MUT     time   59.245s  ( 18.963s elapsed)
   GC      time    5.843s  (  1.663s elapsed)
   EXIT    time    0.032s  (  0.034s elapsed)
   Total   time   65.174s  ( 20.686s elapsed)

   Alloc rate    931,102,797 bytes per MUT second

   Productivity  91.0% of total user, 91.8% of total elapsed

 gc_alloc_block_sync: spin=10830; yield=5
 whitehole_spin: 0
 gen[0].sync: spin=6297; yield=5
 gen[1].sync: spin=4240; yield=3

 real    0m20.863s
 user    1m5.752s
 sys     0m4.724s
 }}}

 -qb0:
 {{{
   55,154,416,472 bytes allocated in the heap
      840,077,056 bytes copied during GC
      135,018,976 bytes maximum residency (3 sample(s))
        5,739,552 bytes maximum slop
             2375 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        31 colls,    31 par    3.078s   0.859s     0.0277s
 0.0682s
   Gen  1         3 colls,     2 par    0.607s   0.177s     0.0591s
 0.0947s

   Parallel GC work balance: 74.48% (serial 0%, perfect 100%)

   TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

   SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

   INIT    time    0.023s  (  0.023s elapsed)
   MUT     time   59.476s  ( 18.325s elapsed)
   GC      time    3.685s  (  1.036s elapsed)
   EXIT    time    0.027s  (  0.028s elapsed)
   Total   time   63.244s  ( 19.412s elapsed)

   Alloc rate    927,345,481 bytes per MUT second

   Productivity  94.1% of total user, 94.5% of total elapsed

 gc_alloc_block_sync: spin=58847; yield=38
 whitehole_spin: 0
 gen[0].sync: spin=126024; yield=104
 gen[1].sync: spin=34648; yield=31

 real    0m19.575s
 user    1m6.804s
 sys     0m1.760s
 }}}

 Note how GC parallelism increases from ~20% to ~75% for this huge nursery
 of 768M.

 Something hinders mutator's parallelism still. threadscope says threads
 are usually
 blocked on MVars for long periods of time (order of 0.1-0.2 seconds).
 Perhaps those wait for external gcc/gas to assemble modules.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:55
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler