
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                   Owner:
            Type:  bug               |                  Status:  new
        Priority:  normal            |               Milestone:  8.2.1
       Component:  Compiler          |                 Version:  7.8.2
      Resolution:                    |                Keywords:
Operating System:  Unknown/Multiple  |            Architecture:
                                     |  Unknown/Multiple
 Type of failure:  Compile-time      |               Test Case:
  performance bug                    |
      Blocked By:                    |                Blocking:
 Related Tickets:  #910, #8224       |     Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

[warning: not a NUMA expert]

Tl;DR: I think it depends on what exactly we hit as a bottleneck. I have a
suspicion that we saturate RAM bandwidth, not the CPUs' ability to retire
instructions across hyperthreads. Basically, GHC does too many non-local
references, and one of the ways to speed GHC up is to either increase
memory locality or decrease heap usage.

Long version:

For a while I have tried to figure out why exactly I don't see perfect
scaling of '''ghc --make''' on my box. It's easy to see/compare with the
'''synth.bash +RTS -A256M -RTS''' benchmark run with '''-j1''' / '''-j'''
options.

I don't have hard evidence, but I suspect the bottleneck is not the
execution engines of the hyperthreads/real cores but the RAM bandwidth
limit on the CPU-to-memory path. One of the hints is '''perf stat''':

{{{
$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,mem-loads,mem-stores \
    ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS

 Performance counter stats for './synth.bash -j +RTS -sstderr -A256M -qb0 -RTS':

     3 248 577 545      cache-references                                   (28,64%)
       740 590 736      cache-misses       # 22,797 % of all cache refs    (42,93%)
   390 025 361 812      cycles                                             (57,18%)
   171 496 925 132      instructions       #  0,44  insn per cycle         (71,45%)
    33 736 976 296      branches                                           (71,47%)
         1 061 039      faults
             1 524      migrations
            67 895      mem-loads                                          (71,42%)
    27 652 025 890      mem-stores                                         (14,27%)

      15,131608490 seconds time elapsed
}}}

22% of all cache refs are misses. A huge number. I think it dominates
performance, but I have no hard evidence :) Assuming a memory access is
~100 times slower than a CPU cache access, a 22% miss rate means the
average reference costs roughly 0.78*1 + 0.22*100 ~= 23 cache-hit times,
so most of the time goes to waiting on RAM rather than retiring
instructions.

I have 4 cores with 2 hyperthreads each and get the best performance from
-j8, not -j4 as one would expect if the bottleneck were the hyperthreads
sharing each core's instruction-retirement resources:

-j1: 55s; -j4: 18s; -j6: 15s; -j8: 14.2s; -j10: 15.0s
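For reference, a minimal sketch of the sweep behind those numbers. It
assumes '''synth.bash''' forwards its arguments to '''ghc --make''' (as the
-j1/-j invocations above suggest) and that GNU time is installed:

{{{
#!/bin/bash
# Sketch only: time the same benchmark at several -jN levels.
# Assumes ./synth.bash passes its arguments through to 'ghc --make'.
for j in 1 4 6 8 10; do
    echo "=== -j$j ==="
    /usr/bin/time -f "%e s elapsed" \
        ./synth.bash -j$j +RTS -A256M -qb0 -RTS >/dev/null
done
}}}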
{{{
$ ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS

  66,769,724,456 bytes allocated in the heap
   1,658,350,288 bytes copied during GC
     127,385,728 bytes maximum residency (5 sample(s))
       1,722,080 bytes maximum slop
            2389 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        31 colls,    31 par    6.535s   0.831s     0.0268s    0.0579s
  Gen  1         5 colls,     4 par    1.677s   0.225s     0.0449s    0.0687s

  Parallel GC work balance: 80.03% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   87.599s  ( 12.868s elapsed)
  GC      time    8.212s  (  1.056s elapsed)
  EXIT    time    0.013s  (  0.015s elapsed)
  Total   time   95.841s  ( 13.942s elapsed)

  Alloc rate    762,222,437 bytes per MUT second

  Productivity  91.4% of total user, 92.4% of total elapsed

gc_alloc_block_sync: 83395
whitehole_spin: 0
gen[0].sync: 280927
gen[1].sync: 134537

real    0m14.070s
user    1m44.835s
sys     0m2.899s
}}}

I've noticed that building GHC with '''-fno-worker-wrapper -fno-spec-constr'''
makes GHC 4% faster at -j8 (memory allocation is 7% lower; bug #11565 is
likely at fault), which also hints at memory throughput as the bottleneck.

The conclusion: AFAIU, to make the most of GHC we should aim for just enough
active threads to saturate all the memory I/O channels the machine has (but
not many more). '''perf bench mem all''' suggests RAM bandwidth is in the
range of 2-32GB/s depending on how bad the workload is. I would assume GHC's
workload is very non-linear (and thus bad).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:61
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler