Re: [GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine

31 Aug 2016

      #9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Compile-time      |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #910, #8224       |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

 24-core VM (lstopo below reports 12 cores with 2 PUs each, i.e. 24
 hardware threads).

 CPU topology:
 {{{
 $ lstopo-no-graphics
 Machine (118GB)
   Package L#0 + L3 L#0 (30MB)
     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
       PU L#0 (P#0)
       PU L#1 (P#1)
     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
       PU L#2 (P#2)
       PU L#3 (P#3)
     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
       PU L#4 (P#4)
       PU L#5 (P#5)
     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
       PU L#6 (P#6)
       PU L#7 (P#7)
     L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
       PU L#8 (P#8)
       PU L#9 (P#9)
     L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
       PU L#10 (P#10)
       PU L#11 (P#11)
     L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
       PU L#12 (P#12)
       PU L#13 (P#13)
     L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
       PU L#14 (P#14)
       PU L#15 (P#15)
     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
       PU L#16 (P#16)
       PU L#17 (P#17)
     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
       PU L#18 (P#18)
       PU L#19 (P#19)
     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
       PU L#20 (P#20)
       PU L#21 (P#21)
     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
       PU L#22 (P#22)
       PU L#23 (P#23)

 $ numactl -H
 available: 1 nodes (0)
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 node 0 size: 120881 MB
 node 0 free: 120192 MB
 node distances:
 node   0
   0:  10
 }}}

 (I would not trust the numactl output.)

 Separate processes:

 {{{
 $ make clean; time make -j1

 real    1m33.147s
 user    1m20.836s
 sys     0m11.556s

 $ make clean; time make -j10

 real    0m11.275s
 user    1m29.800s
 sys     0m12.856s

 $ make clean; time make -j12

 real    0m10.537s
 user    1m36.276s
 sys     0m16.948s

 $ make clean; time make -j14

 real    0m9.117s
 user    1m39.132s
 sys     0m18.332s

 $ make clean; time make -j20

 real    0m8.498s
 user    2m7.064s
 sys     0m17.912s

 $ make clean; time make -j22

 real    0m7.468s
 user    2m9.808s
 sys     0m18.592s

 $ make clean; time make -j24

 real    0m7.336s
 user    2m15.936s
 sys     0m19.004s

 $ make clean; time make -j26

 real    0m7.433s
 user    2m17.612s
 sys     0m19.648s

 $ make clean; time make -j28

 real    0m7.554s
 user    2m17.760s
 sys     0m19.564s

 $ make clean; time make -j30

 real    0m7.563s
 user    2m16.776s
 sys     0m21.104s

 }}}

 The numbers vary slightly from run to run, but the gist is that the best
 performance is around -j24, not -j12.
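
 To make the scaling easier to read, here is a small sketch that turns the
 "real" times above into speedup relative to -j1 and a per-job efficiency
 figure. The helper names (speedups, best_jobs) are made up for this
 illustration; the timings are copied from the runs above, and since they
 vary from run to run the output is only indicative.

```python
# Wall-clock ("real") times in seconds, copied from the make -jN runs above.
make_real = {
    1: 93.147, 10: 11.275, 12: 10.537, 14: 9.117, 20: 8.498,
    22: 7.468, 24: 7.336, 26: 7.433, 28: 7.554, 30: 7.563,
}


def speedups(real_by_jobs):
    """Speedup of each run relative to the -j1 run (hypothetical helper)."""
    base = real_by_jobs[1]
    return {jobs: base / real for jobs, real in real_by_jobs.items()}


def best_jobs(real_by_jobs):
    """The -j value with the smallest wall-clock time (hypothetical helper)."""
    return min(real_by_jobs, key=real_by_jobs.get)


if __name__ == "__main__":
    for jobs, s in sorted(speedups(make_real).items()):
        # Efficiency is speedup divided by the number of jobs requested.
        print(f"-j{jobs:<2}  speedup {s:5.2f}x  efficiency {s / jobs:4.2f}")
    print("fastest run: -j%d" % best_jobs(make_real))
```

 On these numbers the fastest run is -j24, at roughly 12.7x over -j1.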

 Single process:

 {{{
 $ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS

 real    1m15.214s
 user    1m14.060s
 sys     0m0.984s

 $ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS

 real    0m11.275s
 user    1m21.708s
 sys     0m2.912s

 $ ./synth.bash -j10 +RTS -sstderr -A256M -qb0 -RTS

 real    0m10.279s
 user    1m25.184s
 sys     0m3.664s

 $ ./synth.bash -j12 +RTS -sstderr -A256M -qb0 -RTS

 real    0m9.605s
 user    1m32.688s
 sys     0m4.292s

 $ ./synth.bash -j14 +RTS -sstderr -A256M -qb0 -RTS

 real    0m9.144s
 user    1m40.288s
 sys     0m4.964s

 $ ./synth.bash -j16 +RTS -sstderr -A256M -qb0 -RTS

 real    0m10.003s
 user    1m51.916s
 sys     0m6.604s

 $ ./synth.bash -j20 +RTS -sstderr -A256M -qb0 -RTS

 real    0m10.215s
 user    2m7.924s
 sys     0m8.208s

 $ ./synth.bash -j22 +RTS -sstderr -A256M -qb0 -RTS

 real    0m10.483s
 user    2m13.440s
 sys     0m10.456s

 $ ./synth.bash -j24 +RTS -sstderr -A256M -qb0 -RTS

 real    0m10.985s
 user    2m18.028s
 sys     0m10.780s

 $ ./synth.bash -j32 +RTS -sstderr -A256M -qb0 -RTS

 real    0m12.636s
 user    2m32.312s
 sys     0m14.508s
 }}}

 Here the best numbers are around -j12 to -j14 (about 9.1s to 9.6s), and
 even those are worse than the best multiprocess run (about 7.3s at -j24).
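
 For a side-by-side view, the sketch below divides the single-process
 "real" time by the multiprocess one at every -j value measured in both
 sets of runs. The function name compare is made up for this illustration,
 and the two setups do not share the same -j1 baseline, so the ratios are
 only a rough guide.

```python
# Wall-clock ("real") seconds copied from the two sets of runs in this comment.
multi_process = {
    1: 93.147, 10: 11.275, 12: 10.537, 14: 9.117, 20: 8.498,
    22: 7.468, 24: 7.336, 26: 7.433, 28: 7.554, 30: 7.563,
}
single_process = {
    1: 75.214, 8: 11.275, 10: 10.279, 12: 9.605, 14: 9.144,
    16: 10.003, 20: 10.215, 22: 10.483, 24: 10.985, 32: 12.636,
}


def compare(single, multi):
    """Return {jobs: single_real / multi_real} for -j values in both sets.

    Hypothetical helper: a ratio above 1 means the single-process run was
    slower than the multiprocess run at that -j value.
    """
    common = sorted(set(single) & set(multi))
    return {jobs: single[jobs] / multi[jobs] for jobs in common}


if __name__ == "__main__":
    for jobs, ratio in compare(single_process, multi_process).items():
        print(f"-j{jobs:<2}  single/multi = {ratio:4.2f}")
```

 On these numbers the ratio stays below 1 up to -j12, is about 1 at -j14,
 and climbs to about 1.5 at -j24.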

 From the '''perf record''' output it is not very clear what is happening.

 I'll try to get a 64-core VM next week and see whether the effect is more
 clearly visible there.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:66
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler