
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Compile-time      |     Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #910, #8224       |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

Used the following GNUmakefile on the '''./synth.bash''' module set to
compare separate compiler processes against '''ghc --make -j''':

{{{
OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
	~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
	$(RM) $(OBJECTS)

.PHONY: clean
}}}

CPU topology:

{{{
$ lstopo-no-graphics
Machine (30GB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31122 MB
node 0 free: 28003 MB
node distances:
node   0
  0:  10
}}}

Separate processes:

{{{
$ make clean; time make -j1
real    1m2.561s
user    0m56.523s
sys     0m5.560s

$ make clean; time taskset --cpu-list 0-3 make -j4
real    0m18.756s
user    1m7.758s
sys     0m6.460s

$ make clean; time make -j4
real    0m18.936s
user    1m7.549s
sys     0m6.857s

$ make clean; time make -j6
real    0m17.365s
user    1m32.107s
sys     0m9.155s

$ make clean; time make -j8
real    0m15.964s
user    1m52.058s
sys     0m9.929s
}}}

The speedup over -j1 approaches 4x (3.3x at -j4, 3.9x at -j8), and it
keeps improving for -j values above the 4 physical cores. Pinning the
-j4 run to the four physical cores with taskset helps slightly.

'''ghc --make -j''' (driven by '''./synth.bash'''; a rough sketch of such
a driver is appended at the end of this message):

{{{
$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS
real    0m51.702s
user    0m50.840s
sys     0m0.844s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -RTS
real    0m17.526s
user    1m6.978s
sys     0m1.412s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -qa -RTS
real    0m17.007s
user    1m4.867s
sys     0m1.508s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS
real    0m13.829s
user    1m44.295s
sys     0m2.669s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -qa -RTS
real    0m14.597s
user    1m43.145s
sys     0m3.285s
}}}

The speedup over -j1 is around 3.5x (3.0x at -j4, 3.7x at -j8), again
improving for -j above 4. Enabling CPU affinity ('''-qa''') makes little
difference at -j4 and is slightly slower at -j8.

In absolute terms '''ghc --make -j''' is slightly faster than separate
processes, presumably due to lower startup overhead. But something else
slowly creeps in and we never reach the full 4x factor. The effect is
more visible on the 24-core VM; I will post those numbers in a few
minutes.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:65
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
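
For reference, '''./synth.bash''' itself is not reproduced in this
comment. Presumably it reduces to a single parallel '''--make'''
invocation over the same src/*.hs module set, roughly like the sketch
below; the paths and flags are assumptions mirroring the GNUmakefile
above, not the actual driver used for the timings:

{{{
#!/bin/bash
# Hypothetical sketch of a ./synth.bash-style driver, NOT the actual
# script behind the timings above.  It compiles the same src/*.hs
# modules in one parallel "ghc --make" session, forwarding the -jN and
# +RTS ... -RTS options given on the command line.
set -e

# Same inplace compiler as in the GNUmakefile above (assumed path).
GHC=~/dev/git/ghc-perf/inplace/bin/ghc-stage2

# Start from a clean tree so every run recompiles everything.
rm -f src/*.o src/*.hi

# -no-link: the synthetic modules have no Main; only compilation is timed.
time "$GHC" --make -no-link "$@" src/*.hs
}}}

Such a driver would be invoked exactly as in the measurements above,
e.g. ./synth.bash -j4 +RTS -A256M -qb0 -RTS.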