
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
        Reporter:  carter            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Compile-time      |     Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #910, #8224       |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by slyfox):

Used the following GNUmakefile on the '''./synth.bash''' module set to
compare separate compiler processes against '''ghc --make -j''':

{{{
OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
	~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
	$(RM) $(OBJECTS)

.PHONY: clean
}}}

CPU topology:

{{{
$ lstopo-no-graphics
Machine (30GB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31122 MB
node 0 free: 28003 MB
node distances:
node   0
  0:  10
}}}

Separate processes:

{{{
$ make clean; time make -j1
real    1m2.561s
user    0m56.523s
sys     0m5.560s

$ make clean; time taskset --cpu-list 0-3 make -j4
real    0m18.756s
user    1m7.758s
sys     0m6.460s

$ make clean; time make -j4
real    0m18.936s
user    1m7.549s
sys     0m6.857s

$ make clean; time make -j6
real    0m17.365s
user    1m32.107s
sys     0m9.155s

$ make clean; time make -j8
real    0m15.964s
user    1m52.058s
sys     0m9.929s
}}}

The speedup over -j1 approaches 4x (3.3x at -j4, 3.9x at -j8), and it
keeps improving for -j values above the 4 physical cores. Pinning the
-j4 run to the four physical cores with taskset helps slightly.

'''ghc --make -j''' (driven by '''./synth.bash'''; a rough sketch of such
a driver is appended at the end of this message):

{{{
$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS
real    0m51.702s
user    0m50.840s
sys     0m0.844s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -RTS
real    0m17.526s
user    1m6.978s
sys     0m1.412s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -qa -RTS
real    0m17.007s
user    1m4.867s
sys     0m1.508s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS
real    0m13.829s
user    1m44.295s
sys     0m2.669s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -qa -RTS
real    0m14.597s
user    1m43.145s
sys     0m3.285s
}}}

The speedup over -j1 is around 3.5x (3.0x at -j4, 3.7x at -j8), again
improving for -j above 4. Enabling CPU affinity ('''-qa''') makes little
difference at -j4 and is slightly slower at -j8.

In absolute terms '''ghc --make -j''' is slightly faster than separate
processes, presumably due to lower startup overhead. But something else
slowly creeps in and we never reach the full 4x factor. The effect is
more visible on the 24-core VM; I will post those numbers in a few
minutes.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:65
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
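
For reference, '''./synth.bash''' itself is not reproduced in this
comment. Presumably it reduces to a single parallel '''--make'''
invocation over the same src/*.hs module set, roughly like the sketch
below; the paths and flags are assumptions mirroring the GNUmakefile
above, not the actual driver used for the timings:

{{{
#!/bin/bash
# Hypothetical sketch of a ./synth.bash-style driver, NOT the actual
# script behind the timings above.  It compiles the same src/*.hs
# modules in one parallel "ghc --make" session, forwarding the -jN and
# +RTS ... -RTS options given on the command line.
set -e

# Same inplace compiler as in the GNUmakefile above (assumed path).
GHC=~/dev/git/ghc-perf/inplace/bin/ghc-stage2

# Start from a clean tree so every run recompiles everything.
rm -f src/*.o src/*.hi

# -no-link: the synthetic modules have no Main; only compilation is timed.
time "$GHC" --make -no-link "$@" src/*.hs
}}}

Such a driver would be invoked exactly as in the measurements above,
e.g. ./synth.bash -j4 +RTS -A256M -qb0 -RTS.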