[GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------+------------------------------------------------- Reporter: | Owner: carter | Status: new Type: bug | Milestone: Priority: high | Version: 7.8.2 Component: | Operating System: Unknown/Multiple Compiler | Type of failure: Compile-time performance bug Keywords: | Test Case: Architecture: | Blocking: Unknown/Multiple | Difficulty: | Unknown | Blocked By: | Related Tickets: | -------------------------+------------------------------------------------- im seeing slowdowns in parallel builds of a (simple!!) 6 module project when I build it on a 40 core server i'm using for work. for any given ghc invocation with -jn, once n>10, i start to see a super linear slow down as a function of n heres some basic numbers at -j1 0m2.693s at -j4 0m2.507s at -j10 0m2.763s at -j25 0m12.634s at -j30 : 0m39.154s at -j40 : 0m57.511s at -j60 : 2m21.821s these timings are another 2-4x worse if ghc is invoked indirectly via cabal-install / setup.hs according to the linux utility latencytop, 100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation. the timing in the -j40 case stayed the same even when ghc was also passed -O0 (and -fforce-recomp to ensure it did the same ) a bit of experimentation makes me believe that in *ANY* cabalized project on a 40 core machine will exhibit this perf issue. cabal clean ; cabal configure --ghc-options="-j" ; cabal build -j1 should be enough to trigger the lock contention. That said, I'll try to cook up a minimal repro that i can share the source for post haste. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Comment (by carter): so my understanding is that the parUpsweep function in compiler/main/GhcMake.hs is the main culprit / object to focus on for this performance issue, right? (literally the only thing thats tripped the the -jN ghc invocation ) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Comment (by slyfox): I've noticed it as well. For superficial tests -jN>4 does not make much sense even on 8-core box. Attached a cabal build test (takes ~50 seconds to build). Use as: {{{ $ cd $ghc-source ./test-ghc-cabal.sh >build-log }}} Results for my machine: {{{ RUN jobs: 1 real 0m51.745s user 0m49.825s sys 0m1.553s RUN jobs: 2 real 0m34.254s user 0m57.075s sys 0m6.278s RUN jobs: 3 real 0m31.670s user 1m7.058s sys 0m10.970s RUN jobs: 4 real 0m32.008s user 1m10.548s sys 0m18.194s RUN jobs: 5 real 0m32.329s user 1m15.384s sys 0m27.939s RUN jobs: 6 real 0m33.993s user 1m25.190s sys 0m41.473s RUN jobs: 7 real 0m35.410s user 1m32.354s sys 0m51.201s RUN jobs: 8 real 0m36.111s user 1m42.945s sys 1m1.740s RUN jobs: 9 real 0m37.426s user 1m49.708s sys 1m7.805s RUN jobs: 10 real 0m40.149s user 2m0.625s sys 1m13.054s RUN jobs: 11 real 0m44.515s user 2m18.503s sys 1m21.783s RUN jobs: 12 real 0m44.393s user 2m25.161s sys 1m26.875s RUN jobs: 13 real 0m47.298s user 2m44.370s sys 1m29.611s RUN jobs: 14 real 0m52.647s user 3m16.386s sys 1m37.780s RUN jobs: 15 real 0m54.757s user 3m18.954s sys 1m45.547s RUN jobs: 16 real 0m58.655s user 3m49.732s sys 1m49.191s }}} Notice how sys time creeps up taking over performance gain on N>4. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Changes (by slyfox): * cc: slyfox (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Comment (by hvr): Replying to [comment:2 slyfox]:
Notice how sys time creeps up taking over performance gain on N>4.
It's also quite visible in the quick R plot I made: [[Image(graph1.png)]] -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Changes (by tibbe): * cc: tibbe (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Comment (by slyfox): Box I performed tests on is a 8-CPU one (4-cores x 2 threads each): {{{ Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: 910 -------------------------------------------------+------------------------- Changes (by hvr): * related: => 910 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------------------+------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: #910 -------------------------------------------------+------------------------- Changes (by hvr): * related: => #910 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by gintas): I think I know what's going on here. If you look at parUpsweep in compiler/main/GhcMake.js, its argument n_jobs is used in two places: one is the initial value of the par_sem semaphore used to limit parallelization, and the other is a call to setNumCapabilities. The latter seems to be the cause of the slowdown. Note that setNumCapabilities is only invoked if the previous count of capabilities was 1. I used that to control for both settings independently, and it turns out that the runtime overhead is mostly independent of the semaphore value and highly influenced by capability count. I ran some experiments on a 16-CPU VM (picked a larger one deliberately to make the differences more pronounced). Running with jobs=4 & caps=4, a test took 37s walltime, jobs=4 & caps=16 took 51s, jobs=4 & caps=32 took 114s (344s of MUT and 1021s of GC!). The figures are very similar for jobs=16 and jobs=64. See attached log for more details (-sstderr output). It looks like the runtime GC is just inefficient when running with many capabilities, even if many physical cores are available. I'll try a few experiments to verify that this is a general pattern that is not specific to the GhcMake implementation. Logic and a few experiments indicate that it does not help walltime to set the number of jobs (semaphore value) higher than the number of capabilities, so there's not much we can do about those two parameters in the parUpsweep implementation other than capping n_jobs at some constant (probably <= 8). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by gintas): While looking around I found #3758, and it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took: real 1m3.596s user 6m31.072s sys 3m10.732s With -qg: real 0m47.747s user 1m33.352s sys 0m2.024s However, for smaller -j values -qg slightly increases walltime: 2 jobs: 44s without -qg, 46s with -qg 4 jobs: 37s vs 40s 8 jobs: 37s vs 41s 15 jobs: 42s vs 44s 16 jobs: 49s vs 44s (walltime crossover point for this 16-core machine) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by gintas): Also #8224 looks like it could be related. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by simonmar): We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores. e.g. `+RTS -A32m` at least. Check whether your machines really have N cores, or if N/2 of those are hyperthreaded cores. I don't recommend including hyperthreaded cores in the -N value you give to GHC, it's better to leave them free to soak up any extra load while letting the GC synchronize properly. @carter, when you said "100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.", how did you discover that? Which lock in particular? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by carter): Yeah, I definitely found that making the nursery larger helped A LOT, i'd done `-A50m` or `-A100m` on a few occasions, though I must admit I didn't think to do that for running GHC itself! Good point about hyperthreads, that would match my experience of only seeing good perf up to about 15-20ish for -N. This does raise the problem that the default -N flag should probably be set to the number of physical cores rather than hyper threaded cores, would that be a reasonable patch for me to write for GHC? (though of course theres the matter of correctly detecting that info in a portable way! ) I unfortunately dont have access anymore to that machine (or presently any SSH access to a mega core sized box) so I can't repro it to dig in deeper, I was using some cli tool who's name escapes me right now -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by gintas): I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is. -A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16 core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m... Looks like the sweet spot is somewhere in the middle. How about building ghc --with-rtsopts=-A16m by default? Are there any downsides to that? On my machine the "total memory in use" overhead for that is less than 2x (475MB vs 265MB) and it really helps especially in the degenerate cases. (The machine I was testing on had 16 real cores without any hyperthreading.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by gidyn): * cc: gidyn (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by simonmar): You should never have more capabilities than cores. It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here. Personally I'd like to know why -j is hardly providing any benefit beyond 2-3 cores, regardless of heap size settings. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by nh2): * cc: nh2 (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by nh2): @simonmar How much do you think is this affected by #8224? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: Compile- | Blocked By: time performance bug | Related Tickets: #910 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by simonmar): @nh2 I don't think anyone has fully investigated what's going on in #8224. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by kolmodin): * cc: kolmodin (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thoughtpolice): * milestone: 7.10.1 => 7.12.1 Comment: (Pushing back to 7.12 pending investigation on this ticket and #8224). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by erikd): * cc: erikd (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by ezyang): I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) which all the threads are hitting, it doesn't make sense to try to fix the compiler to scale; you should just run multiple processes of the compiler in parallel (like GHC's build does). Guaranteed scaling! Do people agree with this viewpoint? Disagree? If we take this viewpoint seriously, we should spend more time improving GHC's ability to output the dependency analysis for other tools to then build. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Alternatively, give each thread its own `NameCache`. To do that, if one thread consumes a `ModIface` produced by another, it'd need to run over it, looking up all the `Names` to give them the `Unique` local to that thread. Not so very different to having entirely separate processes communicating through files; and the latter is perhaps easier to think about. So, yes. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonmar): @ezyang: I absolutely agree. Parallelism within a single process is always going to be harder to scale than separate processes. That said, there might still be low-hanging fruit (e.g. better default heap settings) for parallel compilation. I very much like `-j` in GHCi though - we cut the time to load a large project by 3x just by turning on `-j` in GHCi with some suitable heap settings (`-A256m`). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by yongqli): Will adding `-j9` to the `ghc-options:` field of a cabal project mitigate this in the mean time until this is fixed? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by bgamari): For future readers, `highlighting-kate` would likely be an excellent package to benchmark with as it has a large number of largely independent modules. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by slyfox): The selfcontained test contains 195 modules to build highlighting-kate: http://code.haskell.org/~slyfox/T9221__highlighting-kate-build- benchmark.tar.gz To run it you need to tweak a path to your 'ghc=' in '''mk.bash'''. On my 8 core box best results can be achieved by setting -A128M (or larger). The test is handy to run 'perf record' on it. My results for '''./mk.bash -j8 -A128M''': {{{ $ perf record -g ./mk.bash -j8 -A128M $ perf report -g + 15,42% 0,78% ghc_worker libc-2.21.so [.] __sched_yield + 15,08% 1,62% ghc_worker [kernel.vmlinux] [k] entry_SYSCALL_64 + 12,70% 0,38% ghc_worker [kernel.vmlinux] [k] sys_sched_yield + 10,91% 6,47% ghc_worker ghc-stage2 [.] clE_info + 10,50% 0,38% ghc_worker [kernel.vmlinux] [k] schedule + 9,14% 1,58% ghc_worker [kernel.vmlinux] [k] __schedule + 8,60% 0,00% ghc_worker [unknown] [.] 0x48032822f800c748 + 5,85% 5,64% ghc_worker ghc-stage2 [.] evacuate + 3,47% 0,94% ghc_worker ghc-stage2 [.] c7F_info + 2,91% 0,69% ghc_worker [kernel.vmlinux] [k] pick_next_task_fair + 2,70% 0,00% ghc_worker [unknown] [.] 0x0000000000000004 + 2,63% 2,28% ghc_worker ghc-stage2 [.] c2k_info + 2,55% 0,00% ghc_worker [unknown] [.] 0x2280f98148088b48 + 1,90% 0,00% ghc_worker [unknown] [.] 0x834807e283da8948 + 1,90% 0,00% as ld-2.21.so [.] _dl_sysdep_start + 1,89% 0,00% as ld-2.21.so [.] dl_main + 1,83% 0,15% as [kernel.vmlinux] [k] page_fault + 1,81% 0,47% as ld-2.21.so [.] _dl_relocate_object }}} If perf does not lie most of the time is spent cycling over sleeping kernel threads. '''clE_info''' is a 'INFO_TABLE(stg_BLACKHOLE,1,0,BLACKHOLE,"BLACKHOLE","BLACKHOLE")' -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by bgamari): slyfox, out of curiosity what sort of parallel speed-up did you observe in that test? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by nh2): Replying to [comment:23 ezyang]: there is some global mutable state (e.g. the NameCache) ...
Do people agree with this viewpoint? Disagree?
I disagree. Threads should almost always be more efficient to use as they allow to efficiently/easily share resources when it makes things faster, but that doesn't mean that we have to share all the things. Processes force us to not share anything. If building is faster with separate processes, then we should be able to achieve the same speed with threads by simply not sharing that thing that makes it slow and that processes force us to not share. However, I wouldn't be surprised if this isn't even the problem here. Replying to [comment:28 slyfox]:
If perf does not lie most of the time is spent cycling over sleeping kernel threads
This sounds much more like the problem. If I had to make a guess (and based on the very limited look I had into this issue last year) it feels like we are accidentally busy polling something somewhere. When I run some non-build Haskell stuff with `-RTS +N18` on the current generation of 18 core AWS instances, with many more Haskell threads than needed for building a 200 module project, and with shorter thread life times than in this case (e.g. let's say building a module takes around 0.5 seconds), that stuff scales pretty nicely, much better than ghc's `--make` scales here. This makes me think that we might be simply doing something wrong in the parallel upsweep code, and that the rest (compiler, runtime etc.) is doing quite OK. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonmar): It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by bgamari): simonmar, can ThreadScope help point one in the direction of which blackhole is to blame? I was under the impression that blackhole performance debugging an area where our current tools fell a little short, but maybe I've missed something. Somewhere around here I have the beginnings of a patch adding support for profiling of blackhole block events but I wonder if our new DWARF capabilities might be a better fit for this sort of performance work. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: high | Milestone: 7.12.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910 | Differential Revisions:
-------------------------------------+-------------------------------------
Comment (by slyfox):
Replying to [comment:29 bgamari]:
> slyfox, out of curiosity what sort of parallel speed-up did you observe
in that test?
At best I get ~2.5 speedup (-j8 -A128M):
- -j1 : 41.1s
- -j8 -A128M : 17.4s
Replying to [comment:31 simonmar]:
> It sounds like there might be a lot of threads hitting blackholes, with
consequent context-switching churn. Finding out which blackhole would be
good - probably something in a shared data structure (NameCache or the
FastString table, perhaps). ThreadScope would be a good next step.
Tried to run as: '''./mk.bash -j8 +RTS -A128m -l''':
http://code.haskell.org/~slyfox/T9221-A128M-j8-l.eventlog
I see a chain of 'thread yields', 'blocked on an MVar' there, but can't
draw any conclusions.
Can I somehow get callers of blackhole? I guess 'perf report -G' will lie
as stacks are lost, but here goes it's output:
{{{
Children Self Command Shared Object Symbol
+ 15,92% 0,81% ghc_worker libc-2.21.so [.]
__sched_yield
+ 15,54% 1,69% ghc_worker [kernel.vmlinux] [k]
entry_SYSCALL_64
+ 13,11% 0,40% ghc_worker [kernel.vmlinux] [k]
sys_sched_yield
- 10,85% 6,43% ghc_worker ghc-stage2 [.] clI_info
- 40,76% clI_info (stg_BLACKHOLE_info)
+ 2,23% c8E2_info (ghc_Pretty_reduceDoc_info)
+ 1,81% c9p9_info (ghc_Pretty_vcatzugo_info)
+ 1,65% cJw6_info (ghczmprim_GHCziClasses_divIntzh_info)
+ 1,62% c8zA_info (...others)
+ 1,62% ca_info
+ 1,39% apic_timer_interrupt
+ 1,34% c4nL_info
+ 1,33% c3ZK_info
+ 1,27% c2k_info
+ 1,04% cmKx_info
+ 1,03% cp2g_info
0,92% cp_info
+ 0,89% shV8_info
+ 0,88% strlen
+ 0,88% c9r1_info
+ 0,87% r63b_info
+ 0,79% s1P5_info
+ 0,73% c4IG_info
+ 0,69% c4uY_info
+ 0,63% r117_info
+ 0,62% sw4Y_info
0,62% cqIM_info
0,61% caEA_info
+ 0,58% c2l_info
0,57% cyqt_info
+ 0,52% c9xp_info
33,29% 0x480328785000c748
+ 1,82% 0x4b7202f9834807e1
+ 1,54% 0x4808588b48f8e083
+ 0,87% 0x438b482677000003
+ 0,87% 0x8d3b49307507c1f6
+ 0,82% 0x4640cd24fffc498b
+ 0,81% 0xf07f98348fc498b
+ 0,75% 0x9b820f02f8
+ 0,65% 0x24ff07e283da8948
+ 0,63% 0x11d820f0af883
+ 0,61% 0xf4b8b4807438b48
+ 0,53% 0x58c48349677202f8
+ 0,50% 0x48074b8b48d88948
+ 10,79% 0,38% ghc_worker [kernel.vmlinux] [k] schedule
+ 9,44% 1,60% ghc_worker [kernel.vmlinux] [k] __schedule
}}}
--
Ticket URL:
GHC
The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thomie): * related: #910 => #910, #8224 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by kazu-yamamoto): * cc: kazu-yamamoto (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonmar): @bgamari, correct - ThreadScope won't tell us anything about the origin of blackholes, but it would confirm whether there is a lot of blocking on blackholes. Ideally we should see all the capabilities fully busy when compiling a perfectly parallel workload. Note that dependencies between modules reduce the amount of parallelism available in general, so for a good test it would be best to construct an example with plenty of parallelism. @slyfox, I don't believe perf is telling us anything sensible here. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by bgamari): * cc: scpmw (added) Comment: @scpmw, would it be possible to identify the expressions being blackholed (assuming that is indeed what is happening here) from `perf`'s output? It seems like this might be a good use-case for our new DWARF support if so. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: high | Milestone: 7.12.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Revisions:
-------------------------------------+-------------------------------------
Comment (by slyfox):
Done the following:
- patched GHC [1] to print only blackholes in '+RTS -Da' and hide
unresolvable addresses as [addr]
- extended blackhole printer to dump '->indirectee' contents as a heap
object (is it safe?)
- added synchronization to printClosure to save stderr from multiple
threads
- linked ghc-stage2 with debug runtime
- ran as '-j8 +RTS -A128M -Da'
- sorted output by most frequent calls ('sort l | uniq -c | sort -n -k1 -r
| head -n30')
Here is the result:
{{{
4179540 BLACKHOLE(Object [addr] = ghc-prim:GHC.Types.I#([addr])\n)\n
2031781 BLACKHOLE(Object [addr] = ghc:CLabel.IdLabel([addr], [addr],
[addr])\n)\n
1782769 BLACKHOLE(Object [addr] =
FUN/1(
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonmar): I talked with @bgamari about this a bit today. What I think you're seeing in the above data is entries of *indirections*, not blackholes. Indirections and blackholes look the same, except that in a true black hole, the indirectee points to a TSO rather than a value. Looking for the cases where the indirectee is a TSO will tell you how many times we get blocked on a black hole. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 8.0.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Revisions: -------------------------------------+------------------------------------- Comment (by slyfox): Didin't try to inspect thunks for TSOs yet. Another datapoint: Built whole GHC with ```SRC_HC_OPTS += -feager-blackholing``` and ran on the same source from comment 28 with ```-j${i} +RTS -A128M >/dev/null```. The best wall-clock (real) result is achieved on -j8 (for box with 8 logical cores!) {{{ $ for i in `seq 1 40`; do echo "running $i"; time ./mk.bash -j${i} +RTS -A128M >/dev/null; done running 1 real 0m45.510s user 0m41.844s sys 0m2.699s running 2 real 0m26.609s user 0m42.121s sys 0m4.227s running 3 real 0m20.976s user 0m47.452s sys 0m5.892s running 4 real 0m18.391s user 0m48.771s sys 0m7.716s running 5 real 0m16.504s user 0m53.015s sys 0m9.339s running 6 real 0m16.108s user 0m59.391s sys 0m12.256s running 7 real 0m15.845s user 1m3.952s sys 0m15.632s running 8 real 0m15.163s user 1m9.096s sys 0m14.940s running 9 real 0m15.693s user 1m11.332s sys 0m17.059s running 10 real 0m16.155s user 1m13.646s sys 0m18.863s running 11 real 0m16.595s user 1m16.520s sys 0m19.622s running 12 real 0m19.407s user 1m18.191s sys 0m22.702s running 13 real 0m17.047s user 1m18.216s sys 0m22.033s running 14 real 0m17.668s user 1m19.592s sys 0m24.237s running 15 real 0m18.087s user 1m21.217s sys 0m26.945s running 16 real 0m18.304s user 1m21.447s sys 0m28.425s running 17 real 0m18.912s user 1m23.788s sys 0m29.615s running 18 real 0m19.715s user 1m24.838s sys 0m32.569s running 19 real 0m20.534s user 1m25.534s sys 0m31.648s running 20 real 0m21.440s user 1m27.836s sys 0m42.114s }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:41 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 8.0.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by rrnewton): If blackholing *is* the culprit then I wonder how localized the strictification will need to be... Ezyang's multi-process suggestion seems appealing. But if it's a `forkProcess` approach, is it possible to load most of the relevant state into memory before forking a child process per module? That could avoid or reduce communication through files or shared pages. The granularity of per-module compiles is big, so maybe processes would be ok. Also, there's the nice side effect that the GCs of child processes are disentangled, removing that particular scaling bottleneck. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:42 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 8.0.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
im seeing slowdowns in parallel builds of a (simple!!) 6 module project when I build it on a 40 core server i'm using for work. for any given ghc invocation with -jn, once n>10, i start to see a super linear slow down as a function of n
heres some basic numbers
at -j1 0m2.693s
at -j4 0m2.507s
at -j10 0m2.763s
at -j25 0m12.634s
at -j30 : 0m39.154s
at -j40 : 0m57.511s
at -j60 : 2m21.821s
these timings are another 2-4x worse if ghc is invoked indirectly via cabal-install / setup.hs
according to the linux utility latencytop, 100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.
the timing in the -j40 case stayed the same even when ghc was also passed -O0 (and -fforce-recomp to ensure it did the same )
a bit of experimentation makes me believe that in *ANY* cabalized project on a 40 core machine will exhibit this perf issue.
cabal clean ; cabal configure --ghc-options="-j" ; cabal build -j1
should be enough to trigger the lock contention.
That said, I'll try to cook up a minimal repro that i can share the source for post haste.
New description: I'm seeing slowdowns in parallel builds of a (simple!!) 6 module project when I build it on a 40 core server that I'm using for work. For any given ghc invocation with `-jn`, once n>10, I start to see a super-linear slow down as a function of n, here are some basic numbers, ||= =||= compile time =|| || -j1 || 0m2.693s || || -j4 || 0m2.507s || || -j10 || 0m2.763s || || -j25 || 0m12.634s || || -j30 || 0m39.154s || || -j40 || 0m57.511s || || -j60 || 2m21.821s || These timings are another 2-4x worse if ghc is invoked indirectly via cabal-install / `Setup.hs` According to the linux utility latencytop, 100% of ghc's cpu time was spent on user-space lock contention when I did the `-j40` invocation. The timing in the `-j40` case stayed the same even when ghc was also passed `-O0` (and `-fforce-recomp` to ensure it did the same ) A bit of experimentation makes me believe that *any* cabalized project on a 40 core machine will exhibit this performance issue. {{{ cabal clean cabal configure --ghc-options="-j" cabal build -j1 }}} should be enough to trigger the lock contention. That said, I'll try to cook up a minimal repro that i can share the source for post haste. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:43 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: high | Milestone: 8.0.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by mboes): * cc: mboes (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * priority: high => normal * milestone: 8.0.1 => 8.2.1 Comment: I intend on looking into this but it doesn't that anything will happen for 8.0 and indeed it's not terribly high priority given that GHC's `-j` option isn't very widely used. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:45 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): {{{ given that GHC's -j option isn't very widely used }}} I think it's only not used so widely because it doesn't work well yet due to this bug - I have worked on 3 projects of 3 companies that would significantly benefit from it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:46 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): Don't know about other users but starting from ghc-7.8 every gentoo packages are built with -j4 on machines with enough CPU cores. Thinking about building initial ghc-cabal for GHC build in parallel. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:47 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by dobenour): I remember reading somewhere that the RTS has a global lock for allocating sufficiently large objects. Could that be the problem? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:48 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by kazu-yamamoto): FYI: http://www.yesodweb.com/blog/2014/02/new-warp -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:49 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): As -feaged-blackholing helps for this case how would one find out statistics of closures needlessly computed multiple times? One of 'unsafeDupable' examples that might be heavyweight when done multiple times: http://git.haskell.org/ghc.git/blob/HEAD:/libraries/base/GHC/Fingerprint.hs#... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:50 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by rrnewton): @slyfox -- can we get a fixed up version of that black hole debugging patch merged? That seems really great. Ideally I'd like to put GHC in a debugging mode were the first time a thread blocks on a black hole the program crashes with a stack trace ;-). Edit rinse repeat until they're all gone. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:51 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
Ideally I'd like to put GHC in a debugging mode were the first time a
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): thread blocks on a black hole the program crashes with a stack trace ;-). Edit rinse repeat until they're all gone. You may be interested in Phab:D1216, which builds on the the DWARF-based statistical profiling support that I have [[DWARF/80Status|in the works]] for provide profiling of black-hole entry events. As-written this patch only captures the head of the stack, although this could be changed to provide a deeper stacktrace. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:52 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): I've experimented a bit more with trying to pin down where slowdown comes from. Some observations: Observation 1. -j <K> not only allows <K> modules to be compiled at the same time, but also enables: - <K> Capabilities - and <K> garbage collection threads I've locally removed Capability adjustment from -j handling and used -j <K> +RTS -N. That does not make performance as bad with increasing K. That makes sense GC OS threads don't fight over the same cache. It would be nice if '''+RTS -N''' would have a precedence over -j option Observation 2. [Warning: I have no idea how parallel GC works]. The more GC threads we have - the more chances are that one of GC threads will finish scanning it's part oh heap and will sit in sched_yield() loop on a free core while main GC thread waits for completion of other threads doing useful work. I've found it out by changing yieldThread() to print it's caller. Vast majority of calls comes from any_work(): {{{ static rtsBool any_work (void) { int g; gen_workspace *ws; gct->any_work++; write_barrier(); // scavenge objects in compacted generation if (mark_stack_bd != NULL && !mark_stack_empty()) { return rtsTrue; } // Check for global work in any gen. We don't need to check for // local work, because we have already exited scavenge_loop(), // which means there is no local work for this thread. for (g = 0; g < (int)RtsFlags.GcFlags.generations; g++) { ws = &gct->gens[g]; if (ws->todo_large_objects) return rtsTrue; if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue; if (ws->todo_overflow) return rtsTrue; } #if defined(THREADED_RTS) if (work_stealing) { uint32_t n; // look for work to steal for (n = 0; n < n_gc_threads; n++) { if (n == gct->thread_index) continue; for (g = RtsFlags.GcFlags.generations-1; g >= 0; g--) { ws = &gc_threads[n]->gens[g]; if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue; } } } #endif gct->no_work++; #if defined(THREADED_RTS) yieldThread("any_work"); #endif return rtsFalse; } }}} I need to dig more into how parallel GC traverses heap to understand how much of a problem it is. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:53 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): @slyfox the GC threads all spin in this `any_work()` loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:54 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): Replying to [comment:54 simonmar]:
@slyfox the GC threads all spin in this `any_work()` loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads.
Aha, thanks! I've read through http://community.haskell.org/~simonmar/papers/parallel-gc.pdf The paper does not look outdated. The crucial detail: work stealing is enabled by default for gen=1 and upper. As default nursery is tiny we don't do stealing from it. That's why I see poor GC parallelism on large nurseries for this compilation workload: [ The numbers below are for 4-CPU system, not 8-CPU i used to post before. I'll redo tests in a few days once I get access to it. ] -qb1/-qb0 comparison: -qb1 (default): {{{ $ time ./mk.bash -j4 +RTS -N4 -RTS +RTS -sstderr -A768M -RTS -O0 -j4 +RTS -l 55,162,983,632 bytes allocated in the heap 639,999,688 bytes copied during GC 88,807,552 bytes maximum residency (2 sample(s)) 4,100,072 bytes maximum slop 3379 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 20 colls, 20 par 5.617s 1.582s 0.0791s 0.1410s Gen 1 2 colls, 1 par 0.227s 0.081s 0.0405s 0.0500s Parallel GC work balance: 22.34% (serial 0%, perfect 100%) TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.027s ( 0.026s elapsed) MUT time 59.245s ( 18.963s elapsed) GC time 5.843s ( 1.663s elapsed) EXIT time 0.032s ( 0.034s elapsed) Total time 65.174s ( 20.686s elapsed) Alloc rate 931,102,797 bytes per MUT second Productivity 91.0% of total user, 91.8% of total elapsed gc_alloc_block_sync: spin=10830; yield=5 whitehole_spin: 0 gen[0].sync: spin=6297; yield=5 gen[1].sync: spin=4240; yield=3 real 0m20.863s user 1m5.752s sys 0m4.724s }}} -qb0: {{{ 55,154,416,472 bytes allocated in the heap 840,077,056 bytes copied during GC 135,018,976 bytes maximum residency (3 sample(s)) 5,739,552 bytes maximum slop 2375 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 31 colls, 31 par 3.078s 0.859s 0.0277s 0.0682s Gen 1 3 colls, 2 par 0.607s 0.177s 0.0591s 0.0947s Parallel GC work balance: 74.48% (serial 0%, perfect 100%) TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.023s ( 0.023s elapsed) MUT time 59.476s ( 18.325s elapsed) GC time 3.685s ( 1.036s elapsed) EXIT time 0.027s ( 0.028s elapsed) Total time 63.244s ( 19.412s elapsed) Alloc rate 927,345,481 bytes per MUT second Productivity 94.1% of total user, 94.5% of total elapsed gc_alloc_block_sync: spin=58847; yield=38 whitehole_spin: 0 gen[0].sync: spin=126024; yield=104 gen[1].sync: spin=34648; yield=31 real 0m19.575s user 1m6.804s sys 0m1.760s }}} Note how GC parallelism increases from ~20% to ~75% for this huge nursery of 768M. Something hinders mutator's parallelism still. threadscope says threads are usually blocked on MVars for long periods of time (order of 0.1-0.2 seconds). Perhaps those wait for external gcc/gas to assemble modules. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:55 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): Yes, perhaps we should default to -qb0 when -A is larger than some threshold. The mutator parallelism looks not too bad (3.24 out of 4). Was there enough parallelism in the program you were compiling to do better than that? The `MVar`s are probably just the compilation manager: it fires up a thread for every module, and they wait on the results of compiling the modules they depend on. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:56 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
Yes, perhaps we should default to -qb0 when -A is larger than some
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): Replying to [comment:56 simonmar]: threshold.
The mutator parallelism looks not too bad (3.24 out of 4). Was there
The `MVar`s are probably just the compilation manager: it fires up a
enough parallelism in the program you were compiling to do better than that? I assumed there is as it's ~100 independent haskell modules, but some of them are disproportionally large. Got to 24-core VM to run the same benchmark. At first it manages to peak 24 HEC utilisation but at the tail it clearly has not enough parallelism: -qb1: http://code.haskell.org/~slyfox/T9221/ghc- stage2.eventlog.N24.j24.png -qb0: http://code.haskell.org/~slyfox/T9221/ghc- stage2.eventlog.N24.j24.A256M.qb0.png thread for every module, and they wait on the results of compiling the modules they depend on. Yes, that's more likely. It would be cool to have labelMVar similar to existing labelThread to see IDs/names of blocker MVars. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:57 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by slyfox): * Attachment "synth.bash" added. synth.bash - perfectly parallel workload for ghc -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Sergei Trofimovich
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Sergei Trofimovich
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by carter): Very cool work! Do we want to limit the number of capabilities to the number of hyper threaded cores or the number of physical cores? Likewise, another opportunity for improvement might be seeing about making sure the capabilities (when CPU pining is enabled for the rts) are spread out maximally across the available CPU cores? I believe this or a similar issue was discussed / observed by Ryan Yates / fryguybob previously but I'm not sure if we had worked out a portable way to decide this or not -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:60 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): [warning: not a NUMA expert] Tl;DR: I think it depends on what exactly we hit as a bottleneck. I have a suspiction we saturate RAM bandwidth, not CPU ability to retire instructions due to hyperthreads. Basically GHC does too many non-local references and one of the ways to speed GHC up is either insrease memory locality or decrease HEAP usage. Long version: For a while I tried to figure out why exactly I don't see perfect scaling of '''ghc --make''' on my box. It's easy to see/compare with '''synth.bash +RTS -A256M -RTS''' benchmark ran with '''-j1''' / '''-j''' options. I don't have hard evidence but I suspect bottleneck is not due to hyperthreads/real core execution engines but due to RAM bandwidth limit on CPU-to-memory path. One of the hints is '''perf stat''': {{{ $ perf stat -e cache-references,cache- misses,cycles,instructions,branches,faults,migrations,mem-loads,mem-stores ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS Performance counter stats for './synth.bash -j +RTS -sstderr -A256M -qb0 -RTS': 3 248 577 545 cache-references (28,64%) 740 590 736 cache-misses # 22,797 % of all cache refs (42,93%) 390 025 361 812 cycles (57,18%) 171 496 925 132 instructions # 0,44 insn per cycle (71,45%) 33 736 976 296 branches (71,47%) 1 061 039 faults 1 524 migrations 67 895 mem-loads (71,42%) 27 652 025 890 mem-stores (14,27%) 15,131608490 seconds time elapsed }}} 22% of all cache refs are misses. A huge number. I think it dominates performance (assuming memory access is ~100 times slower than CPU cache access), but I have no hard evidence :) I have 4 cores with x2 hyperthreads each and get best performance from -j8, not -j4 as one would expect from hyperthreads inctruction retirement: -j1: 55s; -j4: 18s; -j6: 15s; j8: 14.2s; -j10: 15.0s {{{ ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS 66,769,724,456 bytes allocated in the heap 1,658,350,288 bytes copied during GC 127,385,728 bytes maximum residency (5 sample(s)) 1,722,080 bytes maximum slop 2389 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 31 colls, 31 par 6.535s 0.831s 0.0268s 0.0579s Gen 1 5 colls, 4 par 1.677s 0.225s 0.0449s 0.0687s Parallel GC work balance: 80.03% (serial 0%, perfect 100%) TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.002s ( 0.002s elapsed) MUT time 87.599s ( 12.868s elapsed) GC time 8.212s ( 1.056s elapsed) EXIT time 0.013s ( 0.015s elapsed) Total time 95.841s ( 13.942s elapsed) Alloc rate 762,222,437 bytes per MUT second Productivity 91.4% of total user, 92.4% of total elapsed gc_alloc_block_sync: 83395 whitehole_spin: 0 gen[0].sync: 280927 gen[1].sync: 134537 real 0m14.070s user 1m44.835s sys 0m2.899s }}} I've noticed that building GHC with '''-fno-worker-wrapper -fno-spec- constr''' makes GHC 4% faster (-j8) (memory allocations is 7% less, a bug #11565 is likely at fault) which also hints at memory throughput as a bottleneck. The conclusion: AFAIU, thus to make most of GHC we should strive for amount of active threads capable of saturating all the memory IO channels the machine has (but not much more). '''perf bench mem all''' suggests RAM bandwidth performance is in range of 2-32GB/s depending how bad workload is. I would assume GHC workload is very non- linear (and thus bad). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:61 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by rrnewton): Would people agree that we should not have spinlocks in the system without some back-off strategy? E.g. switch to sleeping/blocking after some number of iterations. Even a constant upper bound could reign in worst case behaviors. Apparently OS's provide some notion of "adaptive lock" that has other nice features, e.g.: http://stackoverflow.com/questions/19863734/what-is- pthread-mutex-adaptive-np -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:62 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by carter): Replying to [comment:61 slyfox]:
[warning: not a NUMA expert]
Tl;DR:
I think it depends on what exactly we hit as a bottleneck.
I have a suspiction we saturate RAM bandwidth, not CPU ability to retire instructions due to hyperthreads. Basically GHC does too many non-local references and one of the ways to speed GHC up is either insrease memory locality or decrease HEAP usage.
That's exactly why I'm wondering if hyper threading is messing with us! Each pair of hyper thread cores shares the same l1 l2 cache, so if we're memory limited that might be Triggering a higher rate of cache thrash? Also in some cases when capabilities numbers are below the number of cores I think we pin two capabitilties to the same physical core needlessly. I need to dig up those references and revisit that though :)
Long version:
For a while I tried to figure out why exactly I don't see perfect scaling of '''ghc --make''' on my box.
It's easy to see/compare with '''synth.bash +RTS -A256M -RTS'''
benchmark ran with '''-j1''' / '''-j''' options.
I don't have hard evidence but I suspect bottleneck is not due to hyperthreads/real core execution engines but due to RAM bandwidth limit on CPU-to-memory path. One of the hints is '''perf stat''':
{{{ $ perf stat -e cache-references,cache-
misses,cycles,instructions,branches,faults,migrations,mem-loads,mem-stores ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS
Performance counter stats for './synth.bash -j +RTS -sstderr -A256M
-qb0 -RTS':
3 248 577 545 cache-references
(28,64%)
740 590 736 cache-misses # 22,797 % of all
390 025 361 812 cycles (57,18%) 171 496 925 132 instructions # 0,44 insn per cycle (71,45%) 33 736 976 296 branches (71,47%) 1 061 039 faults 1 524 migrations 67 895 mem-loads (71,42%) 27 652 025 890 mem-stores (14,27%)
15,131608490 seconds time elapsed }}}
22% of all cache refs are misses. A huge number. I think it dominates
(assuming memory access is ~100 times slower than CPU cache access), but I have no hard evidence :)
I have 4 cores with x2 hyperthreads each and get best performance from -j8, not -j4 as one would expect from hyperthreads inctruction retirement:
-j1: 55s; -j4: 18s; -j6: 15s; j8: 14.2s; -j10: 15.0s
{{{ ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS
66,769,724,456 bytes allocated in the heap 1,658,350,288 bytes copied during GC 127,385,728 bytes maximum residency (5 sample(s)) 1,722,080 bytes maximum slop 2389 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
Gen 0 31 colls, 31 par 6.535s 0.831s 0.0268s 0.0579s Gen 1 5 colls, 4 par 1.677s 0.225s 0.0449s 0.0687s
Parallel GC work balance: 80.03% (serial 0%, perfect 100%)
TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.002s ( 0.002s elapsed) MUT time 87.599s ( 12.868s elapsed) GC time 8.212s ( 1.056s elapsed) EXIT time 0.013s ( 0.015s elapsed) Total time 95.841s ( 13.942s elapsed)
Alloc rate 762,222,437 bytes per MUT second
Productivity 91.4% of total user, 92.4% of total elapsed
gc_alloc_block_sync: 83395 whitehole_spin: 0 gen[0].sync: 280927 gen[1].sync: 134537
real 0m14.070s user 1m44.835s sys 0m2.899s }}}
I've noticed that building GHC with '''-fno-worker-wrapper -fno-spec- constr''' makes GHC 4% faster (-j8) (memory allocations is 7% less, a bug #11565 is likely at fault) which also hints at memory throughput as a bottleneck.
The conclusion:
AFAIU, thus to make most of GHC we should strive for amount of active
capable of saturating all the memory IO channels the machine has (but not much more).
'''perf bench mem all''' suggests RAM bandwidth performance is in range of 2-32GB/s depending how bad workload is. I would assume GHC workload is very non-
cache refs (42,93%) performance pause threads linear (and thus bad). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:63 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): GHC has a `+RTS --numa` option that you might want to try. Memory bandwidth *might* be a scaling issue, it depends a lot on what hardware you're using (how many memory channels, are they populated with DIMMs?). You could always verify this by comparing the scaling that you get with N independent GHC processes vs. `ghc -jN`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:64 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): Used the following GNUMakefile for '''./synth.bash''' to compare separate processes: {{{ OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs)) all: $(OBJECTS) src/%.o: src/%.hs ~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@ clean: $(RM) $(OBJECTS) .PHONY: clean }}} CPU topology: {{{ $ lstopo-no-graphics Machine (30GB) Socket L#0 + L3 L#0 (8192KB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#4) L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 PU L#2 (P#1) PU L#3 (P#5) L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 PU L#4 (P#2) PU L#5 (P#6) L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 PU L#6 (P#3) PU L#7 (P#7) $ numactl -H available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 31122 MB node 0 free: 28003 MB node distances: node 0 0: 10 }}} Separate processes: {{{ $ make clean; time make -j1 real 1m2.561s user 0m56.523s sys 0m5.560s $ make clean; time taskset --cpu-list 0-3 make -j4 real 0m18.756s user 1m7.758s sys 0m6.460s $ make clean; time make -j4 real 0m18.936s user 1m7.549s sys 0m6.857s $ make clean; time make -j6 real 0m17.365s user 1m32.107s sys 0m9.155s $ make clean; time make -j8 real 0m15.964s user 1m52.058s sys 0m9.929s }}} The speedup compared to -j1 is almost exactly 4x, but it happens on -j higher than 4 as well. Using CPU affinity makes things better on -j4. {{{ $ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS real 0m51.702s user 0m50.840s sys 0m0.844s $ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -RTS real 0m17.526s user 1m6.978s sys 0m1.412s $ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -qa -RTS real 0m17.007s user 1m4.867s sys 0m1.508s $ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS real 0m13.829s user 1m44.295s sys 0m2.669s $ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -qa -RTS real 0m14.597s user 1m43.145s sys 0m3.285s }}} The speedup compared to -j1 is around 3.5x, also happens on -j higher than 4. Using CPU affinity makes things worse on -j4. In absolute times '''ghc --make -j''' is slightly better that separate processes due to less startup(?) overhead. But something else slowly creeps up and we don't see 4x factor. It's more visible on 24-core VM, will post in a few minutes. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:65 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): 24-core VM. CPU topology: {{{ $ lstopo-no-graphics Machine (118GB) Package L#0 + L3 L#0 (30MB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#1) L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 PU L#2 (P#2) PU L#3 (P#3) L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 PU L#4 (P#4) PU L#5 (P#5) L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 PU L#6 (P#6) PU L#7 (P#7) L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 PU L#8 (P#8) PU L#9 (P#9) L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 PU L#10 (P#10) PU L#11 (P#11) L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 PU L#12 (P#12) PU L#13 (P#13) L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 PU L#14 (P#14) PU L#15 (P#15) L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 PU L#16 (P#16) PU L#17 (P#17) L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 PU L#18 (P#18) PU L#19 (P#19) L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 PU L#20 (P#20) PU L#21 (P#21) L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 PU L#22 (P#22) PU L#23 (P#23) $ numactl -H available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 node 0 size: 120881 MB node 0 free: 120192 MB node distances: node 0 0: 10 }}} (I would not trust numactl output). Separate processes: {{{ $ make clean; time make -j1 real 1m33.147s user 1m20.836s sys 0m11.556s $ make clean; time make -j10 real 0m11.275s user 1m29.800s sys 0m12.856s $ make clean; time make -j12 real 0m10.537s user 1m36.276s sys 0m16.948s $ make clean; time make -j14 real 0m9.117s user 1m39.132s sys 0m18.332s $ make clean; time make -j20 real 0m8.498s user 2m7.064s sys 0m17.912s $ make clean; time make -j22 real 0m7.468s user 2m9.808s sys 0m18.592s $ make clean; time make -j24 real 0m7.336s user 2m15.936s sys 0m19.004s $ make clean; time make -j26 real 0m7.433s user 2m17.612s sys 0m19.648s $ make clean; time make -j28 real 0m7.554s user 2m17.760s sys 0m19.564s $ make clean; time make -j30 real 0m7.563s user 2m16.776s sys 0m21.104s }}} Numbers are jumping slightly from run to run but the gist is best performance is around -j24, not -j12. Single process: {{{ $ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS real 1m15.214s user 1m14.060s sys 0m0.984s $ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS real 0m11.275s user 1m21.708s sys 0m2.912s $ ./synth.bash -j10 +RTS -sstderr -A256M -qb0 -RTS real 0m10.279s user 1m25.184s sys 0m3.664s $ ./synth.bash -j12 +RTS -sstderr -A256M -qb0 -RTS real 0m9.605s user 1m32.688s sys 0m4.292s $ ./synth.bash -j14 +RTS -sstderr -A256M -qb0 -RTS real 0m9.144s user 1m40.288s sys 0m4.964s $ ./synth.bash -j16 +RTS -sstderr -A256M -qb0 -RTS real 0m10.003s user 1m51.916s sys 0m6.604s $ ./synth.bash -j20 +RTS -sstderr -A256M -qb0 -RTS real 0m10.215s user 2m7.924s sys 0m8.208s $ ./synth.bash -j22 +RTS -sstderr -A256M -qb0 -RTS real 0m10.483s user 2m13.440s sys 0m10.456s $ ./synth.bash -j24 +RTS -sstderr -A256M -qb0 -RTS real 0m10.985s user 2m18.028s sys 0m10.780s $ ./synth.bash -j32 +RTS -sstderr -A256M -qb0 -RTS real 0m12.636s user 2m32.312s sys 0m14.508s }}} Here we see best numbers around -j12 and those are worse than multiprocess run. From '''perf record''' it's not very clear what happens. I'll try to get a 64-core VM next week and see if the effect will be visible there much better. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:66 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): So I think this tells us that it is not memory bandwidth saturation, right? Because running individual GHC processes scales better than a single parallel GHC. To be sure, you should also use the larger heap settings with the `make -jN` runs. So there is still extra contention in `ghc -jN`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:67 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Sergei Trofimovich
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): I'm not sure if this is of any help as you already mentioned `sched_yield`, but I noticed that when running even `ghc --make -j2` on my project with 30 modules, `sudo csysdig` (a program to easily report syscall activity) reports betwen 1-2 *million* calls per second. Is this to be expected? That seems like an aweful lot of syscalls. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:69 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): I think it's expected if you have GC work imbalance. Some GC work-stealing threads sit in yield loop at: http://git.haskell.org/ghc.git/blob/HEAD:/rts/sm/GC.c#l981 Imbalance can be seen seen as: '''+RTS -sstderr''' shows it as '''Parallel GC work balance: 80.03% (serial 0%, perfect 100%)'''. and can be tuned by heap size or allocation area ('''-A'''). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:70 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): I've ran '''synth.bash''' benchmark ('''ghc make -j +RTS -A256M''') on current '''ghc-HEAD''' on the following machines: - 4x2 HT desktop, speedup is '''3.9x''' compared to '''-j1''' case (best at -j8) - 12x2 HT haswell VM, speedup is '''12.7x''' compared to '''-j1''' case (best at -j24) - 8x4 HT Power Power7, speedup is '''8.1x''' compared to '''-j1''' case (best at -j12) Very similar speedup is achieved in multiprocess mode (via Makefile in comment:#65). A few facts: - Multiprocess run is ~10% faster than multithreaded. - Specifying N more than logical CPUs available (say, -j100) in '''ghc make -j<N>''' does not lead to severe performance degradation at least on up to 12 core hardware. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:71 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): With a very large `-A` like that you probably want to use nursery chunks: `-n4m`. It's also worth playing with `-qn`, e.g. on the 12x2 Haswell I suggest `-N24 -A128m -n4m -qn12` should work well. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:72 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by gidyn): Would it be possible for `-j` to select more suitable GC parameters by default, for the benefit of those who aren't experts at parallel GC tuning? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:73 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): I think the RTS itself should auto-tune the parameters more than it does. The only reason it doesn't right now is because adding parameters is easy while automatically setting them to good values is hard. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:74 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by carter): what are some good "standard profiles"? those could still be semi parameterized, but we could have "for a workload like xyzlmnop, --batchoffflinethroughput is good" or something -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:75 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by hjarvi): * cc: hjarvi (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:78 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sjakobi): * cc: sjakobi (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:79 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by michalt): * cc: michalt (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:80 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by mnislaih): * cc: mnislaih (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:81 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by rtfeldman): * cc: rtfeldman (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:83 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nh2): Hey, a question about `sched_yield()`: I just read the [http://man7.org/linux/man-pages/man2/sched_yield.2.html man page] for `sched_yield()` again. It says:
`sched_yield()` is intended for use with read-time scheduling policies (i.e., `SCHED_FIFO` or `SCHED_RR`). Use of `sched_yield()` with nondeterministic scheduling policies such as `SCHED_OTHER` is unspecified and very likely means your application design is broken.
Linux provides the `sched_yield()` system call as a mechanism for a
This call used to simply move the process to the end of the run queue; now it moves the process to the "expired" queue, effectively cancelling
Does GHC set the `FIFO` or`RR` policy? If not, then according to that our "application design is broken". I also found some interesting info on http://www.informit.com/articles/article.aspx?p=101760&seqNum=5 (emphasis mine): process to explicitly yield the processor to other waiting processes. It works by removing the process from the active array (where it currently is, because it is running) and inserting it into the expired array. This has the effect of not only preempting the process and putting it at the end of its priority list, but putting it on the expired list — **guaranteeing it will not run for a while**. Because real-time tasks never expire, they are a special case. Therefore, they are merely moved to the end of their priority list (and not inserted into the expired array). **In earlier versions of Linux, the semantics of the `sched_yield()` call were quite different; at best, the task was only moved to the end of their priority list**. The yielding was often not for a very long time. Nowadays, applications and even kernel code should be certain they truly want to give up the processor before calling `sched_yield()`. A similar article on LWN: https://lwn.net/Articles/31462/ the rest of the process's time slice. So a process calling `sched_yield()` now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. The article goes on to explain that this resulted bad performance especially for
threaded applications [that] implement busy-wait loops with `sched_yield()`
Might this be relevant here? Also, can someone explain me why GHC is using `sched_yield()` at all? If the purpose is to wait until other GC threads are done, wouldn't `futex()` be enough? Or is that what's explained in https://ghcmutterings.wordpress.com/2010/01/25/yielding-more-improvements- in-parallel-performance/ ? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:84 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by duog): * cc: duog (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:85 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by duog): ticket:3553 discusses `sched_yield` and has a patch that tries futexes. Apparently `sched_yield` was inexplicably faster. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:86 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by lelf): * cc: lelf (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:88 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.6.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by kanetw): * cc: kanetw (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:90 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by maoe): * cc: maoe (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:91 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
#9221: (super!) linear slowdown of parallel builds on 40 core machine -------------------------------------+------------------------------------- Reporter: carter | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.8.1 Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #910, #8224 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by Wizek): * cc: Wizek (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9221#comment:93 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC