Hi Karel,
Could you try adding `-j8` to `SRC_HC_OPTS` for the build flavor you're using in `mk/build.mk`, and running `gmake -j8` instead of `gmake -j64`? A graph like the one you attached will likely look even worse, but the wall-clock time of your build should hopefully improve.
The build system currently seems to rely entirely on `make` for parallelism. It doesn't exploit GHC's own parallel `--make` at all unless you explicitly add `-jn` to `SRC_HC_OPTS`, with n>1. That flag also sets the number of capabilities for the runtime system, so adding `+RTS -Nn` as well is not needed.
Case study: one of the first things the build system does is build ghc-cabal and Cabal using the stage 0 compiler, through a single invocation of `ghc --make`. All the later make targets depend on that step completing first. Because `ghc --make` is not instructed to build in parallel, using `make -j1` or `make -j100000` doesn't make any difference for that step. I think your graph shows that there are many more such bottlenecks.
You would have to find out empirically how best to divide your number of threads (32) between `make` and `ghc --make`. From reading this comment by Simon in #9221, I understand it's better not to call `ghc --make -jn` with `n` higher than the number of physical cores of your machine (8 in your case). Once you get some better parallelism, other flags like `-A` might also have an effect on wall-clock time (see that ticket).
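To get a list of combinations to time, something like the sketch below might help. It rests on an assumption of mine, not on anything measured: that the `make` job count multiplied by the `ghc --make` job count should equal the total thread count. The function name `candidate_splits` is made up for illustration. Treat the output as starting points only; oversubscribing (for example `gmake -j8` together with `-j8`) may well turn out faster.

```shell
#!/bin/sh
# candidate_splits TOTAL MAX_GHC_J   (hypothetical helper, not part of the GHC build system)
# Prints every pair (make jobs, ghc jobs) with make_jobs * ghc_jobs = TOTAL
# and ghc_jobs <= MAX_GHC_J. The "product equals TOTAL" rule is an assumption.
candidate_splits() {
    total=$1
    max_ghc_j=$2
    n=1
    while [ "$n" -le "$max_ghc_j" ]; do
        # Keep only the splits that divide the thread count evenly.
        if [ $((total % n)) -eq 0 ]; then
            echo "make -j$((total / n)), ghc -j$n"
        fi
        n=$((n + 1))
    done
}

# 32 threads, at most 8 jobs for ghc --make (the number of physical cores).
candidate_splits 32 8
```

For each line printed, you would put the `ghc` value into `SRC_HC_OPTS` in `mk/build.mk`, run `gmake` with the `make` value, and compare wall-clock times.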
-Thomas