
Hi everyone,

Last week I discussed the plan for merging the LinearTypes branch into GHC 8.12 with Arnaud, Richard, Andreas, and Simon. Many thanks to all of them for their respective roles in pushing this patch over the finish line.

One thing that we wanted to examine prior to merge is compiler performance across a larger collection of packages. For this I used the head.hackage patch-set, comparing the Linear Types branch with its corresponding base commit in `master`. Here I will describe the methodology used for this comparison and briefly summarize the (happily, quite positive) results.

# Methodology

I collected total bytes allocated (as reported by the runtime system), elapsed runtime (as reported by the runtime system), and instructions (as reported by `perf stat`) of head.hackage builds in two configurations:

* the `opt` configuration
* the `noopt` configuration, which passed `--disable-optimisation` to cabal-install

These configurations were evaluated on two commits:

* `master`: 2b792facab46f7cdd09d12e79499f4e0dcd4293f
* `linear-bang`: 481cf412d6e619c0e47960f4c70fb21f19d6996d

Unfortunately, the `noopt` configuration appears to be affected by a few cabal-install bugs [1,2] and consequently some packages may *still* be compiled with optimisation, so take these numbers with a grain of salt.

The test environment was a reasonably quiet Ryzen 7 1800X with 32 GBytes of RAM.

The test was run by first building the two tested commits in Hadrian's default build flavour.
The head.hackage CI driver was then invoked as follows:

    # Don't parallelize for stable performance measurements
    export CPUS=1
    export USE_NIX=1
    export EXTRA_HC_OPTS=-ddump-timings
    export COLLECT_PERF_STATS=1

    mkdir -p runs

    # master
    export GHC=/home/ben/ghc/ghc-compare-2/_build/stage1/bin/ghc
    ./run-ci --cabal-option=--disable-optimisation
    mv ci/run runs/master-noopt
    ./run-ci
    mv ci/run runs/master-opt

    # linear-bang
    export GHC=/home/ben/ghc/ghc-compare-1/_build/stage1/bin/ghc
    ./run-ci --cabal-option=--disable-optimisation
    mv ci/run runs/linear-noopt
    ./run-ci
    mv ci/run runs/linear-opt

As we are building all packages (nearly 300 in total) serially, the full run takes quite a while (around 8 hours IIRC).

The final run of this test used head.hackage commit e7e5c5cfbfd42c41b1e62d42bb18483a83b78701 (on the `rts-stats` branch).

# Results

I examined several different metrics of compiler performance:

* the total_wall_seconds RTS metric gives a picture of overall compilation effort
* time reported by -ddump-timings, summed by module, gives a slightly finer-grained measurement of per-module compilation time
* the RTS's bytes_allocated metric gives overall compiler allocations
* the RTS's max_bytes_used metric gives a sense of AST size (and potentially the existence of leaks)

To cut straight to the chase, the measurements show the following:

    metric                  -O0         -O1
    ---------------------   ---------   ---------
    total_wall_seconds      +0.3%       +0.6%
    total_cpu_seconds       +0.3%       +0.7%
    max_bytes_used          +4.2%       +4.8%
    GC_cpu_seconds          +1.5%       +2.1%
    mut_cpu_seconds         no change   no change
    sum(per-module-time)    +4.2%       +4.2%
    sum(per-module-alloc)   +0.8%       +0.8%

There are a few things to point out here: the overall change in compiler runtime is thankfully quite reasonable. However, max_bytes_used increases rather considerably. This seems to give rise to an appreciable regression in GC time. It would be interesting to know whether this can be improved with optimisations to data representation.
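For readers who want to reproduce this kind of comparison, the percentages in the table above are relative changes of the `linear-bang` run against the `master` run. The following is only a minimal sketch of that arithmetic in Python, not the code from the notebook: the function names `relative_change` and `summarize` are hypothetical, and the numbers in `master` and `linear` are placeholders chosen for illustration, not measurements from these runs.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percentage change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def summarize(baseline: dict, candidate: dict) -> dict:
    """Relative change for every metric that appears in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: relative_change(baseline[name], candidate[name]) for name in shared}


# Placeholder values for illustration only; NOT data from the runs described here.
master = {"total_wall_seconds": 1000.0, "max_bytes_used": 200.0}
linear = {"total_wall_seconds": 1003.0, "max_bytes_used": 210.0}

for metric, change in summarize(master, linear).items():
    # Print a signed percentage with one decimal place, as in the table.
    print(f"{metric:20s} {change:+.1f}%")
```

Metrics missing from either run are silently skipped here; a real analysis would probably want to report them instead.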
The fact that the cumulative per-module metrics didn't change between -O0 and -O1 indicates to me that there is a methodological problem which needs to be addressed in the test infrastructure. I investigated this a bit and have a hypothesis for what might be going on here; nevertheless, in the interest of publishing these results I'm ignoring the per-module measurements for the time being.

I have attached the Jupyter notebook that gave rise to these numbers. This gives a finer-grained breakdown of the data, including histograms showing the variance of each metric. Perhaps this will be helpful in better understanding the effects. I would be happy to share my run data as well, although it is a bit large.

All in all, the Tweag folks have done a great job in squashing the performance issues noticed a few weeks ago. The current numbers look quite acceptable for GHC 8.12. Congratulations to Arnaud, Krzysztof, and Richard on landing this feature! I'm very much looking forward to seeing what the community does with it in the coming years.

Cheers,

- Ben

[1] https://github.com/haskell/cabal/issues/5353
[2] https://github.com/haskell/cabal/issues/3883

Great work! I'm very excited to see these perf issues squashed.
Thanks to everyone working on this, and also to Ben for such thorough
benchmarking work!
On Wed, Jun 17, 2020, 11:11 AM Ben Gamari