
Sebastian Graf
Hi Andreas,
I similarly benchmark compiler performance by compiling Cabal, but only occasionally. I mostly trust ghc/alloc metrics in CI and check Cabal when I think there's something afoot and/or want to measure runtime, not only allocations.
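(The ghc/alloc figures mentioned here are the compiler's own heap allocations, taken from GHC's RTS statistics. Below is a minimal, hypothetical sketch of collecting that number for a single module by hand; the ghc binary, the module name, and the stats-file path are placeholders, and error handling is omitted.)

```haskell
-- Sketch only: compile one module and report how many bytes the *compiler*
-- allocated while doing so, by asking GHC's own RTS for a machine-readable
-- stats summary ("+RTS -t<file> --machine-readable -RTS").
import System.Process (callProcess)

compilerBytesAllocated :: FilePath -> FilePath -> IO Integer
compilerBytesAllocated ghc src = do
  let statsFile = "ghc-rts-stats.txt"   -- placeholder path
  callProcess ghc
    [ src, "-fforce-recomp"
    , "+RTS", "-t" ++ statsFile, "--machine-readable", "-RTS" ]
  -- The stats file contains a [(String, String)] association list, possibly
  -- preceded by a header line; skip to the opening bracket and read it.
  stats <- read . dropWhile (/= '[') <$> readFile statsFile
  pure $ maybe 0 read (lookup "bytes allocated" (stats :: [(String, String)]))

main :: IO ()
main = compilerBytesAllocated "ghc" "Foo.hs" >>= print
```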
I think this is a very reasonable strategy. When working explicitly on compiler performance I generally default to the Cabal test because 1. the 20 or 90 seconds it takes (depending on optimisation level) is small relative to the time it took to actually find the issue I am trying to fix, and 2. I want to be certain I am not sacrificing compiler performance in one case in exchange for improvements elsewhere; the nofib tests are so small that I find it hard to convince myself that this is the case.
I'm inclined to think that for my purposes (testing the impact of optimisations) the GHC codebase offers sufficient variety to turn up fundamental regressions, but maybe it makes sense to build some packages from head.hackage to detect regressions like https://gitlab.haskell.org/ghc/ghc/-/issues/19203 earlier. It's all a bit open-ended, and I frankly think I wouldn't get anything done if every one of my patches had to get to the bottom of every regression and improvement across the entire head.hackage set. I somewhat trust that users will eventually complain and file a bug report, and that our CI efforts mean that compiler performance will improve in the mean.
Although it's probably more of a tooling problem: I simply don't know how to collect compiler performance metrics for arbitrary cabal packages. If these metrics were collected as part of CI, perhaps as a nightly or weekly job, it would be easier to get to the bottom of a regression before it manifests in a released GHC version. But it all depends on how easy that would be to set up and how many CI cycles it would burn, and I certainly don't feel like I'm in a position to answer either question.
We actually already do this in head.hackage: every GHC commit on `master` runs `head.hackage` with -ddump-timings. The compiler metrics that result are then dumped into a database, which can be queried via Postgrest. IIRC, I described this in an email to ghc-devs a few months ago. Unfortunately, Ryan and I have thus far found it very difficult to keep head.hackage and the associated infrastructure building reliably enough to make this a useful long-term metric. I do hope we can do better in the future; I suspect we will want to be better about marking MRs that may break user code with ~"user facing", allowing us to ensure that head.hackage is updated *before* the change makes it into `master`.

Cheers,

- Ben
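(For readers wondering what post-processing such dumps involve: -ddump-timings emits one line per compiler pass, roughly of the form `<phase> [<module>]: alloc=<bytes> time=<ms>`. The following is only an illustrative sketch of aggregating one dump, not the actual head.hackage tooling; the exact line format is an assumption, and unmatched lines are simply ignored.)

```haskell
-- Illustrative only: fold a .dump-timings file (as written by
-- "-ddump-timings -ddump-to-file") into a total allocation figure.
-- Lines are assumed to end in "alloc=<bytes> time=<ms>"; anything else
-- (section headers, notes) is skipped.
import Data.Maybe (mapMaybe)

data Timing = Timing
  { phase      :: String   -- e.g. "CodeGen [Main]:"
  , allocBytes :: Integer
  , timeMs     :: Double
  } deriving Show

parseLine :: String -> Maybe Timing
parseLine l =
  case reverse (words l) of
    ('t':'i':'m':'e':'=':t) : ('a':'l':'l':'o':'c':'=':a) : rest ->
      Just (Timing (unwords (reverse rest)) (read a) (read t))
    _ -> Nothing

-- Total bytes allocated by the compiler across all passes in one dump.
totalAlloc :: String -> Integer
totalAlloc = sum . map allocBytes . mapMaybe parseLine . lines

main :: IO ()
main = readFile "Main.dump-timings" >>= print . totalAlloc
```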