I am not advocating dropping perf tests for merge requests; I just want them not to be fatal for marge batches. Yes, this means that a bunch of unrelated merge requests could each pass the perf checks individually while the aggregate fails them, and then the next MR against the merged aggregate will start failing. Even that is a pretty bad situation, imo.
I honestly don't have a good answer; I just see marge work on batches, over and over and over again, only to fail. Eventually marge should figure out a subset of the merges that fits into the perf window, but that might take 10 tries, so up to ~30+ hours, which means no merge request would land in GHC for 30 hours. I find that rather unacceptable.
I think we need better visualisation of perf regressions that happen on master. Ben has some WIP for this, and I think John said there might be some way to add a nice (maybe reflex) UI to it. If we can see regressions on master easily, and go from "oh, at this point in time GHC got worse" to "this is the commit", we might be able to figure it out.
But what do we expect of patch authors? Right now, if five people write patches for GHC, and each of them eventually manages to get their MR green after a long review, they finally see it assigned to marge, and then it starts failing? Each patch on its own was fine, but the aggregate with other people's code leads to regressions? So we now expect all the patch authors together to try to figure out what happened? Figuring out why something regressed is hard enough, and we have only very few people who are actually capable of debugging this. Thus I believe it would end up with Ben, Andreas, Matthew, Simon, ... or someone else from GHC HQ having to figure out why it regressed anyway, be it in the review stage, while dissecting a marge aggregate, or on master.
Thus I believe in most cases we'd have to look at the regressions anyway; right now we just make working on GHC a rather depressing job in a convoluted way. Raising the barrier to entry by also requiring everyone to have absolutely stellar perf-debugging skills is asking quite a lot.
There is also the question of whether our synthetic benchmarks actually measure real-world performance. Do regressions in the micro benchmarks translate into the same regressions when building, say, aeson, vector, or Cabal? The latter is what most practitioners care about, far more than the micro benchmarks.
Again, I'm absolutely not in favour of GHC regressing; it's slow enough as it is. I just think CI should be assisting us, not holding development back.
Cheers,
Moritz