
Simon Peyton Jones via ghc-devs writes:

>> We need to do something about this, and I'd advocate for just not making stats fail with marge.

> Generally I agree. One point you don’t mention is that our perf tests (which CI forces us to look at assiduously) are often pretty weird cases. So there is at least a danger that these more exotic cases will stand in the way of (say) a perf improvement in the typical case.

> But “not making stats fail” is a bit crude. Instead, how about:
To be clear, the proposal isn't to accept stats failures for merge request validation jobs. I believe Moritz was merely suggesting that we accept such failures in marge-bot validations (that is, the pre-merge validation done on batches of merge requests). In my opinion this is reasonable, since we know that none of the MRs in the batch regresses individually. While it's possible that interactions between two or more MRs result in a qualitative change in performance, this seems quite unlikely. What is far *more* likely (and what we see regularly) is that the cumulative effect of a batch of improving patches pushes the batch's overall stat change out of the acceptance threshold. This is quite annoying, as it dooms the entire batch. For this reason, I think we should at the very least accept stat improvements during Marge validations (as you suggest). I agree that we probably want a batch to fail if two patches accumulate to form a regression, even if the two passed CI individually.
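To make that concrete, the rule would be asymmetric: during a marge validation an improvement never fails the batch, while a cumulative regression beyond the usual window still does. A rough sketch of such a check (purely illustrative; the names and numbers are invented, and this is not the actual testsuite-driver code):

    # Illustrative sketch only: an asymmetric acceptance rule for
    # marge-bot batch validation. `relative_change` is the batch's stat
    # change versus the baseline; negative means an improvement.

    def marge_batch_ok(relative_change: float, tolerance: float) -> bool:
        if relative_change <= 0:
            # Improvements are always accepted, even outside the usual
            # window, so a batch of improving patches cannot doom itself.
            return True
        # A cumulative regression still fails, even if every MR in the
        # batch passed its own validation.
        return relative_change <= tolerance

    # A batch improving allocations by 7% passes despite a 5% window;
    # a batch regressing by 7% fails.
    assert marge_batch_ok(-0.07, tolerance=0.05)
    assert not marge_batch_ok(0.07, tolerance=0.05)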
> * We already have per-benchmark windows. If the stat falls outside the window, we fail. You are effectively saying “widen all windows to infinity”. If something makes a stat 10 times worse, I think we *should* fail. But 10% worse? Maybe we should accept and look later, as you suggest. So I’d argue for widening the windows rather than disabling them completely.
Yes, I agree.
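Something along these lines would capture the widened-window idea: keep the per-benchmark window, but multiply it by a widening factor for marge validations, so a 10% drift is tolerated (and investigated later) while a 10x blow-up still fails. Again, this is only a sketch; the factor and window sizes here are made up:

    # Illustrative only: a relative acceptance window that can be widened
    # for batch validation instead of being disabled outright.
    def within_window(measured: float, baseline: float,
                      window: float, widen: float = 1.0) -> bool:
        change = abs(measured - baseline) / baseline
        return change <= window * widen

    baseline = 1_000_000_000              # e.g. bytes allocated on master
    # With a 5% window widened 3x, a +10% change passes for now...
    assert within_window(1_100_000_000, baseline, window=0.05, widen=3.0)
    # ...but a 10x blow-up still fails.
    assert not within_window(10_000_000_000, baseline, window=0.05, widen=3.0)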
> * If we did that we’d need good instrumentation to spot steps and drift in perf, as you say. An advantage is that since the perf instrumentation runs only on committed master patches, not on every CI run, it can cost more. In particular, it could run a bunch of “typical” tests, including nofib and compiling Cabal or other libraries.
We already have the beginnings of such instrumentation.
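For instance, a post-merge job over the per-commit measurements on master could flag both sudden steps and slow cumulative drift, roughly along these lines (a sketch of the kind of analysis such a job could do, not our actual tooling; thresholds and window size are arbitrary):

    # Illustrative sketch: scan per-commit measurements recorded on master
    # for step changes and cumulative drift.
    from statistics import median

    def steps_and_drift(samples, step_threshold=0.05,
                        drift_threshold=0.10, window=10):
        """`samples` is a list of (commit, value) pairs in commit order."""
        steps = []
        for i in range(window, len(samples)):
            commit, value = samples[i]
            base = median(v for _, v in samples[i - window:i])
            if abs(value - base) / base > step_threshold:
                # This commit moved the stat sharply away from the
                # trailing-window median: a candidate "step".
                steps.append((commit, (value - base) / base))

        drift = None
        if len(samples) >= 2 * window:
            first = median(v for _, v in samples[:window])
            last = median(v for _, v in samples[-window:])
            if abs(last - first) / first > drift_threshold:
                # No single step, but the stat has crept over time.
                drift = (last - first) / first

        return steps, drift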
> The big danger is that by relieving patch authors from worrying about perf drift, it’ll end up in the lap of the GHC HQ team. If it’s hard for the author of a single patch (with which she is intimately familiar) to work out why it’s making some test 2% worse, imagine how hard, and demotivating, it’d be for Ben to wonder why 50 patches (with which he is unfamiliar) are making some test 5% worse.
Yes, I absolutely agree with this. I would very much like to avoid having to do this sort of post-hoc investigation any more than necessary.

Cheers,

- Ben