
Simon Peyton Jones via ghc-devs writes:

>> We need to do something about this, and I'd advocate for just not making stats fail with marge.

> Generally I agree. One point you don’t mention is that our perf tests (which CI forces us to look at assiduously) are often pretty weird cases. So there is at least a danger that these more exotic cases will stand in the way of (say) a perf improvement in the typical case.

> But “not making stats fail” is a bit crude. Instead, how about:
To be clear, the proposal isn't to accept stats failures for merge request validation jobs. I believe Moritz was merely suggesting that we accept such failures in marge-bot validations (that is, the pre-merge validation done on batches of merge requests). In my opinion this is reasonable, since we know that none of the MRs in the batch regresses individually. While it's possible that interactions between two or more MRs result in a qualitative change in performance, this seems quite unlikely. What is far *more* likely (and what we see regularly) is that the cumulative effect of a batch of improving patches pushes the batch's overall stat change out of the acceptance threshold. This is quite annoying, as it dooms the entire batch. For this reason, I think we should at the very least accept stat improvements during Marge validations (as you suggest). I agree that we probably want a batch to fail if two patches accumulate to form a regression, even if the two passed CI individually.
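To make that concrete, the rule would be asymmetric: during a marge validation an improvement never fails the batch, while a cumulative regression beyond the usual window still does. A rough sketch of such a check (purely illustrative; the names and numbers are invented, and this is not the actual testsuite-driver code):

    # Illustrative sketch only: an asymmetric acceptance rule for
    # marge-bot batch validation. `relative_change` is the batch's stat
    # change versus the baseline; negative means an improvement.

    def marge_batch_ok(relative_change: float, tolerance: float) -> bool:
        if relative_change <= 0:
            # Improvements are always accepted, even outside the usual
            # window, so a batch of improving patches cannot doom itself.
            return True
        # A cumulative regression still fails, even if every MR in the
        # batch passed its own validation.
        return relative_change <= tolerance

    # A batch improving allocations by 7% passes despite a 5% window;
    # a batch regressing by 7% fails.
    assert marge_batch_ok(-0.07, tolerance=0.05)
    assert not marge_batch_ok(0.07, tolerance=0.05)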
> * We already have per-benchmark windows. If the stat falls outside the window, we fail. You are effectively saying “widen all windows to infinity”. If something makes a stat 10 times worse, I think we *should* fail. But 10% worse? Maybe we should accept and look later, as you suggest. So I’d argue for widening the windows rather than disabling them completely.
Yes, I agree.
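Something along these lines would capture the widened-window idea: keep the per-benchmark window, but multiply it by a widening factor for marge validations, so a 10% drift is tolerated (and investigated later) while a 10x blow-up still fails. Again, this is only a sketch; the factor and window sizes here are made up:

    # Illustrative only: a relative acceptance window that can be widened
    # for batch validation instead of being disabled outright.
    def within_window(measured: float, baseline: float,
                      window: float, widen: float = 1.0) -> bool:
        change = abs(measured - baseline) / baseline
        return change <= window * widen

    baseline = 1_000_000_000              # e.g. bytes allocated on master
    # With a 5% window widened 3x, a +10% change passes for now...
    assert within_window(1_100_000_000, baseline, window=0.05, widen=3.0)
    # ...but a 10x blow-up still fails.
    assert not within_window(10_000_000_000, baseline, window=0.05, widen=3.0)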
> * If we did that we’d need good instrumentation to spot steps and drift in perf, as you say. An advantage is that since the perf instrumentation runs only on committed master patches, not on every CI run, it can cost more. In particular, it could run a bunch of “typical” tests, including nofib and compiling Cabal or other libraries.
We already have the beginnings of such instrumentation.
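For instance, a post-merge job over the per-commit measurements on master could flag both sudden steps and slow cumulative drift, roughly along these lines (a sketch of the kind of analysis such a job could do, not our actual tooling; thresholds and window size are arbitrary):

    # Illustrative sketch: scan per-commit measurements recorded on master
    # for step changes and cumulative drift.
    from statistics import median

    def steps_and_drift(samples, step_threshold=0.05,
                        drift_threshold=0.10, window=10):
        """`samples` is a list of (commit, value) pairs in commit order."""
        steps = []
        for i in range(window, len(samples)):
            commit, value = samples[i]
            base = median(v for _, v in samples[i - window:i])
            if abs(value - base) / base > step_threshold:
                # This commit moved the stat sharply away from the
                # trailing-window median: a candidate "step".
                steps.append((commit, (value - base) / base))

        drift = None
        if len(samples) >= 2 * window:
            first = median(v for _, v in samples[:window])
            last = median(v for _, v in samples[-window:])
            if abs(last - first) / first > drift_threshold:
                # No single step, but the stat has crept over time.
                drift = (last - first) / first

        return steps, drift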
> The big danger is that by relieving patch authors from worrying about perf drift, it’ll end up in the lap of the GHC HQ team. If it’s hard for the author of a single patch (with which she is intimately familiar) to work out why it’s making some test 2% worse, imagine how hard, and demotivating, it’d be for Ben to wonder why 50 patches (with which he is unfamiliar) are making some test 5% worse.
Yes, I absolutely agree with this. I would very much like to avoid having to do this sort of post-hoc investigation any more than necessary.

Cheers,

- Ben