
Hi all,

Recently our performance tests have been causing quite a bit of pain. One reason for this is our new Darwin runners (see #19025), which (surprisingly) differ significantly in their performance characteristics (perhaps due to running Big Sur or using native tools provided by nix?).

However, this is further exacerbated by the fact that quite a few people are currently working on compiler performance (hooray!). This leads to the following failure mode during Marge jobs:

1. Merge request A improves test T1234 by 0.5%, which is within the test's acceptance window, and therefore CI passes.
2. Merge request B *also* improves test T1234 by another 0.5%, which similarly passes CI.
3. Marge tries to merge MRs A and B in a batch but finds that the combined 1% improvement in T1234 is *outside* the acceptance window. Consequently, the batch fails.

This is quite painful, especially given that it creates work for those trying to improve GHC (as the saying goes: no good deed goes unpunished).

To mitigate this I would suggest that we allow performance test failures in marge-bot pipelines. A slightly weaker variant of this idea would instead only allow performance *improvements*. I suspect the latter would get most of the benefit, while eliminating the possibility that a large regression goes unnoticed.

Thoughts?

Cheers,

- Ben
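As a concrete illustration of this failure mode, the following minimal Haskell sketch models an acceptance-window check. The +/-0.8% tolerance and the measurement values are invented for the example and are not the testsuite's actual settings.

-- Minimal sketch of an acceptance-window check (hypothetical numbers,
-- not GHC's actual testsuite driver logic).

-- | Relative change of a new measurement against a baseline,
-- e.g. -0.005 for a 0.5% improvement.
relChange :: Double -> Double -> Double
relChange baseline new = (new - baseline) / baseline

-- | A measurement is accepted if its relative change stays within the window.
withinWindow :: Double  -- ^ tolerance, e.g. 0.008 for a +/-0.8% window (made up)
             -> Double  -- ^ baseline measurement
             -> Double  -- ^ new measurement
             -> Bool
withinWindow tol baseline new = abs (relChange baseline new) <= tol

main :: IO ()
main = do
  let tol      = 0.008             -- hypothetical acceptance window
      baseline = 1000 :: Double    -- e.g. allocations on master, arbitrary units
      afterA   = baseline * 0.995  -- MR A: 0.5% improvement
      afterAB  = afterA   * 0.995  -- MR B on top of A: another 0.5% improvement
  print (withinWindow tol baseline afterA)   -- True:  A passes against master
  print (withinWindow tol afterA   afterAB)  -- True:  B passes against A
  print (withinWindow tol baseline afterAB)  -- False: the batched ~1% change
                                             --        exceeds the window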

On Feb 21, 2021, at 11:24 AM, Ben Gamari wrote:

> To mitigate this I would suggest that we allow performance test failures in marge-bot pipelines. A slightly weaker variant of this idea would instead only allow performance *improvements*. I suspect the latter would get most of the benefit, while eliminating the possibility that a large regression goes unnoticed.
The value in making performance improvements a test failure is so that patch authors can be informed of what they have done, to make sure it matches expectations. This need can reasonably be satisfied without stopping merging. That is, if Marge can accept performance improvements, while (say) posting to each MR involved that it may have contributed to a performance improvement, then I think we've done our job here.

On the other hand, a performance degradation is a bug, just like, say, an error message regression. Even if it's a combination of commits that causes the problem (an actual possibility even for error message regressions), it's still a bug that we need to either fix or accept (balanced out by other improvements). The pain of debugging this scenario might be mitigated if there were a collation of the performance wibbles for each individual commit. This information is, in general, available: each commit passed CI on its own, so it should be possible to create a little report with its rows being perf tests and its columns being commits or MR #s; each cell in the table would be a percentage regression. If we're lucky, the regression Marge sees will be the sum(*) of the entries in one of the rows -- this means that we have a simple agglomeration of performance degradation. If we're less lucky, the whole will not equal the sum of the parts, and some of the patches interfere. In either case, the table would suggest a likely place to look next.

(*) I suppose if we're recording percentages, it wouldn't necessarily be the actual sum, because percentages are a bit funny. But you get my meaning.

Pulling this all together:

* I'm against the initial proposal of allowing all performance failures by Marge. This will allow bugs to accumulate (in my opinion).
* I'm in favor of allowing performance improvements to be accepted by Marge.
* To mitigate the information loss of Marge accepting performance improvements, it would be great if Marge could alert MR authors that a cumulative performance improvement took place.
* To mitigate the annoyance of finding a performance regression in a merge commit that does not appear in any component commit, it would be great if there were a tool to collect performance numbers from a set of commits and present them in a table for further analysis.

These "mitigations" might take work. If we can't find the labor to complete this work, I'm in favor of simply allowing the performance improvements, perhaps also filing a ticket about these potential improvements to the process.

Richard
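As a rough sketch of the kind of per-commit report described above (which does not exist yet), the following Haskell snippet lays out rows of perf tests against columns of MRs; the test name, MR numbers, and tab-separated layout are all invented for the example. It also shows the footnote in action: relative changes compose multiplicatively, so the combined change is close to, but not exactly, the sum of the per-MR percentages.

-- Hypothetical report: rows are perf tests, columns are MRs, cells are
-- relative changes; the last two columns compare the naive sum with the
-- true (multiplicative) combination.
import Data.List (intercalate)
import Text.Printf (printf)

-- | One perf test together with the relative change contributed by each MR
-- (e.g. -0.005 for a 0.5% improvement).
type Row = (String, [Double])

-- | Relative changes compose multiplicatively: (1 + a)(1 + b) - 1, not a + b.
combined :: [Double] -> Double
combined = subtract 1 . product . map (1 +)

report :: [String] -> [Row] -> String
report mrs rows = unlines (header : map row rows)
  where
    header = intercalate "\t" ("test" : mrs ++ ["sum", "combined"])
    row (test, ds) =
      intercalate "\t" (test : map pct (ds ++ [sum ds, combined ds]))
    pct :: Double -> String
    pct d = printf "%+.3f%%" (d * 100)

main :: IO ()
main = putStr (report ["!1234", "!1235"] [("T1234", [-0.005, -0.005])])
-- The combined change is (1 - 0.005)^2 - 1 ~ -0.9975%, slightly smaller in
-- magnitude than the -1% naive sum: percentages are indeed "a bit funny".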

This seems quite reasonable to me. Not sure about the cost of implementing it (and the feasibility of it if/when merge trains arrive).

Andreas

If we make the changes proposed in the "On CI" thread, I think we solve most of this for free. If a perf test fails, resubmitting should not need to rebuild GHC, because the build is cached. I'd say at that point it's not even worth making Marge-bot auto-accept extra improvements, because restarting with new perf windows should be so easy.

John

Ben Gamari
To get things un-stuck I have disabled the affected tests on Darwin for the time being. I hope we will be able to re-enable these tests once we have migrated fully to the new runners, although only time will tell. I will try to rebase the open MRs that are currently failing only due to spurious performance failures, but please do feel free to hit rebase yourself if I miss any.

Cheers,

- Ben
participants (4)
- Andreas Klebinger
- Ben Gamari
- John Ericson
- Richard Eisenberg