
Ben Gamari
Hi all,
Recently, our performance tests have been causing quite some pain. One reason for this is our new Darwin runners (see #19025), which (surprisingly) differ significantly in their performance characteristics (perhaps due to running Big Sur or using native tools provided by nix?).
However, this is further exacerbated by the fact that there are quite a few people currently working on compiler performance (hooray!). This leads to the following failure mode during Marge jobs:
1. Merge request A improves test T1234 by 0.5%, which is within the test's acceptance window, and therefore CI passes.
2. Merge request B *also* improves test T1234 by another 0.5%, which similarly passes CI.
3. Marge tries to merge MRs A and B in a batch but finds that the combined 1% improvement in T1234 is *outside* the acceptance window. Consequently, the batch fails.
This is quite painful, especially given that it creates work for those trying to improve GHC (as the saying goes: no good deed goes unpunished).
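To make the arithmetic concrete, here is a minimal sketch of the kind of relative-change check involved. The ±0.75% window and the baseline value are made up purely for illustration; the real logic in GHC's testsuite driver is more involved.

```python
# Hypothetical acceptance-window check; numbers are illustrative only.
def within_window(baseline: float, measured: float, window_pct: float = 0.75) -> bool:
    """True if the relative change (in percent) is inside the acceptance window."""
    change_pct = (measured - baseline) / baseline * 100
    return abs(change_pct) <= window_pct

baseline = 1_000_000  # e.g. bytes allocated by T1234 on the baseline commit

# MR A alone: a 0.5% improvement is inside the window, so CI passes.
print(within_window(baseline, baseline * 0.995))          # True

# MR B alone: another 0.5% improvement also passes.
print(within_window(baseline, baseline * 0.995))          # True

# Marge's batch of A and B: the combined ~1% improvement falls outside
# the window, so the batch fails despite both MRs passing individually.
print(within_window(baseline, baseline * 0.995 * 0.995))  # False
```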
To mitigate this, I would suggest that we allow performance test failures in marge-bot pipelines. A slightly weaker variant of this idea would be to allow only out-of-window performance *improvements*. I suspect the latter would get most of the benefit while eliminating the possibility that a large regression goes unnoticed.
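As a rough sketch of what the weaker variant could look like in the testsuite driver: when the pipeline is running for marge-bot, accept out-of-window metric *decreases* (improvements) while still failing on regressions. How the driver would detect a marge-bot pipeline is left open here; the IS_MARGE_BOT_PIPELINE variable below is purely hypothetical.

```python
import os

# Hypothetical relaxation of the window check for marge-bot pipelines.
def perf_test_passes(baseline: float, measured: float, window_pct: float) -> bool:
    change_pct = (measured - baseline) / baseline * 100
    in_window = abs(change_pct) <= window_pct

    if os.environ.get("IS_MARGE_BOT_PIPELINE") == "1":
        # In a batch pipeline, let out-of-window *improvements* through so
        # that stacked improvements from several MRs cannot sink the batch;
        # regressions beyond the window still fail as usual.
        return in_window or change_pct < 0

    return in_window
```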
To get things un-stuck I have disabled the affected tests on Darwin for the time being. I hope we will be able to re-enable these tests once we have migrated fully to the new runners, although only time will tell. I will try to rebase the open MRs that are currently failing only due to spurious performance failures, but please do feel free to hit rebase yourself if I miss any.

Cheers,

- Ben