
Ben Gamari
Hi all,
Recently, our performance tests have been causing quite some pain. One reason for this is our new Darwin runners (see #19025), which (surprisingly) differ significantly in their performance characteristics (perhaps due to running Big Sur or using native tools provided by nix?).
However, this is further exacerbated by the fact that there are quite a few people currently working on compiler performance (hooray!). This leads to the following failure mode during Marge jobs:
1. Merge request A improves test T1234 by 0.5%, which is within the test's acceptance window, and therefore CI passes.
2. Merge request B *also* improves test T1234 by another 0.5%, which similarly passes CI.
3. Marge tries to merge MRs A and B in a batch but finds that the combined 1% improvement in T1234 is *outside* the acceptance window. Consequently, the batch fails.
This is quite painful, especially given that it creates work for those trying to improve GHC (as the saying goes: no good deed goes unpunished).
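To make the arithmetic concrete, here is a minimal sketch of the kind of relative-change check involved. The ±0.75% window and the baseline value are made up purely for illustration; the real logic in GHC's testsuite driver is more involved.

```python
# Hypothetical acceptance-window check; numbers are illustrative only.
def within_window(baseline: float, measured: float, window_pct: float = 0.75) -> bool:
    """True if the relative change (in percent) is inside the acceptance window."""
    change_pct = (measured - baseline) / baseline * 100
    return abs(change_pct) <= window_pct

baseline = 1_000_000  # e.g. bytes allocated by T1234 on the baseline commit

# MR A alone: a 0.5% improvement is inside the window, so CI passes.
print(within_window(baseline, baseline * 0.995))          # True

# MR B alone: another 0.5% improvement also passes.
print(within_window(baseline, baseline * 0.995))          # True

# Marge's batch of A and B: the combined ~1% improvement falls outside
# the window, so the batch fails despite both MRs passing individually.
print(within_window(baseline, baseline * 0.995 * 0.995))  # False
```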
To mitigate this, I would suggest that we allow performance test failures in marge-bot pipelines. A slightly weaker variant of this idea would be to allow only out-of-window performance *improvements*. I suspect the latter would get most of the benefit while eliminating the possibility that a large regression goes unnoticed.
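As a rough sketch of what the weaker variant could look like in the testsuite driver: when the pipeline is running for marge-bot, accept out-of-window metric *decreases* (improvements) while still failing on regressions. How the driver would detect a marge-bot pipeline is left open here; the IS_MARGE_BOT_PIPELINE variable below is purely hypothetical.

```python
import os

# Hypothetical relaxation of the window check for marge-bot pipelines.
def perf_test_passes(baseline: float, measured: float, window_pct: float) -> bool:
    change_pct = (measured - baseline) / baseline * 100
    in_window = abs(change_pct) <= window_pct

    if os.environ.get("IS_MARGE_BOT_PIPELINE") == "1":
        # In a batch pipeline, let out-of-window *improvements* through so
        # that stacked improvements from several MRs cannot sink the batch;
        # regressions beyond the window still fail as usual.
        return in_window or change_pct < 0

    return in_window
```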
To get things un-stuck I have disabled the affected tests on Darwin for the time being. I hope we will be able to re-enable these tests once we have migrated fully to the new runners, although only time will tell. I will try to rebase the open MRs that are currently failing only due to spurious performance failures, but please do feel free to hit rebase yourself if I miss any.

Cheers,

- Ben