
Hi all,

Recently our performance tests have been causing quite a bit of pain. One reason for this is our new Darwin runners (see #19025), which (surprisingly) differ significantly in their performance characteristics (perhaps due to running Big Sur or using native tools provided by nix?).

However, this is further exacerbated by the fact that quite a few people are currently working on compiler performance (hooray!). This leads to the following failure mode during Marge jobs:

1. Merge request A improves test T1234 by 0.5%, which is within the test's acceptance window, and therefore CI passes.
2. Merge request B *also* improves test T1234 by another 0.5%, which similarly passes CI.
3. Marge tries to merge MRs A and B in a batch but finds that the combined 1% improvement in T1234 is *outside* the acceptance window. Consequently, the batch fails.

This is quite painful, especially given that it creates work for those trying to improve GHC (as the saying goes: no good deed goes unpunished).

To mitigate this I would suggest that we allow performance test failures in marge-bot pipelines. A slightly weaker variant of this idea would instead only allow performance *improvements*. I suspect the latter would get most of the benefit, while eliminating the possibility that a large regression goes unnoticed.

Thoughts?

Cheers,

- Ben
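As a concrete illustration of this failure mode, the following minimal Haskell sketch models an acceptance-window check. The +/-0.8% tolerance and the measurement values are invented for the example and are not the testsuite's actual settings.

-- Minimal sketch of an acceptance-window check (hypothetical numbers,
-- not GHC's actual testsuite driver logic).

-- | Relative change of a new measurement against a baseline,
-- e.g. -0.005 for a 0.5% improvement.
relChange :: Double -> Double -> Double
relChange baseline new = (new - baseline) / baseline

-- | A measurement is accepted if its relative change stays within the window.
withinWindow :: Double  -- ^ tolerance, e.g. 0.008 for a +/-0.8% window (made up)
             -> Double  -- ^ baseline measurement
             -> Double  -- ^ new measurement
             -> Bool
withinWindow tol baseline new = abs (relChange baseline new) <= tol

main :: IO ()
main = do
  let tol      = 0.008             -- hypothetical acceptance window
      baseline = 1000 :: Double    -- e.g. allocations on master, arbitrary units
      afterA   = baseline * 0.995  -- MR A: 0.5% improvement
      afterAB  = afterA   * 0.995  -- MR B on top of A: another 0.5% improvement
  print (withinWindow tol baseline afterA)   -- True:  A passes against master
  print (withinWindow tol afterA   afterAB)  -- True:  B passes against A
  print (withinWindow tol baseline afterAB)  -- False: the batched ~1% change
                                             --        exceeds the window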

On Feb 21, 2021, at 11:24 AM, Ben Gamari wrote:

> To mitigate this I would suggest that we allow performance test failures in marge-bot pipelines. A slightly weaker variant of this idea would instead only allow performance *improvements*. I suspect the latter would get most of the benefit, while eliminating the possibility that a large regression goes unnoticed.
The value in making performance improvements a test failure is so that patch authors can be informed of what they have done, to make sure it matches expectations. This need can reasonably be satisfied without stopping merging. That is, if Marge can accept performance improvements, while (say) posting to each MR involved that it may have contributed to a performance improvement, then I think we've done our job here.

On the other hand, a performance degradation is a bug, just like, say, an error message regression. Even if it's a combination of commits that causes the problem (an actual possibility even for error message regressions), it's still a bug that we need to either fix or accept (balanced out by other improvements). The pain of debugging this scenario might be mitigated if there were a collation of the performance wibbles for each individual commit. This information is, in general, available: each commit passed CI on its own, so it should be possible to create a little report with its rows being perf tests and its columns being commits or MR #s; each cell in the table would be a percentage regression. If we're lucky, the regression Marge sees will be the sum(*) of the entries in one of the rows -- this means that we have a simple agglomeration of performance degradation. If we're less lucky, the whole will not equal the sum of the parts, and some of the patches interfere. In either case, the table would suggest a likely place to look next.

(*) I suppose if we're recording percentages, it wouldn't necessarily be the actual sum, because percentages are a bit funny. But you get my meaning.

Pulling this all together:

* I'm against the initial proposal of allowing all performance failures by Marge. This will allow bugs to accumulate (in my opinion).
* I'm in favor of allowing performance improvements to be accepted by Marge.
* To mitigate the information loss of Marge accepting performance improvements, it would be great if Marge could alert MR authors that a cumulative performance improvement took place.
* To mitigate the annoyance of finding a performance regression in a merge commit that does not appear in any component commit, it would be great if there were a tool to collect performance numbers from a set of commits and present them in a table for further analysis.

These "mitigations" might take work. If we can't find the labor to complete this work, I'm in favor of simply allowing the performance improvements, perhaps also filing a ticket about these potential improvements to the process.

Richard
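As a rough sketch of the kind of per-commit report described above (which does not exist yet), the following Haskell snippet lays out rows of perf tests against columns of MRs; the test name, MR numbers, and tab-separated layout are all invented for the example. It also shows the footnote in action: relative changes compose multiplicatively, so the combined change is close to, but not exactly, the sum of the per-MR percentages.

-- Hypothetical report: rows are perf tests, columns are MRs, cells are
-- relative changes; the last two columns compare the naive sum with the
-- true (multiplicative) combination.
import Data.List (intercalate)
import Text.Printf (printf)

-- | One perf test together with the relative change contributed by each MR
-- (e.g. -0.005 for a 0.5% improvement).
type Row = (String, [Double])

-- | Relative changes compose multiplicatively: (1 + a)(1 + b) - 1, not a + b.
combined :: [Double] -> Double
combined = subtract 1 . product . map (1 +)

report :: [String] -> [Row] -> String
report mrs rows = unlines (header : map row rows)
  where
    header = intercalate "\t" ("test" : mrs ++ ["sum", "combined"])
    row (test, ds) =
      intercalate "\t" (test : map pct (ds ++ [sum ds, combined ds]))
    pct :: Double -> String
    pct d = printf "%+.3f%%" (d * 100)

main :: IO ()
main = putStr (report ["!1234", "!1235"] [("T1234", [-0.005, -0.005])])
-- The combined change is (1 - 0.005)^2 - 1 ~ -0.9975%, slightly smaller in
-- magnitude than the -1% naive sum: percentages are indeed "a bit funny".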

This seems quite reasonable to me. Not sure about the cost of implementing it (and the feasibility of it if/when merge trains arrive).

Andreas

If we make the changes proposed in the "On CI" thread, I think we solve most of this for free. If a perf test fails, resubmitting should not need to rebuild GHC, because the build is cached. I'd say at that point it's not even worth making Marge-bot auto-accept extra improvements, because restarting with new perf windows should be so easy.

John

Ben Gamari
To get things un-stuck I have disabled the affected tests on Darwin for the time being. I hope we will be able to re-enable these tests once we have migrated fully to the new runners, although only time will tell. I will try to rebase the open MRs that are currently failing only due to spurious performance failures, but please do feel free to hit rebase yourself if I miss any.

Cheers,

- Ben
participants (4)
- Andreas Klebinger
- Ben Gamari
- John Ericson
- Richard Eisenberg