
On Tue, Dec 6, 2016 at 2:44 AM Ben Gamari wrote:

> Michal Terepeta writes:
> [...]
>> Looking at the comments on the proposal from Moritz, most people would
>> prefer to extend/improve nofib or `tests/perf/compiler` tests. So I guess
>> the main question is - what would be better:
>> - Extending nofib with modules that are compile-only (i.e., not runnable)
>>   and focus on stressing the compiler?
>> - Extending `tests/perf/compiler` with the ability to run all the tests
>>   and do easy "before and after" comparisons?
> I don't have a strong opinion on which of these would be better. However,
> I would point out that currently the tests/perf/compiler tests are
> extremely labor-intensive to maintain while doing relatively little to
> catch performance regressions. There are a few issues here:
>
>  * some tests aren't very reproducible between runs, meaning that
>    contributors sometimes don't catch regressions in their local
>    validations
>
>  * many tests aren't very reproducible between platforms and all tests
>    are inconsistent between differing word sizes. This means that we end
>    up having many sets of expected performance numbers in the testsuite.
>    In practice nearly all of these except 64-bit Linux are out-of-date.
>
>  * our window-based acceptance criterion for performance metrics doesn't
>    catch most regressions, which typically bump allocations by a couple
>    percent or less (whereas the acceptance thresholds range from 5% to
>    20%). This means that the testsuite fails to catch many deltas, only
>    failing when some unlucky person finally pushes the number over the
>    threshold.
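
(Agreed. Just to spell out why the window doesn't help much - a tiny
sketch, with made-up numbers, of how such a check behaves as I understand
it:)

    module WindowSketch where

    -- Window-based acceptance: pass while the measured value stays
    -- within +/- tol (relative) of the recorded expectation.
    withinWindow :: Double -> Double -> Double -> Bool
    withinWindow tol expected actual =
      abs (actual - expected) / expected <= tol

    -- Made-up numbers: expectation of 1e9 allocated bytes, 10% window,
    -- four successive ~3% regressions.
    example :: [Bool]
    example = map (withinWindow 0.10 1.0e9) [1.03e9, 1.06e9, 1.09e9, 1.12e9]
    -- => [True, True, True, False]

As long as nobody updates the expected value, each small regression passes
on its own, and only the last, equally innocent change trips the check.
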
> Joachim and I discussed this issue a few months ago at Hac Phi; he had an
> interesting approach to tracking expected performance numbers which may
> both alleviate these issues and reduce the maintenance burden that the
> tests pose. I wrote down some terse notes in #12758.
Thanks for mentioning the ticket!

To be honest, I'm not a huge fan of having performance tests treated the
same as any other tests. IMHO they are quite different:

- They usually need a quiet environment (e.g., you cannot run two different
  tests at the same time), whereas with ordinary correctness tests I can
  run as many as I want concurrently.
- The output is not really binary (correct vs incorrect) but some kind of
  number (or collection of numbers) that we want to track over time.
- The decision whether to fail is harder. Since the output might be noisy,
  you need either quite relaxed bounds (and miss small regressions) or
  stronger bounds (and suffer from flakiness and maintenance overhead).

So for the purpose of "I have a small change and want to check its effect
on compiler performance, expecting, e.g., a ~1% difference", the model of
running benchmarks separately from tests is much nicer. I can run them when
I'm not doing anything else on the computer and then easily compare the
results. (That's what I usually do for nofib.) For tracking performance
over time, one could set something up to run the benchmarks when the
machine is idle. (Isn't that what perf.haskell.org is doing?)

Due to that, if we want to extend tests/perf/compiler to support this use
case, I think we should include benchmarks there that are *not* tests (and
are not included in ./validate), together with some easy tool to run all of
them and give you a quick comparison of what's changed. To a certain degree
this would then be orthogonal to the improvements suggested in the ticket,
but we could probably reuse some things (e.g., dumping .csv files for perf
metrics? see the rough sketch below).

How should we proceed? Should I open a new ticket focused on this? (Maybe
we could try to figure out all the details there?)

Thanks,
Michal
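
PS: To make the .csv idea a bit more concrete, here is a rough sketch of
the kind of "before and after" comparison I have in mind. The file format,
names, and everything else here are made up, just for illustration: each
run would dump one .csv with "test,metric,value" lines, and a small tool
would print the relative change per metric.

    module Main (main) where

    import qualified Data.Map.Strict as M
    import System.Environment (getArgs)
    import Text.Printf (printf)

    type Key = (String, String)  -- (test name, metric name)

    -- Parse "test,metric,value" lines into a map keyed by (test, metric).
    parse :: String -> M.Map Key Double
    parse = M.fromList . map row . lines
      where
        row l = case splitOn ',' l of
                  [t, m, v] -> ((t, m), read v)
                  _         -> error ("malformed line: " ++ l)
        splitOn c s = case break (== c) s of
                        (chunk, [])       -> [chunk]
                        (chunk, _ : rest) -> chunk : splitOn c rest

    main :: IO ()
    main = do
      [before, after] <- getArgs
      old <- parse <$> readFile before
      new <- parse <$> readFile after
      -- Only metrics present in both runs are compared.
      let deltas = M.intersectionWith (\o n -> (n - o) / o * 100) old new
      mapM_ (\((t, m), d) -> printf "%-30s %-20s %+7.2f%%\n" t m d)
            (M.toList deltas)

Then something like "compare-perf before.csv after.csv" would give a quick
overview of which metrics moved and by how much.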