
On Tue, Dec 6, 2016 at 2:44 AM Ben Gamari wrote:

> Michal Terepeta writes:
> [...]
>> Looking at the comments on the proposal from Moritz, most people would
>> prefer to extend/improve nofib or `tests/perf/compiler` tests. So I guess
>> the main question is - what would be better:
>> - Extending nofib with modules that are compile-only (i.e., not runnable)
>>   and focus on stressing the compiler?
>> - Extending `tests/perf/compiler` with the ability to run all the tests
>>   and do easy "before and after" comparisons?
> I don't have a strong opinion on which of these would be better. However,
> I would point out that currently the tests/perf/compiler tests are
> extremely labor-intensive to maintain while doing relatively little to
> catch performance regressions. There are a few issues here:
>
>  * some tests aren't very reproducible between runs, meaning that
>    contributors sometimes don't catch regressions in their local
>    validations
>
>  * many tests aren't very reproducible between platforms and all tests
>    are inconsistent between differing word sizes. This means that we end
>    up having many sets of expected performance numbers in the testsuite.
>    In practice nearly all of these except 64-bit Linux are out-of-date.
>
>  * our window-based acceptance criterion for performance metrics doesn't
>    catch most regressions, which typically bump allocations by a couple
>    percent or less (whereas the acceptance thresholds range from 5% to
>    20%). This means that the testsuite fails to catch many deltas, only
>    failing when some unlucky person finally pushes the number over the
>    threshold.
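
(Agreed. Just to spell out why the window doesn't help much - a tiny
sketch, with made-up numbers, of how such a check behaves as I understand
it:)

    module WindowSketch where

    -- Window-based acceptance: pass while the measured value stays
    -- within +/- tol (relative) of the recorded expectation.
    withinWindow :: Double -> Double -> Double -> Bool
    withinWindow tol expected actual =
      abs (actual - expected) / expected <= tol

    -- Made-up numbers: expectation of 1e9 allocated bytes, 10% window,
    -- four successive ~3% regressions.
    example :: [Bool]
    example = map (withinWindow 0.10 1.0e9) [1.03e9, 1.06e9, 1.09e9, 1.12e9]
    -- => [True, True, True, False]

As long as nobody updates the expected value, each small regression passes
on its own, and only the last, equally innocent change trips the check.
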
> Joachim and I discussed this issue a few months ago at Hac Phi; he had an
> interesting approach to tracking expected performance numbers which may
> both alleviate these issues and reduce the maintenance burden that the
> tests pose. I wrote down some terse notes in #12758.
Thanks for mentioning the ticket!

To be honest, I'm not a huge fan of having performance tests treated the
same as any other tests. IMHO they are quite different:

- They usually need a quiet environment (e.g., you cannot run two different
  tests at the same time), whereas with ordinary correctness tests I can
  run as many as I want concurrently.
- The output is not really binary (correct vs incorrect) but some kind of
  number (or collection of numbers) that we want to track over time.
- The decision whether to fail is harder. Since the output might be noisy,
  you need either quite relaxed bounds (and miss small regressions) or
  stronger bounds (and suffer from flakiness and maintenance overhead).

So for the purpose of "I have a small change and want to check its effect
on compiler performance, expecting, e.g., a ~1% difference", the model of
running benchmarks separately from tests is much nicer. I can run them when
I'm not doing anything else on the computer and then easily compare the
results. (That's what I usually do for nofib.) For tracking performance
over time, one could set something up to run the benchmarks when the
machine is idle. (Isn't that what perf.haskell.org is doing?)

Due to that, if we want to extend tests/perf/compiler to support this use
case, I think we should include benchmarks there that are *not* tests (and
are not included in ./validate), together with some easy tool to run all of
them and give you a quick comparison of what's changed. To a certain degree
this would then be orthogonal to the improvements suggested in the ticket,
but we could probably reuse some things (e.g., dumping .csv files for perf
metrics? see the rough sketch below).

How should we proceed? Should I open a new ticket focused on this? (Maybe
we could try to figure out all the details there?)

Thanks,
Michal
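
PS: To make the .csv idea a bit more concrete, here is a rough sketch of
the kind of "before and after" comparison I have in mind. The file format,
names, and everything else here are made up, just for illustration: each
run would dump one .csv with "test,metric,value" lines, and a small tool
would print the relative change per metric.

    module Main (main) where

    import qualified Data.Map.Strict as M
    import System.Environment (getArgs)
    import Text.Printf (printf)

    type Key = (String, String)  -- (test name, metric name)

    -- Parse "test,metric,value" lines into a map keyed by (test, metric).
    parse :: String -> M.Map Key Double
    parse = M.fromList . map row . lines
      where
        row l = case splitOn ',' l of
                  [t, m, v] -> ((t, m), read v)
                  _         -> error ("malformed line: " ++ l)
        splitOn c s = case break (== c) s of
                        (chunk, [])       -> [chunk]
                        (chunk, _ : rest) -> chunk : splitOn c rest

    main :: IO ()
    main = do
      [before, after] <- getArgs
      old <- parse <$> readFile before
      new <- parse <$> readFile after
      -- Only metrics present in both runs are compared.
      let deltas = M.intersectionWith (\o n -> (n - o) / o * 100) old new
      mapM_ (\((t, m), d) -> printf "%-30s %-20s %+7.2f%%\n" t m d)
            (M.toList deltas)

Then something like "compare-perf before.csv after.csv" would give a quick
overview of which metrics moved and by how much.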