
Michal Terepeta
Interesting! I must have missed this proposal. It seems that it didn't meet with much enthusiasm, though (but then, it also proposes having a completely separate repo on GitHub).
Personally, I'd be happy with something more modest:

- A collection of modules/programs that are more representative of real Haskell programs and that stress various aspects of the compiler. (This seems to be a weakness of nofib, where >90% of the modules compile in less than 0.4s.)
This would be great.
- A way to compile all of those and do "before and after" comparisons easily. To measure the time, we should probably try to compile each module at least a few times. (It seems that this is not currently possible with `tests/perf/compiler`, and nofib only compiles the programs once, AFAICS.)
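To sketch what I mean (a rough, hypothetical driver, not a concrete proposal; the module paths are made up and it assumes `ghc` is on the PATH), something along these lines would compile each module a few times and report the median wall-clock time per module:

```haskell
-- Rough sketch: compile each module several times with `ghc -c` and
-- report the median wall-clock time per module.
import Control.Monad (forM, forM_, replicateM)
import Data.List (sort)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Process (callProcess)

-- Modules to benchmark; purely illustrative paths.
modules :: [FilePath]
modules = ["Stress/BigRecords.hs", "Stress/DeepDeriving.hs"]

-- Time a single one-shot compilation; -fforce-recomp defeats the
-- recompilation checker so every run does the full work.
timeCompile :: FilePath -> IO Double
timeCompile m = do
  start <- getCurrentTime
  callProcess "ghc" ["-c", "-fforce-recomp", m]
  end <- getCurrentTime
  pure (realToFrac (diffUTCTime end start))

-- Compile each module a few times and take the median, which is less
-- sensitive to noise than a single measurement.
main :: IO ()
main = do
  results <- forM modules $ \m -> do
    times <- sort <$> replicateM 5 (timeCompile m)
    pure (m, times !! (length times `div` 2))
  forM_ results $ \(m, t) ->
    putStrLn (m ++ ": " ++ show t ++ "s (median of 5)")
```

Running this against two GHC builds (e.g. by switching what `ghc` resolves to on the PATH) and diffing the output would already give a crude "before and after" comparison.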
Looking at the comments on the proposal from Moritz, most people would prefer to extend/improve nofib or the `tests/perf/compiler` tests. So I guess the main question is: which would be better?

- Extending nofib with modules that are compile-only (i.e., not runnable) and that focus on stressing the compiler?
- Extending `tests/perf/compiler` with the ability to run all the tests and do easy "before and after" comparisons?
I don't have a strong opinion on which of these would be better. However, I would point out that currently the `tests/perf/compiler` tests are extremely labor-intensive to maintain while doing relatively little to catch performance regressions. There are a few issues here:

* Some tests aren't very reproducible between runs, meaning that contributors sometimes don't catch regressions in their local validations.
* Many tests aren't very reproducible between platforms, and all tests are inconsistent between differing word sizes. This means that we end up having many sets of expected performance numbers in the testsuite. In practice nearly all of these except 64-bit Linux are out of date.
* Our window-based acceptance criterion for performance metrics doesn't catch most regressions, which typically bump allocations by a couple of percent or less (whereas the acceptance thresholds range from 5% to 20%). This means that the testsuite fails to catch many deltas, only failing when some unlucky person finally pushes the number over the threshold (a toy illustration with made-up numbers is sketched below).

Joachim and I discussed this issue a few months ago at Hac Phi; he had an interesting approach to tracking expected performance numbers which may both alleviate these issues and reduce the maintenance burden that the tests pose. I wrote down some terse notes in #12758.

Cheers,

- Ben
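To make the windowing problem concrete, here is a minimal sketch with made-up numbers: a fixed expected allocation figure, a 10% acceptance window, and a series of commits that each regress allocations by about 1%. Every individual commit passes; only the tenth trips the window.

```haskell
-- Toy model (hypothetical numbers) of a window-based perf check:
-- each commit adds ~1% to allocations, but the test only fails once
-- the measurement exceeds the recorded expected value by 10%.
expected :: Double
expected = 1.0e9  -- expected allocations recorded in the testsuite

withinWindow :: Double -> Bool
withinWindow measured = measured <= expected * 1.10

main :: IO ()
main = do
  -- allocations after each of ten successive ~1% regressions
  let drift = take 10 (iterate (* 1.01) (expected * 1.01))
  mapM_ (\(n, a) ->
           putStrLn ("commit " ++ show n ++ ": "
                     ++ (if withinWindow a then "passes" else "FAILS")))
        (zip [1 :: Int ..] drift)
```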