Measuring performance of GHC

Hi everyone,

I've been running nofib a few times recently to see the effect of some changes on compile time (not the runtime of the compiled program). And I've started wondering how representative nofib is when it comes to measuring compile time and compiler allocations? It seems that most of the nofib programs compile really quickly...

Is there some collection of modules/libraries/applications that was put together with the purpose of benchmarking GHC itself and I just haven't seen/found it?

If not, maybe we should create something? IMHO it sounds reasonable to have separate benchmarks for:
- Performance of GHC itself.
- Performance of the code generated by GHC.

Thanks,
Michal

I agree.
I find compilation time on things with large data structures, such as
working with the GHC AST via the GHC API, gets pretty slow.
To the point where I have had to explicitly disable optimisation on HaRe,
otherwise the build takes too long.
Alan

Nod nod.
amazonka-ec2 has a particularly painful module containing just a couple of
hundred type definitions and associated instances and stuff. None of the
types is enormous. There's an issue open on GitHub[1] where I've guessed at
some possible better ways of splitting the types up to make GHC's life
easier, but it'd be great if it didn't need any such shenanigans. It's a
bit of a pathological case: auto-generated 15kLoC and lots of deriving, but
I still feel it should be possible to compile with less than 2.8GB RSS.
[1] https://github.com/brendanhay/amazonka/issues/304
Cheers,
David

Hi,

did you try to compile it with a profiled GHC and look at the report? I would not be surprised if it pointed to some obvious sub-optimal algorithms in GHC.

Greetings,
Joachim

Seems like a good idea, for sure. I have not, but I might eventually.

If not, maybe we should create something? IMHO it sounds reasonable to have
separate benchmarks for:
- Performance of GHC itself.
- Performance of the code generated by GHC.
I think that would be great, Michal. We have a small and unrepresentative sample in testsuite/tests/perf/compiler.
Simon

Hi,

I started the GHC Performance Regression Collection Proposal[1] (Rendered [2]) a while ago with the idea of having a trivially community-curated set of small[3] real-world examples with performance regressions. I might be at fault here for not describing this to the best of my abilities. Thus, if there is interest, and this sounds like a useful idea, maybe we should still pursue this proposal?

Cheers,
moritz

[1]: https://github.com/ghc-proposals/ghc-proposals/pull/26
[2]: https://github.com/angerman/ghc-proposals/blob/prop/perf-regression/proposal...
[3]: for some definition of small

Interesting! I must have missed this proposal. It seems that it didn't meet with much enthusiasm though (but it also proposes to have a completely separate repo on github).

Personally, I'd be happy with something more modest:
- A collection of modules/programs that are more representative of real Haskell programs and stress various aspects of the compiler. (this seems to be a weakness of nofib, where >90% of modules compile in less than 0.4s)
- A way to compile all of those and do "before and after" comparisons easily. To measure the time, we should probably try to compile each module at least a few times. (it seems that this is not currently possible with `tests/perf/compiler` and nofib only compiles the programs once AFAICS)

Looking at the comments on the proposal from Moritz, most people would prefer to extend/improve nofib or `tests/perf/compiler` tests. So I guess the main question is - what would be better:
- Extending nofib with modules that are compile-only (i.e., not runnable) and focus on stressing the compiler?
- Extending `tests/perf/compiler` with the ability to run all the tests and do easy "before and after" comparisons?

Personally, I'm slightly leaning towards `tests/perf/compiler`, since this would allow sharing the same module as a test for `validate` and for comparing the performance of the compiler before and after a change. What do you think?

Thanks,
Michal
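A minimal sketch of the "compile each module at least a few times and compare" idea above, assuming a hypothetical module path, a fixed GHC command line, and five repetitions; none of this is part of nofib or tests/perf/compiler today:

-- Sketch only: compile one module repeatedly with -fforce-recomp and
-- report the best wall-clock time. The module path, flags and
-- repetition count are illustrative assumptions.
import Control.Monad (replicateM)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Process (callProcess)

timeCompile :: FilePath -> IO Double
timeCompile src = do
  start <- getCurrentTime
  callProcess "ghc" ["-O", "-fforce-recomp", "-c", src]
  end <- getCurrentTime
  pure (realToFrac (diffUTCTime end start))

main :: IO ()
main = do
  let src  = "bench/SomeBigModule.hs"  -- hypothetical input module
      runs = 5 :: Int
  times <- replicateM runs (timeCompile src)
  putStrLn ("best of " ++ show runs ++ " runs: " ++ show (minimum times) ++ " s")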

Michal Terepeta writes:
Interesting! I must have missed this proposal. It seems that it didn't meet with much enthusiasm though (but it also proposes to have a completely separate repo on github).
Personally, I'd be happy with something more modest: - A collection of modules/programs that are more representative of real Haskell programs and stress various aspects of the compiler. (this seems to be a weakness of nofib, where >90% of modules compile in less than 0.4s)
This would be great.
- A way to compile all of those and do "before and after" comparisons easily. To measure the time, we should probably try to compile each module at least a few times. (it seems that this is not currently possible with `tests/perf/compiler` and nofib only compiles the programs once AFAICS)
Looking at the comments on the proposal from Moritz, most people would prefer to extend/improve nofib or `tests/perf/compiler` tests. So I guess the main question is - what would be better: - Extending nofib with modules that are compile only (i.e., not runnable) and focus on stressing the compiler? - Extending `tests/perf/compiler` with ability to run all the tests and do easy "before and after" comparisons?
I don't have a strong opinion on which of these would be better. However, I would point out that currently the tests/perf/compiler tests are extremely labor-intensive to maintain while doing relatively little to catch performance regressions. There are a few issues here:

* some tests aren't very reproducible between runs, meaning that contributors sometimes don't catch regressions in their local validations
* many tests aren't very reproducible between platforms and all tests are inconsistent between differing word sizes. This means that we end up having many sets of expected performance numbers in the testsuite. In practice nearly all of these except 64-bit Linux are out-of-date.
* our window-based acceptance criterion for performance metrics doesn't catch most regressions, which typically bump allocations by a couple percent or less (whereas the acceptance thresholds range from 5% to 20%). This means that the testsuite fails to catch many deltas, only failing when some unlucky person finally pushes the number over the threshold.

Joachim and I discussed this issue a few months ago at Hac Phi; he had an interesting approach to tracking expected performance numbers which may both alleviate these issues and reduce the maintenance burden that the tests pose. I wrote down some terse notes in #12758.

Cheers,
- Ben

Hi,

I see the following challenges here, which have partially been touched by the discussion in the mentioned proposal.

- The tests we are looking at might be quite time-intensive (lots of modules that take substantial time to compile). Is this practical to run when people locally execute nofib to get *some* idea of the performance implications? Where is the threshold for the total execution time of running nofib?
- One of the core issues I see in day-to-day programming (even though not necessarily with Haskell right now) is that the spare time I have to file bug reports, boil down performance regressions etc. and file them with open source projects is not paid for and hence minimal. Hence, whenever the tools I use make it really easy for me to file a bug or performance regression, or fix something, in the least time, the chances of me being able to help out increase greatly. This was one of the ideas behind using just pull requests. E.g. this code seems to be really slow, or has subjectively regressed in compilation time, and I also feel confident I can legally share this code snippet; so I just create a quick pull request with a short description, and then carry on with whatever pressing task I'm trying to solve right now.
- Making sure that measurements are reliable. (E.g. running on a dedicated machine with no other applications interfering.) I assume Joachim has quite some experience here.

Thanks.

Cheers,
Moritz

There's the same difficulty at the other end too - people who might fix perf regressions are typically not paid for either. So they (e.g. me) tend to focus on things where there is a small repro case, which in turn costs work to produce. E.g. #12745, which I fixed recently in part because thomie found a lovely small example.

So I'm a bit concerned that lowering the barrier to entry for perf reports might not actually lead to better perf. (But undeniably the suite we'd build up would be a Good Thing, so we'd be a bit further forward.)

Simon

I did not intend to imply that there was a surplus of time on the other end :)

Whether this would result in a bunch of tiny test cases that can pinpoint the underlying issue, I'm not certain. Say we would tag the test cases though (e.g. uses TH, uses GADTs, uses X, Y and Z) and run these samples on every commit or every other commit (whatever the available hardware would allow the test suite to run on, and maybe even backtest where possible), regressions w.r.t. subsets might be identifiable. E.g. commit <hash> made test cases predominantly with GADTs spike.

Worst case scenario, we have to declare defeat and decide that this approach has not produced any viable results, and we wasted the time of contributors providing the samples. On the other hand, we would never know without the samples, as they would have never been provided in the first place?

Cheers,
moritz

On Tue, Dec 6, 2016 at 2:44 AM Ben Gamari wrote:
Joachim and I discussed this issue a few months ago at Hac Phi; he had an interesting approach to tracking expected performance numbers which may both alleviate these issues and reduce the maintenance burden that the tests pose. I wrote down some terse notes in #12758.
Thanks for mentioning the ticket!

To be honest, I'm not a huge fan of having performance tests being treated the same as any other tests. IMHO they are quite different:
- They usually need a quiet environment (e.g., cannot run two different tests at the same time). But with ordinary correctness tests, I can run as many as I want concurrently.
- The output is not really binary (correct vs incorrect) but some kind of a number (or collection of numbers) that we want to track over time.
- The decision whether to fail is harder. Since output might be noisy, you need to have either quite relaxed bounds (and miss small regressions) or try to enforce stronger bounds (and suffer from the flakiness and maintenance overhead).

So for the purpose of "I have a small change and want to check its effect on compiler performance and expect, e.g., ~1% difference", the model of running benchmarks separately from tests is much nicer. I can run them when I'm not doing anything else on the computer and then easily compare the results. (that's what I usually do for nofib). For tracking the performance over time, one could set something up to run the benchmarks when idle. (isn't that what perf.haskell.org is doing?)

Due to that, if we want to extend tests/perf/compiler to support this use case, I think we should include benchmarks there that are *not* tests (and are not included in ./validate), but there's some easy tool to run all of them and give you a quick comparison of what's changed. To a certain degree this would then be orthogonal to the improvements suggested in the ticket. But we could probably reuse some things (e.g., dumping .csv files for perf metrics?)

How should we proceed? Should I open a new ticket focused on this? (maybe we could try to figure out all the details there?)

Thanks,
Michal
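One way to picture the "easy tool to give you a quick comparison of what's changed" part: a sketch that diffs two hypothetical CSV dumps of per-test allocation numbers. The file names and the "name,value" format are assumptions, not an existing GHC or nofib artifact:

-- Sketch only: compare two hypothetical CSV dumps of perf metrics
-- ("test-name,bytes-allocated") and print the relative change for
-- every test present in both runs.
import qualified Data.Map.Strict as M

parseCsv :: String -> M.Map String Double
parseCsv = M.fromList . map row . lines
  where
    row l = case break (== ',') l of
      (name, ',' : val) -> (name, read val)
      _                 -> error ("malformed line: " ++ l)

main :: IO ()
main = do
  before <- parseCsv <$> readFile "before.csv"
  after  <- parseCsv <$> readFile "after.csv"
  mapM_ report (M.toList (M.intersectionWith (,) before after))
  where
    report (name, (b, a)) =
      putStrLn (name ++ ": " ++ show (100 * (a - b) / b) ++ "%")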

Michal Terepeta writes:
Thanks for mentioning the ticket!
Sure!
To be honest, I'm not a huge fan of having performance tests being treated the same as any other tests. IMHO they are quite different:
- They usually need a quiet environment (e.g., cannot run two different tests at the same time). But with ordinary correctness tests, I can run as many as I want concurrently.
This is absolutely true; if I had a nickel for every time I saw the testsuite fail, only to pass upon re-running I would be able to fund a great deal of GHC development ;)
- The output is not really binary (correct vs incorrect) but some kind of a number (or collection of numbers) that we want to track over time.
Yes, and this is more or less the idea which the ticket is supposed to capture; we track performance numbers in the GHC repository in git notes and have Harbormaster (or some other stable test environment) maintain them. Exact metrics would be recorded for every commit and we could warn during validate if something changes suspiciously (e.g. look at the mean and variance of the metric over the past N commits and squawk if the commit bumps the metric more than some number of sigmas). This sort of scheme could be implemented in either the testsuite or nofib. It's not clear that one is better than the other (although we would want to teach the testsuite driver to run performance tests serially).
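A minimal sketch of the mean/variance check Ben describes, assuming the metric's history is already available as a list of Doubles; the threshold of k sigmas is a free parameter, not a value anyone has settled on:

-- Sketch only: flag a new metric value as suspicious if it deviates
-- from the mean of the recorded history by more than k standard
-- deviations (population variance; fine for a rough check).
suspicious :: Double    -- allowed number of sigmas, e.g. 3
           -> [Double]  -- metric over the past N commits
           -> Double    -- metric for the new commit
           -> Bool
suspicious k history new = abs (new - mean) > k * sigma
  where
    n     = fromIntegral (length history)
    mean  = sum history / n
    sigma = sqrt (sum [(x - mean) ^ (2 :: Int) | x <- history] / n)

-- e.g. suspicious 3 [1.00e9, 1.01e9, 0.99e9, 1.02e9] 1.10e9 == True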
- The decision whether to fail is harder. Since output might be noisy, you need to have either quite relaxed bounds (and miss small regressions) or try to enforce stronger bounds (and suffer from the flakiness and maintenance overhead).
Yep. That is right.
So for the purpose of "I have a small change and want to check its effect on compiler performance and expect, e.g., ~1% difference", the model of running benchmarks separately from tests is much nicer. I can run them when I'm not doing anything else on the computer and then easily compare the results. (that's what I usually do for nofib). For tracking the performance over time, one could set something up to run the benchmarks when idle. (isn't that what perf.haskell.org is doing?)
Due to that, if we want to extend tests/perf/compiler to support this use case, I think we should include there benchmarks that are *not* tests (and are not included in ./validate), but there's some easy tool to run all of them and give you a quick comparison of what's changed.
When you put it like this it does sound like nofib is the natural choice here.
To a certain degree this would be then orthogonal to the improvements suggested in the ticket. But we could probably reuse some things (e.g., dumping .csv files for perf metrics?)
Indeed.
How should we proceed? Should I open a new ticket focused on this? (maybe we could try to figure out all the details there?)
That sounds good to me.

Cheers,
- Ben

On Tue, Dec 6, 2016 at 10:10 PM Ben Gamari wrote:
[...]
How should we proceed? Should I open a new ticket focused on this? (maybe we could try to figure out all the details there?)
That sounds good to me.
Cool, opened https://ghc.haskell.org/trac/ghc/ticket/12941 to track this.

Cheers,
Michal

Hi,

On Tuesday, 06.12.2016, 19:27 +0000, Michal Terepeta wrote:
(isn't that what perf.haskell.org is doing?)
For compiler performance, it only reports the test suite perf test numbers so far. If someone modifies the nofib runner to give usable timing results for the compiler, I can easily track these numbers as well.

Greetings,
Joachim

Joachim Breitner writes:
If someone modifies the nofib runner to give usable timing results for the compiler, I can easily track these numbers as well.
I have a module [1] that does precisely this for the PITA project (which I still have yet to put up on a public server; I'll try to make time for this soon).

Cheers,
- Ben

[1] https://github.com/bgamari/ghc-perf-import/blob/master/SummarizeResults.hs

Hi,

On Tuesday, 06.12.2016, 17:14 -0500, Ben Gamari wrote:
I have a module [1] that does precisely this for the PITA project (which I still have yet to put up on a public server; I'll try to make time for this soon).
Are you saying that the compile time measurements of a single run of the compiler are actually useful? I’d expect we first have to make nofib call the compiler repeatedly.

Also, shouldn’t this then become part of nofib-analyse?

Greetings,
Joachim

Joachim Breitner writes:
Are you saying that the compile time measurements of a single run of the compiler are actually useful?
Not really, I generally ignore the compile times. However, knowing compiler allocations on a per-module basis is quite nice.
I’d expect we first have to make nofib call the compiler repeatedly.
This would be a good idea though.
Also, shouldn’t this then become part of nofib-analyse?
The logic for producing these statistics is implemented by nofib-analyse's Slurp module today. All the script does is produce the statistics in a more consistent format.

Cheers,
- Ben

Michal Terepeta writes:
Is there some collection of modules/libraries/applications that was put together with the purpose of benchmarking GHC itself and I just haven't seen/found it?
Sadly no; I've put out a number of calls for minimal programs (e.g. small, fairly free-standing real-world applications) but the response hasn't been terribly strong. I frankly can't blame people for not wanting to take the time to strip out dependencies from their working programs.

Joachim and I have previously discussed the possibility of manually collecting a set of popular Hackage libraries on a regular basis for use in compiler performance characterization.

Cheers,
- Ben
participants (7)
- Alan & Kim Zimmerman
- Ben Gamari
- David Turner
- Joachim Breitner
- Michal Terepeta
- Moritz Angermann
- Simon Peyton Jones