Testing of GHC extensions & optimizations

Hi, For those familiar with GHC source code & internals, how are extensions & optimizations tested? And what are the quality policies for accepting new code into GHC? I am interested in testing compilers in general using random testing. Is it used on GHC?

Hi,
Here are a few things we do regarding compiler/runtime performance:
- Each commit goes through a set of tests, some of which also check max. residency, total allocations, etc. of the compiler or the compiled program, and fail if those numbers exceed the allowed thresholds. See [1] for an example (a rough sketch of the idea follows after this list).
- There's https://perf.haskell.org/ghc/, which does some testing on every commit. I don't know exactly what it's doing (it's hard to tell from the web page, but I guess it's only running a few selected tests/benchmarks?). I've personally never used it; I just know that it exists.
- Most of the time, if a patch is expected to change compiler or runtime performance, the author submits nofib results and updates the perf tests in the test suite with the new numbers. This process is manual, and contributors are sometimes asked by reviewers for nofib numbers. See [2,3] for nofib.
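To give a rough idea of what such a check boils down to, here is an illustrative sketch in plain Haskell. This is not the testsuite's actual code, and the baseline/tolerance numbers are made up; compile with -rtsopts and run with +RTS -T:

import Control.Monad (unless, when)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.Exit (exitFailure)

-- The code whose allocation behaviour we want to guard against regressions.
workload :: Int
workload = sum [1 .. 1000000]

-- Made-up baseline and tolerance; in the real test suite the expected
-- numbers live next to the test and are bumped by hand when justified.
baseline, tolerancePct :: Double
baseline = 8.0e7
tolerancePct = 5

main :: IO ()
main = do
  statsOn <- getRTSStatsEnabled
  unless statsOn $ do
    putStrLn "Run with +RTS -T to enable RTS statistics."
    exitFailure
  print workload                        -- force the workload
  stats <- getRTSStats
  let actual    = fromIntegral (allocated_bytes stats)
      deviation = abs (actual - baseline) / baseline * 100
  putStrLn ("allocated " ++ show actual ++ " bytes, "
            ++ show deviation ++ "% off the baseline")
  when (deviation > tolerancePct) $ do
    putStrLn "Perf test failed: allocations drifted beyond the tolerance."
    exitFailure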
We currently don't use random testing.
[1]: https://github.com/ghc/ghc/blob/565ef4cc036905f9f9801c1e775236bb007b026c/tes...
[2]: https://github.com/ghc/nofib
[3]: https://ghc.haskell.org/trac/ghc/wiki/Building/RunningNoFib
Ömer

Hi Omer, thanks for the reply. The tests you run are for regression
testing, that is, non-functional aspects, is my understanding right? What
about testing that optimizations and extensions are correct from a
functional aspect?

On Sun, 2 Sep 2018 at 20:05, Rodrigo Stevaux wrote:
Hi Omer, thanks for the reply. The tests you run are for regression testing, that is, non-functional aspects, is my understanding right? [...]
Quite the opposite, the usual steps are:
* A bug is reported.
* A regression test is added to GHC's test suite, reproducing the bug (
https://ghc.haskell.org/trac/ghc/wiki/Building/RunningTests/Adding).
* The bug is fixed.
This makes sure that the bug doesn't come back later. Do this for a few decades, and you have a very comprehensive test suite for functional aspects. :-) The reasoning behind this: blindly adding tests is wasted effort most of the time, because that way you often test things which only very rarely break; bugs, OTOH, point you very concretely at the problematic/tricky/complicated parts of your SW.
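To make that concrete, a reduced regression test is typically just a tiny program plus its expected output, checked in next to the other tests. A sketch of what such a file might look like (the ticket number is invented, purely for illustration):

-- T99999.hs  (hypothetical ticket number, for illustration only)
-- A reduced reproducer for a wrong-result bug that, say, only showed up at -O2.
module Main where

{-# NOINLINE f #-}            -- keep the function around so the bug can trigger
f :: Int -> Int
f x = 2 * x + 1

main :: IO ()
main = print (map f [1 .. 5])
-- The test suite compares the program's output against a checked-in
-- expected-output file (here it would contain: [3,5,7,9,11]).
-- If the bug ever comes back, the output changes and the test fails.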
Catching increases in runtime/memory consumption is a slightly different
story, because you have to come up with "typical" scenarios to make useful
comparisons. You can have synthetic scenarios for very specific parts of
the compiler, too, like pattern matching with tons of constructors, or
using gigantic literals, or type checking deeply nested tricky things,
etc., but I am not sure if such things are usually called "regression
tests".
Cheers,
S.

On 02.09.2018 at 21:58, Sven Panne wrote:
Quite the opposite, the usual steps are:
* A bug is reported. * A regression test is added to GHC's test suite, reproducing the bug (https://ghc.haskell.org/trac/ghc/wiki/Building/RunningTests/Adding). * The bug is fixed.
This way it is made sure that the bug doesn't come back later.
That's just the... non-thinking aspect, and more embarrassment avoidance. The first level of automated testing.
Do this for a few decades, and you have a very comprehensive test suite for functional aspects. :-) The reasoning behind this: Blindly adding tests is wasted effort most of time, because this way you often test things which only very rarely break: Bugs OTOH hint you very concretely at problematic/tricky/complicated parts of your SW.
Well, you have to *think*. You can't just blindly add tests for every bug that was ever reported; you get an ever-growing pile of test code, and if the spec changes you need to change the tests. So you need a strategy for curating the test code, and you very much prefer to test for the thing that actually went wrong, not the thing that was reported. I'm pretty sure the GHC guys do, actually; I'm just speaking up so that people don't take this "just add a test whenever a bug occurs" at face value; there's much more to it.
Catching increases in runtime/memory consumption is a slightly different story, because you have to come up with "typical" scenarios to make useful comparisons.
It's just a case where you cannot blindly add a test for every performance regression you see, you have to set up testing beforehand. Which is the exact opposite of what you recommend, so maybe the recommendation shouldn't be taken at face value ;-P
You can have synthetic scenarios for very specific parts of the compiler, too, like pattern matching with tons of constructors, or using gigantic literals, or type checking deeply nested tricky things, etc., but I am not sure if such things are usually called "regression tests".
It's a matter of definition and common usage, but indeed many people associate the term "regression testing" with "let's write a test case whenever we see a bug". This is one of the reasons why I prefer the term "automated testing". It's both more general and encompasses all the things that one does. Oh, and sometimes you even add a test blindly due to a bug report. It's still a good first line of defense, it's just not what you should always do, and never without thinking about an alternative. Regards, Jo

On Sun, 2 Sep 2018 at 22:44, Joachim Durchholz <jo@durchholz.org> wrote:
That's just the... non-thinking aspect, and more embarrassment avoidance. The first level of automated testing.
Well, even avoiding embarrassing bugs is extremely valuable. The sheer number of bugs in real-world SW *is* actually highly embarrassing, and even worse: similar bugs have probably been introduced before. Getting some tricky algorithm wrong is the exception, at least for two reasons: the majority of code is typically very mundane and boring, and people are usually more awake and concentrated when they know that they are writing non-trivial stuff. Of course your mileage varies, depending on the domain, the experience of the programmers, deadline pressure, etc.
Do this for a few decades, and you have a very comprehensive test suite for functional aspects. :-) The reasoning behind this: Blindly adding tests is wasted effort most of time, because this way you often test things which only very rarely break: Bugs OTOH hint you very concretely at problematic/tricky/complicated parts of your SW.
Well, you have to *think*. You can't just blindly add tests for every bug that was ever reported; you get an every-growing pile of test code, and if the spec changes you need to change the tests. So you need a strategy to curate the test code, and you very much prefer to test for the thing that actually went wrong, not the thing that was reported.
Two things here: I never proposed to add the exact code from the bug report to a test suite. Bug reports are usually too big and too unspecific, so of course you add a minimal, focused test triggering the buggy behavior. Furthermore: if the spec changes, your tests *must* break, by all means, otherwise what are the tests actually testing, if not the spec? Of course only those tests should break which test the changed part of the spec.
It's just a case where you cannot blindly add a test for every performance regression you see, you have to set up testing beforehand. Which is the exact opposite of what you recommend, so maybe the recommendation shouldn't be taken at face value ;-P
This is exactly why I said that these tests are a different story. For performance measurements there is no binary "failed" or "correct" outcome, because typically many tradeoffs are involved (space vs. time etc.). Therefore you have to define what you consider important, measure that, and guard it against regressions.
It's a matter of definition and common usage, but indeed many people associate the term "regression testing" with "let's write a test case whenever we see a bug". [...]
This sounds far too disparaging, and quite a few companies have a rule like "no bug fix gets committed without an accompanying regression test" for a good reason. People usually have no real clue where their most problematic code is (just like they have no clue where the most performance-critical part is), so having *some* hint (a bug report) is far better than guessing without any hint. Cheers, S.

Thanks for the clarification.
What I am hinting at is that the Csmith project caught many bugs in C compilers by using random testing -- feeding compilers random programs and checking whether the optimizations preserved program behavior.
Haskell, having tens of optimizations, could be a good application of the same technique.
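In miniature, the technique looks something like this (a toy sketch: the little expression language, the "optimizer" and the property are all invented for illustration; none of this is GHC code):

import Test.QuickCheck

-- A tiny expression language standing in for "random programs".
data Expr = Lit Int | Add Expr Expr | Mul Expr Expr
  deriving Show

instance Arbitrary Expr where
  arbitrary = sized gen
    where
      gen 0 = Lit <$> arbitrary
      gen n = oneof
        [ Lit <$> arbitrary
        , Add <$> gen (n `div` 2) <*> gen (n `div` 2)
        , Mul <$> gen (n `div` 2) <*> gen (n `div` 2)
        ]

-- Reference semantics.
eval :: Expr -> Int
eval (Lit n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b

-- A stand-in "optimization pass": constant folding plus a few identities.
optimize :: Expr -> Expr
optimize (Add a b) = case (optimize a, optimize b) of
  (Lit 0, b')    -> b'
  (a', Lit 0)    -> a'
  (Lit x, Lit y) -> Lit (x + y)
  (a', b')       -> Add a' b'
optimize (Mul a b) = case (optimize a, optimize b) of
  (Lit 1, b')    -> b'
  (a', Lit 1)    -> a'
  (Lit x, Lit y) -> Lit (x * y)
  (a', b')       -> Mul a' b'
optimize e = e

-- The differential property: optimizing must not change what a program computes.
prop_optimizePreservesSemantics :: Expr -> Bool
prop_optimizePreservesSemantics e = eval (optimize e) == eval e

main :: IO ()
main = quickCheck prop_optimizePreservesSemantics

Csmith does this at a much larger scale for C; for a real compiler the hard part is generating valid (for GHC: well-typed) programs in the first place.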
I have no familiarity with GHC or with compilers in general; I am just looking for something to study.
My question, in its most direct form, is: in your view, could GHC optimizations hide bugs that could potentially be revealed by exploring program spaces?

Have a look at Michal Palka's Ph.D. thesis:
https://research.chalmers.se/publication/195849
IIRC, his testing revealed several strictness bugs in GHC when compiling with optimization.
/ Emil
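The flavour of the comparison involved is roughly this (a made-up sketch, not the thesis's actual setup): run both the original and a "transformed" version of some code on a partially defined input and check that they agree, including on whether they diverge.

import Control.Exception (SomeException, evaluate, try)

-- A function and a rewrite of it that an over-eager optimizer might produce.
-- The rewrite is wrong: it forces y even when x is False.
orig, rewritten :: Bool -> Int -> Int
orig      x y = if x then y + 1 else 0
rewritten x y = y `seq` (if x then y + 1 else 0)

-- Observe a value up to "defined vs. bottom".
observe :: Int -> IO (Either SomeException Int)
observe = try . evaluate

main :: IO ()
main = do
  a <- observe (orig      False undefined)
  b <- observe (rewritten False undefined)
  -- Random generation of programs and inputs automates this kind of comparison.
  putStrLn $ case (a, b) of
    (Right x, Right y) | x == y -> "agree (both defined, same value)"
    (Left _,  Left _)           -> "agree (both bottom)"
    _                           -> "strictness (or value) mismatch!"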

OK, this is the kind of stuff I'm looking for. This is great. Many thanks for the insight.

participants (5)
- Emil Axelsson
- Joachim Durchholz
- Rodrigo Stevaux
- Sven Panne
- Ömer Sinan Ağacan