
Hi there!

Just a quick update on our CI situation. Ben, John, Davean and I had a discussion about CI yesterday: what we can do about it, as well as some minor notes on why we are frustrated with it. This is an open invitation to anyone who in earnest wants to work on CI. Please come forward and help! We'd be glad to have more people involved!

First the good news: over the last few weeks we've seen that we *can* improve CI performance quite substantially. The goal is now to have an MR go through CI within at most 3 hours. There are some ideas on how to make this even faster, especially on wide (high core count) machines; however, that will take a bit more time.

Now to the more thorny issue: stat failures. We do not want GHC to regress, and I believe everyone is on board with that mission. Yet we have just witnessed a train of marge trials all fail due to a -2% regression in a few tests, and we've thus been blocking getting stuff into master for at least another day. This is (in my opinion) not acceptable! We just had five days of nothing working because master was broken and subsequently all CI pipelines kept failing. We have thus effectively wasted a week.

We can mitigate the latter part by enforcing marge for all merges to master (and with faster pipeline turnaround times this might be more palatable than with 9-12h turnaround times -- when you need to get something done! ha!), but that won't help us with issues where marge can't find a set of buildable MRs, because she just keeps hitting a combination of MRs that somehow together increase or decrease metrics.

We have three knobs to adjust:

- Make GHC build faster / make the testsuite run faster. There is some rather interesting work going on about parallelizing (earlier) during builds. We've also seen that we've wasted enormous amounts of time during darwin builds in the kernel, because of a bug in the test driver.
- Use faster hardware. We've seen that just this can cut windows build times from 220min to 80min.
- Reduce the number of builds. We used to build two pipelines for each marge merge, and if either of the two (see below) failed, marge's merge would fail as well. So not only did we build twice as much as we needed to, we also doubled our chances of hitting bogus build failures.

We need to do something about this, and I'd advocate for just not making stats fail with marge. Build errors of course, but stat failures, no. We should then have a separate dashboard (Ben has some old code lying around for this, which someone would need to pick up and polish ...) that tracks GHC's performance for each commit to master, with easy access from the dashboard to the offending commit. We will also need to consider the implications of synthetic micro benchmarks, as opposed to, say, building Cabal or other packages, which reflect more of the real-world experience of users of GHC.

Going forward, I will try to provide a data-driven report on GHC's CI on a bi-weekly or monthly basis (we will have to see what the cost of writing it up is, and how useful it turns out to be). My sincere hope is that it will help us better understand our CI situation, instead of just having some vague complaints about it.

Cheers,
Moritz

No it wasn't. It was about the stat failures described in the next
paragraph. I could have been more clear about that. My apologies!
On Wed, Mar 17, 2021 at 4:14 PM Spiwack, Arnaud
and if either of both (see below) failed, marge's merge would fail as well.
Re: “see below” is this referring to a missing part of your email?

Then I have a question: why are there two pipelines running on each merge
batch?

*why* is a very good question. The MR fixing it is here:
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5275

Ah, so it was really two identical pipelines (one for the branch where
Margebot batches commits, and one for the MR that Margebot creates before
merging). That's indeed a non-trivial amount of purely wasted
computer-hours.
Taking a step back, I am inclined to agree with the proposal of not
checking stat regressions in Margebot. My high-level opinion on this is
that perf tests don't actually test the right thing. Namely, they don't
prevent performance drift over time (if a given test is allowed to degrade
by 2% every commit, it can take a 100% performance hit in just 35 commits).
While it is important to measure performance, and to avoid too egregious
performance degradation in a given commit, it's usually performance over
time which matters. I don't really know how to apply it to collaborative
development, and help maintain healthy performance. But flagging
performance regressions in MRs, while not making them block batched merges
sounds like a reasonable compromise.
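For what it's worth, the arithmetic behind the "35 commits" figure is easy to check with a throwaway Python snippet:

    # A 2% compounding degradation per commit roughly doubles the metric
    # (i.e. a 100% hit) after 35 commits:
    print(1.02 ** 35)   # ~2.0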

I am not advocating dropping perf tests for merge requests; I just want them to not be fatal for marge batches. Yes, this means that a bunch of unrelated merge requests could each be fine with respect to the per-MR perf checks, but the aggregate might fail them, and then the next MR against the merged aggregate will start failing. Even that is a pretty bad situation, imo.

I honestly don't have a good answer. I just see marge work on batches, over and over and over again, just to fail. Eventually marge should figure out a subset of the merges that fits into the perf window, but that might be after 10 tries -- so after up to ~30+ hours, which means there won't be any merge request landing in GHC for 30 hours. I find that rather unacceptable.

I think we need better visualisation of perf regressions that happen on master. Ben has some WIP for this, and I think John said there might be some way to add a nice (maybe reflex) UI to it. If we can see regressions on master easily, and go from "oh, at this point in time GHC got worse" to "this is the commit", we might be able to figure it out.

But what do we expect of patch authors? Right now, if five people write patches to GHC, and each of them eventually manages to get their MR green after a long review, they finally see it assigned to marge -- and then it starts failing? Their patch on its own was fine, but the aggregate with other people's code leads to regressions? So we now expect all patch authors together to try to figure out what happened? Figuring out why something regressed is hard enough, and we only have very few people who are actually capable of debugging this. Thus I believe it would end up with Ben, Andreas, Matthiew, Simon, ... or someone else from GHC HQ anyway to figure out why it regressed, be it in the review stage, or dissecting a marge aggregate, or on master.

Thus I believe in most cases we'd have to look at the regressions anyway, and right now we just convolutedly make working on GHC a rather depressing job. Increasing the barrier to entry by also requiring everyone to have absolutely stellar perf-regression-debugging skills is quite a challenge.

There is also the question of whether our synthetic benchmarks actually measure real-world performance. Do the micro benchmarks translate into the same regressions in, say, building aeson, vector or Cabal? The latter is what most practitioners care about more than the micro benchmarks.

Again, I'm absolutely not in favour of GHC regressing; it's slow enough as it is. I just think CI should be assisting us and not holding development back.
Cheers,
Moritz

On Mar 17, 2021, at 6:18 AM, Moritz Angermann wrote:
But what do we expect of patch authors? Right now if five people write patches to GHC, and each of them eventually manage to get their MRs green, after a long review, they finally see it assigned to marge, and then it starts failing? Their patch on its own was fine, but their aggregate with other people's code leads to regressions? So we now expect all patch authors together to try to figure out what happened? Figuring out why something regressed is hard enough, and we only have a very few people who are actually capable of debugging this. Thus I believe it would end up with Ben, Andreas, Matthiew, Simon, ... or someone else from GHC HQ anyway to figure out why it regressed, be it in the Review Stage, or dissecting a marge aggregate, or on master.
I have previously posted against the idea of allowing Marge to accept regressions... but the paragraph above is sadly convincing. Maybe Simon is right about opening up the windows to, say, be 100% (which would catch a 10x regression) instead of infinite, but I'm now convinced that Marge should be very generous in allowing regressions -- provided we also have some way of monitoring drift over time.

Separately, I've been concerned for some time about the peculiarity of our perf tests. For example, I'd be quite happy to accept a 25% regression on T9872c if it yielded a 1% improvement on compiling Cabal. T9872 is very very very strange! (Maybe if *all* the T9872 tests regressed, I'd be more worried.) I would be very happy to learn that some more general, representative tests are included in our examinations.

Richard

Re: Performance drift: I opened https://gitlab.haskell.org/ghc/ghc/-/issues/17658 a while ago with an idea of how to measure drift a bit better. It's basically an automatically checked version of "Ben stares at performance reports every two weeks and sees that T9872 has regressed by 10% since 9.0".

Maybe we can have Marge check for drift, and each individual MR for incremental perf regressions?

Sebastian

Yes, I think the counterpoint of "automating what Ben does" so that people besides Ben can do it is very important. In this case, I think a good thing we could do is asynchronously build more of master post-merge, such as using the perf stats to automatically bisect anything that is fishy, including within marge-bot roll-ups, which wouldn't be built by the regular workflow anyway.

I also agree with Sebastian that the overfit/overly-synthetic nature of our current tests, plus the sketchy way we have ignored drift, makes the current approach worth abandoning in any event. The fact that the gold standard must include tests of larger, "real world" code, which unfortunately takes longer to build, is I think also a point in favour of this asynchronous approach: we trade MR latency for stat latency, but we better utilize our build machines and get better stats, and when a human goes to fix something a few days later, they have a much better foundation to start their investigation.

Finally, I agree with SPJ that for fairness and sustainability's sake, the person investigating issues after the fact should ideally be the MR author, and definitely, definitely not Ben. But I hope that better stats, nice-looking graphs, and maybe a system to automatically ping MR authors will make perf debugging much more accessible, enabling that goal.

John
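To make "automatically bisect anything that is fishy" a bit more concrete, here is a minimal sketch of a driver one could hand to git bisect run. This is not existing tooling: the test name, baseline figure and the measure-alloc.sh helper are all made up for illustration.

    #!/usr/bin/env python3
    # Hypothetical bisect driver: exit 0 = good, 1 = bad, 125 = skip
    # (git bisect's convention for commits that cannot be built/measured).
    import subprocess
    import sys

    TEST = "T9872c"               # assumed test name
    BASELINE = 2_500_000_000      # assumed baseline 'bytes allocated'
    TOLERANCE = 0.02              # 2% acceptance window

    def measure(test: str) -> int:
        # Placeholder: in reality this would build GHC (e.g. via Hadrian),
        # run the one perf test, and parse the reported metric.
        out = subprocess.run(["./measure-alloc.sh", test],   # hypothetical helper
                             capture_output=True, text=True)
        if out.returncode != 0:
            sys.exit(125)         # skip commits we cannot measure
        return int(out.stdout.strip())

    allocs = measure(TEST)
    sys.exit(1 if allocs > BASELINE * (1 + TOLERANCE) else 0)

One would then run something like "git bisect start <bad> <good>" followed by "git bisect run ./bisect-alloc.py" over the suspicious range on master.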

I'd be quite happy to accept a 25% regression on T9872c if it yielded a 1% improvement on compiling Cabal. T9872 is very very very strange! (Maybe if *all* the T9872 tests regressed, I'd be more worried.)
While I fully agree with this, we should *always* want to know if a small synthetic benchmark regresses by a lot. Or, in other words, we don't want CI to accept such a regression for us ever, but the developer of a patch should need to explicitly ok it. Otherwise we just slow down a lot of seldom-used code paths by a lot.

Now, that isn't really an issue anyway, I think. The question is rather: is 2% a large enough regression to worry about? 5%? 10%?

Cheers,
Andreas

You probably want a sliding window anyway. Having N 1.8% regressions in a row can still slow things down a lot, while a 3% regression after a 5% improvement is probably fine. - Merijn
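For illustration, a minimal sketch of what such a sliding-window check could look like (not existing CI code; the window length and budget are made-up numbers):

    # Sliding-window drift check: fail when the cumulative change over the
    # last `window` accepted commits exceeds `budget`, even if every single
    # step stayed below the per-commit threshold.
    def drift_exceeded(history, window=20, budget=0.05):
        """history: per-commit metric values, oldest first."""
        recent = history[-window:]
        if len(recent) < 2:
            return False
        return recent[-1] / recent[0] - 1.0 > budget

    # Twenty commits that each regress by 1.8% slip past a 2% per-commit
    # check, but the window flags the ~40% cumulative drift.
    values = [100.0 * 1.018 ** i for i in range(21)]
    print(drift_exceeded(values))   # True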

On 3/17/21 4:16 PM, Andreas Klebinger wrote:
Now that isn't really an issue anyway I think. The question is rather is 2% a large enough regression to worry about? 5%? 10%?
5-10% is still around system noise even on a lightly loaded workstation. Not sure if CI is not run on some shared cloud resources where it may be even higher.

I've done a simple experiment of pinning ghc while compiling ghc-cabal and I've been able to "speed" it up by 5-10% on a W-2265.

Also, following this CI/performance regressions discussion, I'm not entirely sure if this is not just a witch-hunt hurting/beating mostly the most active GHC developers. Another idea may be to give up on CI doing perf regression testing at all and invest the saved resources into proper investigation of GHC/Haskell program performance. Not sure if this would not be more beneficial longer term.

Just one random number thrown into the ring: Linux's perf claims that nearly every second L3 cache access on the example above ends with a cache miss. Is that a good number or a bad number? See the stats below (perf stat -d on ghc with '+RTS -T -s -RTS').

Good luck to anybody working on that!

Karel

Linking utils/ghc-cabal/dist/build/tmp/ghc-cabal ...
  61,020,836,136 bytes allocated in the heap
   5,229,185,608 bytes copied during GC
     301,742,768 bytes maximum residency (19 sample(s))
       3,533,000 bytes maximum slop
             840 MiB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2012 colls,     0 par    5.725s   5.731s     0.0028s    0.1267s
  Gen  1        19 colls,     0 par    1.695s   1.696s     0.0893s    0.2636s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   27.849s  ( 32.163s elapsed)
  GC      time    7.419s  (  7.427s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   35.269s  ( 39.601s elapsed)

  Alloc rate    2,191,122,004 bytes per MUT second

  Productivity  79.0% of total user, 81.2% of total elapsed

Performance counter stats for '/export/home/karel/sfw/ghc-8.10.3/bin/ghc -H32m -O -Wall -optc-Wall -O0 -hide-all-packages -package ghc-prim -package base -package binary -package array -package transformers -package time -package containers -package bytestring -package deepseq -package process -package pretty -package directory -package filepath -package template-haskell -package unix --make utils/ghc-cabal/Main.hs -o utils/ghc-cabal/dist/build/tmp/ghc-cabal -no-user-package-db -Wall -fno-warn-unused-imports -fno-warn-warnings-deprecations -DCABAL_VERSION=3,4,0,0 -DBOOTSTRAPPING -odir bootstrapping -hidir bootstrapping libraries/Cabal/Cabal/Distribution/Fields/Lexer.hs -ilibraries/Cabal/Cabal -ilibraries/binary/src -ilibraries/filepath -ilibraries/hpc -ilibraries/mtl -ilibraries/text/src libraries/text/cbits/cbits.c -Ilibraries/text/include -ilibraries/parsec/src +RTS -T -s -RTS':

       39,632.99 msec task-clock                #    0.999 CPUs utilized
          17,191      context-switches          #    0.434 K/sec
               0      cpu-migrations            #    0.000 K/sec
         899,930      page-faults               #    0.023 M/sec
 177,636,979,975      cycles                    #    4.482 GHz                    (87.54%)
 181,945,795,221      instructions              #    1.02  insn per cycle         (87.59%)
  34,033,574,511      branches                  #  858.718 M/sec                  (87.42%)
   1,664,969,299      branch-misses             #    4.89% of all branches        (87.48%)
  41,522,737,426      L1-dcache-loads           # 1047.681 M/sec                  (87.53%)
   2,675,319,939      L1-dcache-load-misses     #    6.44% of all L1-dcache hits  (87.48%)
     372,370,395      LLC-loads                 #    9.395 M/sec                  (87.49%)
     173,614,140      LLC-load-misses           #   46.62% of all LL-cache hits   (87.46%)

    39.663103602 seconds time elapsed

    38.288158000 seconds user
     1.358263000 seconds sys

That really shouldn't be near system noise for a well-constructed
performance test. You might be seeing things like thermal issues, etc.,
though -- good benchmarking is a serious subject.
Also, we're not talking about wall clock tests, we're talking about specific
metrics. The machines do tend to be bare metal, but many of these metrics are
entirely independent of CPU performance, memory timing, etc. Well, not quite,
but that's a longer discussion.
The investigation of Haskell code performance is a very good thing to do
BTW, but you'd still want to avoid regressions in the improvements you
made. How well we can do that and the cost of it is the primary issue here.
-davean

To be clear: All performance tests that run as part of CI measure
allocations only. No wall clock time.
Those measurements are (mostly) deterministic and reproducible between
compiles of the same worktree and not impacted by thermal issues/hardware
at all.
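For context, a perf test opts into these measurements in the testsuite's all.T files roughly like this (reproduced from memory, so treat the exact spelling as approximate): the entry tells the driver to record the compiler's 'bytes allocated' for the test and to fail if it deviates from the recorded baseline by more than the given percentage.

    # Roughly what an entry in testsuite/tests/perf/compiler/all.T looks like:
    test('T9872c',
         [collect_compiler_stats('bytes allocated', 5)],   # 5% window
         compile,
         [''])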

I left the wiggle room for things like longer wall time causing more time
events in the IO Manager/RTS which can be a thermal/HW issue.
They're small and indirect though
-davean

My guess is that most of the "noise" is not run time, but the compiled code changing in hard-to-predict ways. For example, https://gitlab.haskell.org/ghc/ghc/-/merge_requests/1776/diffs was a very small MR that took *months* of on-off work to get passing metrics tests. In the end, binding `is_boot` twice helped a bit, and dumb luck helped a little bit more.

No matter how you analyze it, that's a lot of pain for what is manifestly a performance-irrelevant MR --- no one is writing 10,000 default methods, or whatever it would take to make that micro-optimisation worth it! Perhaps this is an extreme example, but my rough sense is that it's not an isolated outlier.

John

After the idea of letting marge accept unexpected perf improvements, and after looking at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4759 -- which failed because a single test, for a single build flavour, crossed the improvement threshold after rebasing -- I wondered: when would accepting an unexpected perf improvement ever backfire?

In practice I either have a patch that I expect to improve performance for some things, so I want to accept whatever gains I get; or I don't expect improvements, so it's *maybe* worth failing CI for, in case I optimized away some code I shouldn't have, or something of that sort.

How could this be actionable? Perhaps having a set of indicators for CI, such as
  "Accept allocation decreases"
  "Accept residency decreases"
would be saner. I have personally *never* gotten value out of the requirement to list the individual tests that improve. Usually a whole lot of them do. Some cross the threshold, so I add them. If I'm unlucky, I have to rebase and a new one might make it across the threshold.

Being able to accept improvements (but not regressions) wholesale might be a reasonable alternative. Opinions?
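For concreteness, a sketch of what such an asymmetric rule could look like (not existing CI code; the numbers and the baseline lookup are placeholders):

    # Asymmetric acceptance: unexpected improvements pass, regressions beyond
    # the per-test window still fail.
    def judge(baseline, measured, allowed_regression=0.02):
        change = measured / baseline - 1.0
        if change <= 0:
            # Improvement (or no change): accept wholesale, but record it so
            # that "the improvement vanished after a rebase" stays visible.
            return ("accept", change)
        return ("fail" if change > allowed_regression else "accept", change)

    print(judge(1_000_000_000,   940_000_000))   # ('accept', ~-0.06)
    print(judge(1_000_000_000, 1_035_000_000))   # ('fail', ~0.035)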

What about the case where the rebase *lessens* the improvement? That is, you're expecting these 10 cases to improve, but after a rebase, only 1 improves. That's news! But a blanket "accept improvements" won't tell you.

I'm not hard against this proposal, because I know precise tracking has its own costs. Just wanted to bring up another scenario that might be factored in.

Richard

Yes, this is exactly one of the issues that marge might run into as well: the aggregate ends up performing differently from the individual ones. We have marge to ensure that at least the aggregate builds together, which is the whole point of these merge trains -- not to end up in a situation where two patches that are fine on their own produce a broken merged state that doesn't build anymore.

So marge ensures every commit is buildable. Next, we should run regression tests on all commits on master (and that includes each and every one that marge brings into master). Then we have visualisation that tells us how performance metrics go up/down over time, and we can drill down into commits if they yield interesting results either way.

Now let's say you had a commit that should have made GHC 50% faster across the board, but somehow, after aggregation with other patches, this didn't happen anymore. We'd still expect this to show up somewhere in the individual commits on master, right?

What about the case where the rebase *lessens* the improvement? That is, you're expecting these 10 cases to improve, but after a rebase, only 1 improves. That's news! But a blanket "accept improvements" won't tell you.
I don't think that scenario currently triggers a CI failure, so this wouldn't really change. As I understand it, the current logic is:

* Run the tests.
* Check if any cross the metric thresholds set in the test.
* If so, check whether that test is allowed to cross the threshold.

I believe we don't check that all benchmarks listed with an expected increase/decrease actually do so. It would also be hard to do so reasonably without making it even harder to push MRs through CI.

Andreas

Karel Gardas writes:
On 3/17/21 4:16 PM, Andreas Klebinger wrote:
Now that isn't really an issue anyway I think. The question is rather is 2% a large enough regression to worry about? 5%? 10%?
5-10% is still around system noise even on lightly loaded workstation. Not sure if CI is not run on some shared cloud resources where it may be even higher.
I think when we say "performance" we should be clear about what we are referring to. Currently, GHC does not measure instructions/cycles/time. We only measure allocations and residency. These are significantly more deterministic than time measurements, even on cloud hardware. I do think that eventually we should start to measure a broader spectrum of metrics, but this is something that can be done on dedicated hardware as a separate CI job.
I've done simple experiment of pining ghc compiling ghc-cabal and I've been able to "speed" it up by 5-10% on W-2265.
Do note that once we switch to Hadrian ghc-cabal will vanish entirely (since Hadrian implements its functionality directly).
Also following this CI/performance regs discussion I'm not entirely sure if this is not just a witch-hunt hurting/beating mostly most active GHC developers. Another idea may be to give up on CI doing perf reg testing at all and invest saved resources into proper investigation of GHC/Haskell programs performance. Not sure, if this would not be more beneficial longer term.
I don't think this would be beneficial. It's much easier to prevent a regression from getting into the tree than it is to find and characterise it after it has been merged.
Just one random number thrown to the ring. Linux's perf claims that nearly every second L3 cache access on the example above ends with cache miss. Is it a good number or bad number? See stats below (perf stat -d on ghc with +RTS -T -s -RTS').
It is very hard to tell; it sounds bad but it is not easy to know why or whether it is possible to improve. This is one of the reasons why I have been trying to improve sharing within GHC recently; reducing residency should improve cache locality. Nevertheless, the difficulty interpreting architectural events is why I generally only use `perf` for differential measurements. Cheers, - Ben

We need to do something about this, and I'd advocate for just not making stats fail with marge.
Generally I agree. One point you don’t mention is that our perf tests (which CI forces us to look at assiduously) are often pretty weird cases. So there is at least a danger that these more exotic cases will stand in the way of (say) a perf improvement in the typical case.
But “not making stats fail” is a bit crude. Instead how about
* Always accept stat improvements
* We already have per-benchmark windows. If the stat falls outside the window, we fail. You are effectively saying “widen all windows to infinity”. If something makes a stat 10 times worse, I think we *should* fail. But 10% worse? Maybe we should accept and look later as you suggest. So I’d argue for widening the windows rather than disabling them completely.
* If we did that we’d need good instrumentation to spot steps and drift in perf, as you say. An advantage is that since the perf instrumentation runs only on committed master patches, not on every CI, it can cost more. In particular , it could run a bunch of “typical” tests, including nofib and compiling Cabal or other libraries.
The big danger is that by relieving patch authors from worrying about perf drift, it’ll end up in the lap of the GHC HQ team. If it’s hard for the author of a single patch (with which she is intimately familiar) to work out why it’s making some test 2% worse, imagine how hard, and demotivating, it’d be for Ben to wonder why 50 patches (with which he is unfamiliar) are making some test 5% worse.
I’m not sure how to address this problem. At least we should make it clear that patch authors are expected to engage *actively* in a conversation about why their patch is making something worse, even after it lands.
Simon

Simon Peyton Jones via ghc-devs writes:
We need to do something about this, and I'd advocate for just not making stats fail with marge.
Generally I agree. One point you don’t mention is that our perf tests (which CI forces us to look at assiduously) are often pretty weird cases. So there is at least a danger that these more exotic cases will stand in the way of (say) a perf improvement in the typical case.
But “not making stats fail” is a bit crude. Instead how about
To be clear, the proposal isn't to accept stats failures for merge request validation jobs. I believe Moritz was merely suggesting that we accept such failures in marge-bot validations (that is, the pre-merge validation done on batches of merge requests). In my opinion this is reasonable since we know that all of the MRs in the batch do not individually regress. While it's possible that interactions between two or more MRs result in a qualitative change in performance, it seems quite unlikely. What is far *more* likely (and what we see regularly) is that the cumulative effect of a batch of improving patches pushes the batches' overall stat change out of the acceptance threshold. This is quite annoying as it dooms the entire batch. For this reason, I think we should at very least accept stat improvements during Marge validations (as you suggest). I agree that we probably want a batch to fail if two patches accumulate to form a regression, even if the two passed CI individually.
* We already have per-benchmark windows. If the stat falls outside the window, we fail. You are effectively saying “widen all windows to infinity”. If something makes a stat 10 times worse, I think we *should* fail. But 10% worse? Maybe we should accept and look later as you suggest. So I’d argue for widening the windows rather than disabling them completely.
Yes, I agree.
* If we did that we’d need good instrumentation to spot steps and drift in perf, as you say. An advantage is that since the perf instrumentation runs only on committed master patches, not on every CI, it can cost more. In particular , it could run a bunch of “typical” tests, including nofib and compiling Cabal or other libraries.
We already have the beginnings of such instrumentation.
The big danger is that by relieving patch authors from worrying about perf drift, it’ll end up in the lap of the GHC HQ team. If it’s hard for the author of a single patch (with which she is intimately familiar) to work out why it’s making some test 2% worse, imagine how hard, and demotivating, it’d be for Ben to wonder why 50 patches (with which he is unfamiliar) are making some test 5% worse.
Yes, I absolutely agree with this. I would very much like to avoid having to do this sort of post-hoc investigation any more than necessary. Cheers, - Ben
participants (11)
- Andreas Klebinger
- Ben Gamari
- davean
- John Ericson
- Karel Gardas
- Merijn Verstraaten
- Moritz Angermann
- Richard Eisenberg
- Sebastian Graf
- Simon Peyton Jones
- Spiwack, Arnaud