nofib comparisons between 7.0.4, 7.4.2, 7.6.1, and 7.6.2

Hi all,

I haven't had much time to do performance tsar work yet, but I did run nofib on the last few GHC releases to see the current trend. The benchmarks were run on my 64-bit Core i7-3770 @ 3.40GHz Linux machine. Here are the results:

7.0.4 to 7.4.2:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -1.6%    -57.3%    -39.1%    -36.4%    -25.0%
            Max         +21.5%   +121.5%    +24.5%    +25.4%   +300.0%
 Geometric Mean          +8.5%     -0.7%     -7.1%     -5.2%     +2.0%

The big loser here in terms of runtime is "kahan", which I added to test tight loops involving unboxed arrays and floating point arithmetic. I believe there was a regression in fromIntegral RULES during this release, which meant that some conversions between fixed-width types went via Integer, causing unnecessary allocation.

7.4.2 to 7.6.1:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -5.1%    -23.8%    -11.8%    -12.9%    -50.0%
            Max          +5.3%   +225.5%     +7.2%     +8.8%   +200.0%
 Geometric Mean          -0.4%     +2.1%     +0.3%     +0.2%     +0.7%

The biggest loser here in terms of runtime is "integrate". I haven't looked into why yet.

7.6.1 to 7.6.2:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            Min          -2.9%     +0.0%     -4.8%     -4.4%     -1.9%
            Max          +0.0%     +1.0%     +4.5%     +6.4%    +20.8%
 Geometric Mean          -1.7%     +0.0%     +0.1%     +0.3%     +0.2%

I have two takeaways:

* It's worthwhile running nofib before releases as it does find some programs that regressed.
* There are some other regressions out there (i.e. in code on Hackage) that aren't reflected here, suggesting that we need to add more programs to nofib.

Cheers,
Johan
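A hedged illustration of the fromIntegral issue described above (a minimal sketch, not the actual nofib "kahan" source): when the relevant RULES don't fire, a fixed-width conversion inside a tight loop is routed through Integer, so the loop allocates on every iteration instead of staying unboxed.

    -- Illustrative sketch only, not the nofib "kahan" program. Whether
    -- this loop allocates depends on whether the fromIntegral RULES
    -- rewrite the Word32 -> Double conversion to a primop or leave it
    -- going via Integer.
    import Data.Word (Word32)

    sumConv :: Word32 -> Double
    sumConv n = go 0 0.0
      where
        go :: Word32 -> Double -> Double
        go i acc
          | i >= n    = acc
          | otherwise = go (i + 1) (acc + fromIntegral i)

    main :: IO ()
    main = print (sumConv 100000000)  -- compare allocation with +RTS -s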

I'm +1 for this. Eyal Lotem and I were just discussing this on IRC a
few minutes ago, and he suffered a rather large (~25%) performance hit
when upgrading to 7.6.1, which is unfortunate.
Committers are typically very good about recording nofib results in
their commit and being performance-courteous, but I'm not sure there's
ever been a longer-scale view of GHC performance over multiple
releases like this - or even a few months. At least not recently. On
top of that, his application was a type checker, which may certainly
stress different performance points than what nofib might. Once we get
performance bots set up, I've got a small set of machines I'm willing
to throw at it.
Thanks for the results, Johan!
-- Regards, Austin

I'm +10. This is precisely the reason we have our supreme Performance Tsars, to keep us honest. GHC leadership is becoming increasingly decentralised and I am truly grateful to Bryan and Johan for picking up this particular challenge.
My guess is that regressions are accidental and readily fixed, but we can't fix them if we don't know about them.
Johan mentions more nofib benchmarks: yes please! But someone has to put them in.
Austin, a 25% performance regression moving to 7.6 is not AT ALL what I expect. I generally expect modest performance improvements. Can you characterise more precisely what is happening? The place I always start is to compile the entire thing with -ticky and see where allocation is changing. (Using -prof affects the optimiser too much.)
Simon
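For anyone unfamiliar with the workflow Simon describes, here is a minimal sketch (the commands are standard GHC ticky usage, shown as comments; the module body is just a hypothetical placeholder workload):

    -- Compile and run (GHC 7.x):
    --
    --   ghc -O2 -ticky -rtsopts TickyDemo.hs
    --   ./TickyDemo +RTS -rTickyDemo.ticky
    --
    -- The .ticky report lists per-closure allocation and entry counts;
    -- diffing two reports built with different compiler versions shows
    -- where the allocation changed.
    module Main where

    main :: IO ()
    main = print (sum [1 .. 1000000 :: Int])  -- placeholder workload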

On Tue, Feb 5, 2013 at 2:54 AM, Simon Peyton-Jones wrote:
> Austin, a 25% performance regression moving to 7.6 is not AT ALL what I expect. I generally expect modest performance improvements. Can you characterise more precisely what is happening? The place I always start is to compile the entire thing with -ticky and see where allocation is changing. (Using -prof affects the optimiser too much.)
I have CC'd Eyal just in case. The discussion was informal but he can hopefully provide more context and rigor. I think, off hand, this occurred in a rather large-ish application of his (Lamdu?), and so tracking down precise reasons may prove difficult. I think the most likely case is that those few 'small cuts' accumulate quickly and are reflecting poorly in this particular case - and that's really the worst 'bug report' of all! Hashable/lens alone, for example, could certainly make a sizable impact here when added up; e.g. [1] is a recent example of an alleged perf anomaly. And the OS could certainly be relevant. [2]

All the more reason to expand nofib and get those bots up!

[1] https://github.com/tibbe/hashable/issues/57

[2] Just thinking out loud, but whenever this happens we really need to characterize results on a per-OS/hardware basis if possible in the future, with some relatively detailed hardware info, to be unambiguous. In terms of raw CPU speed, a lot of benchmarks probably won't stand out due to the OS. But OS X is scheduled to get worse in the SMP case soon [3], for example, and if we inevitably try to start doing things like latency or I/O benchmarks, I'm more than certain things will pop up here.

[3] See this ticket: http://hackage.haskell.org/trac/ghc/ticket/7602
-- Regards, Austin

I'd like to investigate the "other regressions out there".
Do you have more info? Perhaps a list? Maybe even benchmarking code?
Thanks.

We have some benchmarks for Cloud Haskell and its underlying network-transport infrastructure that I'm in the process of trying to automate. I'd be very interested to see how these fare against various GHC releases, though I suspect we'll have to tweak the dependencies considerably in order to make the automation happen. I don't know if that fits into the 'other regressions' category or not?

Cheers,
Tim
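For concreteness, a minimal criterion sketch of what one such automated benchmark could look like (roundTrip is a hypothetical stand-in, not part of the network-transport API):

    import Criterion.Main (bench, bgroup, defaultMain, nfIO)

    -- Hypothetical stand-in for a transport round trip; a real harness
    -- would send a message over a network-transport connection and wait
    -- for the echo.
    roundTrip :: Int -> IO Int
    roundTrip n = return (n + 1)

    main :: IO ()
    main = defaultMain
      [ bgroup "transport"
          [ bench "round-trip" (nfIO (roundTrip 42)) ]
      ]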

Is anyone familiar with the "fibon" directory within the nofib.git
repository?
http://darcs.haskell.org/nofib/fibon/
Johan, this at least seems like a potential home for the additional programs you suggested adding. In particular, it has Repa, Dph, Shootout, and Hackage subdirectories.
I'm doing a GHC HQ internship at the moment, and one of
the just-needs-to-happen tasks on my (growing) todo list is to look into
fibon.
SPJ recalls that not all of the various building infrastructures were
getting along. Anyone know the story? Thanks!

I believe fibon/ was helpfully added by someone, but never integrated into the nofib build system. Just needs doing, I think.
Simon

On 05/02/13 10:13, Simon Peyton-Jones wrote:
> I believe fibon/ was helpfully added by someone, but never integrated into the nofib build system. Just needs doing, I think.
Right - I think it was even integrated into the build system, but it wasn't turned on by default. I tried it once and something didn't work, and I didn't have the time to fix it then.

There are some other collections of programs in nofib that aren't run by default:

* nofib/gc: My GC benchmarks (some of these overlap with the rest of nofib, but might have different inputs/parameters). I usually run these when I change something in the GC.

* nofib/smp: The concurrency benchmarks. Edward is using these to tune his new scheduler. These could be enabled by default.

* nofib/parallel: The parallel benchmarks. It wouldn't hurt to run these by default too, on at least 1 core and maybe more. I generally run them on 8 cores when I change something in the RTS.

Cheers,
Simon

On 5 February 2013 02:13, Simon Peyton-Jones wrote:
> I believe fibon/ was helpfully added by someone, but never integrated into the nofib build system. Just needs doing, I think.
No, I spent a fair amount of effort fixing this up about 9 months back. At that stage it worked fine; I haven't run it for 6 months so I'm not sure any more, but it should be close to working at the least.

On Tue, Feb 5, 2013 at 3:19 AM, David Terei wrote:
> No, I spent a fair amount of effort fixing this up about 9 months back. At that stage it worked fine; I haven't run it for 6 months so I'm not sure any more, but it should be close to working at the least.
Instead of trying to get fibon to work I'll try to get some of the shootout benchmarks into nofib. These are small micro benchmarks that shouldn't require anything special to run.

-- Johan

On 5 February 2013 09:34, Johan Tibell wrote:
> Instead of trying to get fibon to work I'll try to get some of the shootout benchmarks into nofib. These are small micro benchmarks that shouldn't require anything special to run.
Agreed. The issue with the fibon folder as a whole is that a lot of the benchmarks have substantial dependencies, as they are taken from Hackage to represent real-world programs. This is handled in a very ugly fashion right now by just including a copy of the source of all dependencies, so over time it will always break as GHC and base change. Shootout and some of the others don't have dependencies, though, so we should look at moving them out of the fibon folder and enabling them by default. After that we can look at better ways to handle the dependencies of the remaining fibon benchmarks.

Why are you creating new shootout benchmarks, though, rather than simply moving the existing Shootout folder from fibon/Shootout to the top level and fixing the makefile?

Some of this discussion going forward may make more sense on trac. There is a trac ticket for improving nofib in general here: http://hackage.haskell.org/trac/ghc/ticket/5793

Cheers,
David

On Tue, Feb 5, 2013 at 2:11 PM, David Terei wrote:
> Why are you creating new shootout benchmarks, though, rather than simply moving the existing Shootout folder from fibon/Shootout to the top level and fixing the makefile?
I discussed this with David offline. The summary is that the shootout benchmarks in fibon have bitrotted to the point that they no longer correspond to the shootout benchmarks on the official site, so there's nothing really gained by modifying the current ones. In addition, I've made sure to closely mirror the compilation settings and input sizes used in the shootout.

I've now added the shootout programs that could be added without modifying the programs themselves. I described why some programs weren't added in nofib/shootout/README.

For the curious, here's the change in these benchmarks from 7.0.4 to 7.6.2:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
   binary-trees          +2.6%     -0.6%     -2.8%     -2.8%    -22.3%
 fannkuch-redux          +1.4% +11514445.    +0.2%     +0.2%     +0.0%
         n-body          +3.8%     +0.0%     +4.4%     +4.4%     +0.0%
       pidigits          +2.2%     -6.9%     -1.7%     -1.2%    -20.0%
  spectral-norm          +2.1%    -61.3%    -54.8%    -54.8%     +0.0%
--------------------------------------------------------------------------------
            Min          +1.4%    -61.3%    -54.8%    -54.8%    -22.3%
            Max          +3.8% +11514445.    +4.4%     +4.4%     +0.0%
 Geometric Mean          +2.4%   +737.6%    -14.7%    -14.6%     -9.1%

Some interesting differences here (and some really good ones)! I looked into fannkuch-redux (nofib/shootout/fannkuch-redux) and confirmed the allocation difference:

7.0.4:

          93,680 bytes allocated in the heap
           2,880 bytes copied during GC
          43,784 bytes maximum residency (1 sample(s))
          21,752 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   38.53s  ( 38.56s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   38.53s  ( 38.56s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    2,431 bytes per MUT second

  Productivity 100.0% of total user, 99.9% of total elapsed

7.6.2:

  10,538,113,312 bytes allocated in the heap
         819,304 bytes copied during GC
          44,416 bytes maximum residency (2 sample(s))
          25,216 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     20177 colls,     0 par    0.06s    0.05s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   38.76s  ( 38.82s elapsed)
  GC      time    0.06s  (  0.05s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   38.83s  ( 38.88s elapsed)

  %GC     time       0.2%  (0.1% elapsed)

  Alloc rate    271,864,153 bytes per MUT second

  Productivity  99.8% of total user, 99.7% of total elapsed

We're going from an essentially non-allocating program to an allocating one.

Aside: I didn't use -fllvm, which is what the shootout normally uses.

-- Johan

On 05/02/13 23:48, Johan Tibell wrote:
> I've now added the shootout programs that could be added without modifying the programs themselves. I described why some programs weren't added in nofib/shootout/README.
This is slightly off topic, but I wanted to plant this thought in people's brains: we shouldn't place much significance in the average of a bunch of benchmarks (even the geometric mean), because it assumes that the benchmarks have a sensible distribution, and we have no reason to expect that to be the case. For example, in the results above, we wouldn't expect a 14.7% reduction in runtime to be seen in a typical program.

Using the median might be slightly more useful, which here would be something around 0% for runtime, though still technically dodgy. When I get around to it I'll modify nofib-analyse to report medians instead of GMs.

Cheers,
Simon
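A minimal sketch of the median computation mentioned above (illustrative only, not the actual nofib-analyse code):

    import Data.List (sort)

    -- Median of a list of percentage changes; averages the two middle
    -- elements for even-length input.
    median :: [Double] -> Double
    median [] = error "median: empty list"
    median xs
      | odd n     = ys !! mid
      | otherwise = (ys !! (mid - 1) + ys !! mid) / 2
      where
        ys  = sort xs
        n   = length ys
        mid = n `div` 2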

On Wed, Feb 6, 2013 at 2:09 AM, Simon Marlow wrote:
> This is slightly off topic, but I wanted to plant this thought in people's brains: we shouldn't place much significance in the average of a bunch of benchmarks (even the geometric mean), because it assumes that the benchmarks have a sensible distribution, and we have no reason to expect that to be the case. For example, in the results above, we wouldn't expect a 14.7% reduction in runtime to be seen in a typical program.
>
> Using the median might be slightly more useful, which here would be something around 0% for runtime, though still technically dodgy. When I get around to it I'll modify nofib-analyse to report medians instead of GMs.
Using the geometric mean as a way to summarize the results isn't that bad. See "How not to lie with statistics: the correct way to summarize benchmark results" (http://ece.uprm.edu/~nayda/Courses/Icom6115F06/Papers/paper4.pdf).

That being said, I think the most useful thing to do is to look at the big losers, as they're often regressions. Making some class of programs much worse but improving the geometric mean overall is often worse than changing nothing at all.

-- Johan
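For reference, the property the Fleming and Wallace paper turns on (sketched here): the geometric mean of per-benchmark ratios factors through the baseline, so a comparison of two systems does not depend on which system the times were normalised to; the arithmetic mean of the same ratios has no such property.

    \mathrm{GM}\left(\frac{x_i}{y_i}\right)
      = \left(\prod_{i=1}^{n} \frac{x_i}{y_i}\right)^{1/n}
      = \frac{\left(\prod_{i=1}^{n} x_i\right)^{1/n}}
             {\left(\prod_{i=1}^{n} y_i\right)^{1/n}}
      = \frac{\mathrm{GM}(x_i)}{\mathrm{GM}(y_i)}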

On 06/02/13 16:04, Johan Tibell wrote:
> Using the geometric mean as a way to summarize the results isn't that bad. See "How not to lie with statistics: the correct way to summarize benchmark results" (http://ece.uprm.edu/~nayda/Courses/Icom6115F06/Papers/paper4.pdf).
Yes - our current usage of GM is because we read that paper :) I've reported GMs of nofib programs in several papers.

I'm not saying the paper is wrong - the GM is definitely more correct than the AM for averaging normalised results. The problem is that we're attributing equal weight to all of our benchmarks, without any reason to expect that they are representative. We collect as many benchmarks as we can and hope they are representative, but in fact it's rarely the case: often a particular optimisation or regression will hit just one or two benchmarks.

So all I'm saying is that we shouldn't expect the GM to be representative. Often there's no sensible mean at all - saying "some programs get a lot better but most don't change" is far more informative than "on average programs got faster by 1.2%".
> That being said, I think the most useful thing to do is to look at the big losers, as they're often regressions. Making some class of programs much worse but improving the geometric mean overall is often worse than changing nothing at all.
Absolutely.

Cheers,
Simon

Hi Johan,
On 06 Feb 2013, at 17:04, Johan Tibell wrote:
> > Using the median might be slightly more useful, which here would be something around 0% for runtime, though still technically dodgy. When I get around to it I'll modify nofib-analyse to report medians instead of GMs.
No.
> Using the geometric mean as a way to summarize the results isn't that bad. See "How not to lie with statistics: the correct way to summarize benchmark results" (http://ece.uprm.edu/~nayda/Courses/Icom6115F06/Papers/paper4.pdf).
I would argue the exact opposite. The geometric mean has absolutely no meaning whatsoever. See e.g.:

- Computer Architecture Performance Evaluation Methods. L. Eeckhout. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, June 2010.
- Quantifying performance changes with effect size confidence intervals. Tomas Kalibera and Richard Jones, 2012 (tech report).
- Measuring Computer Performance: A Practitioner's Guide. D.J. Lilja, 2005.
- The Art of Computer Systems Performance Analysis: techniques for experimental design, measurement, simulation, and modelling. R. Jain, 1991.
> That being said, I think the most useful thing to do is to look at the big losers, as they're often regressions. Making some class of programs much worse but improving the geometric mean overall is often worse than changing nothing at all.
Yes.

Regards,
-- Andy Georges

PS. I wrote this a while back for the Evaluate collaboratory.

# The Mean Is Not A Simple Average

## Application domain

Aggregating measurements. The anti-pattern discusses a single example, though for all uses of an average it is important to consider the right mean to use. Examples of applicable means are: (weighed) arithmetic, (weighed) harmonic, each with respect to the proper weighing factors.

## Premise

You have a set of benchmarks and you wish to quantify your Shiny New Idea (SNI). To make it easy to grok the results, you decide to aggregate the impact of your Shiny New Idea with a single performance number: the mean of the various measurements. This means that readers of your paper can easily compare single numbers: those for a baseline system, those for other enhancements you compare with and, of course, the number for your SNI.

## Description

You have implemented your SNI and you wish to conduct a comparison study to show that your SNI outperforms existing work and improves execution time (or other metrics such as energy consumption, ...) with X% compared to a baseline system. You design an experiment with a set of benchmarks from an applicable benchmark suite and you assemble performance numbers for each benchmark and for each scenario (baseline, your SNI, other work, ...). For example, you assemble execution times (the metric of choice for single-program workloads) and you wish to assess the speedup. Since people prefer single numbers they can compare to see which one is bigger, you must aggregate your data into an average value. While this contains less information than the original data set, it is an easy way to see if your SNI improves things or not and to prove it to your readers or users.

You should choose a mean that allows you to: (i) directly compare the alternatives to each other by canceling out the baseline, (ii) make sure (relevant) outliers do not influence your average too much. Clearly, the geometric mean is perfectly suited for this purpose. Without further ado, determine the per-benchmark speedup for each scenario and compute the various geometric means. The resulting average values immediately allow you to see if your SNI improves on the other scenarios derived from existing work. It also allows you to see how much you improve over these scenarios by dividing them by the geometric mean of your SNI. Do not worry, the formula for the geometric mean makes sure that the baseline values are canceled out and you effectively get the average speedup of your SNI compared to existing work. Now go ahead and publish these numbers that support your SNI.

## Why this is a bad idea

There may be specific circumstances where the use of a geometric mean is warranted, yet producing the average over some benchmark suite for a performance metric of your choice is not one of them. Typically, the geometric mean can be used when the final aggregate performance number results from multiplying individual numbers. For example, when making several enhancements to a system, the average improvement per enhancement can be expressed as the geometric mean of the speedups resulting from the individual improvements. However, for any benchmark suite (regardless of the relative importance one attaches to each benchmark in the suite), the aggregate results from adding the individual results, as is the case for, e.g., overall speedup.

In practically all cases, using either the (weighed) arithmetic mean or the (weighed) harmonic mean is the correct way to compute and report an average. While it is true that the geometric mean sustains a smaller impact from outliers in the measurements compared to the other means, one should always investigate outliers and disregard them if there is an indication that the data is wrong. Otherwise, they can provide valuable insight. Moreover, by adding appropriate weights, one can easily reduce the impact of outliers.

## Example

Suppose you have 5 benchmarks, B1 ... B5. The baseline system has the following measurements: 10, 15, 7, 12, and 16, which yields a total execution time of 60. Hence, the aggregate score is the sum of the individual scores. Suppose now you wish to compare two different enhancements. The first enhancement yields the measurements 8, 10, 6, 11, 12 -- adding up to 47; the second enhancement yields the measurements 7, 12, 5, 10, 14 -- adding up to 48. If we take a look at the global improvement achieved, then that is 60/47 = 1.2766 and 60/48 = 1.25 for enhancement 1 and enhancement 2, respectively. Therefore we conclude that by a small margin, enhancement 1 outperforms enhancement 2, for this particular set of benchmarks. However, the geometric means of the per-benchmark speedups are 1.2604 and 1.2794. From these numbers we would conclude the opposite, namely that enhancement 2 outperforms enhancement 1.

Which mean then yields the correct result? The answer depends on which system to weigh against. If we weigh against the enhanced system, giving the benchmarks the weights that correspond to their relative execution time compared to the execution time of the complete suite (on the same configuration), then we need to use a weighed arithmetic mean. If we weigh against the baseline system, the correct answer is that we need to use the weighed harmonic mean. Of course, the use of weights is often disregarded. If we assume all benchmarks are of equal importance, then we likely will not weigh them. In that case, all three means yield the same conclusion, but none of them accurately reflects the true speedup that is achieved over the entire suite.

## Why is this pattern relevant

The geometric mean is still widely used and accepted by researchers. It can be found in papers published at top venues, such as OOPSLA, PLDI, CGO, etc. It is commonly used by, e.g., VMMark, SPEC CPU, ... On multiple occasions the argument regarding the impact of outliers is brought forth, even though there are other ways to deal with outliers.

## References

- [1] J.E. Smith. Characterizing computer performance with a single number. CACM 31(10), 1988.
- [2] D.A. Patterson and J.L. Hennessy. Computer Organization and Design: The Hardware/Software Approach, Morgan Kaufmann.
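The arithmetic in the Example section is easy to check mechanically; a small Haskell sketch (illustrative only) that reproduces the numbers:

    -- Reproduces Andy's example: overall speedups favour enhancement 1,
    -- while geometric means of per-benchmark speedups favour enhancement 2.
    baseline, enh1, enh2 :: [Double]
    baseline = [10, 15, 7, 12, 16]
    enh1     = [8, 10, 6, 11, 12]
    enh2     = [7, 12, 5, 10, 14]

    overallSpeedup :: [Double] -> Double
    overallSpeedup xs = sum baseline / sum xs      -- 1.2766 and 1.25

    geomMeanSpeedup :: [Double] -> Double
    geomMeanSpeedup xs =
      product (zipWith (/) baseline xs) ** (1 / fromIntegral (length xs))
                                                   -- 1.2604 and 1.2794

    main :: IO ()
    main = mapM_ print
      [ overallSpeedup enh1, overallSpeedup enh2
      , geomMeanSpeedup enh1, geomMeanSpeedup enh2 ]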

On 06/02/13 22:26, Andy Georges wrote:
> Quantifying performance changes with effect size confidence intervals. Tomas Kalibera and Richard Jones, 2012 (tech report).
This is a good one - it was actually a talk by Richard Jones that highlighted to me the problems with averaging over benchmarks (aside from the problem with GM, which he didn't mention). This paper mentions Criterion, incidentally.
> [1] J.E. Smith. Characterizing computer performance with a single number. CACM 31(10), 1988.
And I wish I'd read this a long time ago :) Thanks. No more geometric means for me!

Cheers,
Simon

Hi all,
On 07 Feb 2013, at 10:44, Simon Marlow wrote:
> > Quantifying performance changes with effect size confidence intervals. Tomas Kalibera and Richard Jones, 2012 (tech report).
>
> This is a good one - it was actually a talk by Richard Jones that highlighted to me the problems with averaging over benchmarks (aside from the problem with GM, which he didn't mention).
The paper has a guide for practitioners that improves on what I did in part of my PhD. I think it could be fairly easy to wrap that around Criterion for comparing runs. I should note that a number of people I know who are involved in performance measurement think it is a bit too detailed, but if you can implement this in your testing framework, it could be a cool feature that other people start using too.
> This paper mentions Criterion, incidentally.
Yes :-) I mentioned it several times when we discussed performance measuring in the Evaluate workshops. Since I changed jobs, I am no longer very actively involved here, but some people seem to have picked things up, I guess.
> > [1] J.E. Smith. Characterizing computer performance with a single number. CACM 31(10), 1988.
>
> And I wish I'd read this a long time ago :) Thanks. No more geometric means for me!
You are very welcome.

Regards,
-- Andy

> Instead of trying to get fibon to work I'll try to get some of the shootout benchmarks into nofib. These are small micro benchmarks that shouldn't require anything special to run.
Thank you!

On 5 February 2013 01:24, Nicolas Frisby wrote:
> Is anyone familiar with the "fibon" directory within the nofib.git repository?
Yes. They are from here: https://github.com/dmpots/fibon

Fibon is a newer, alternative benchmarking suite for Haskell done by David M Peixotto. I've used it at times but sadly haven't had much luck; it always seems to take many hours to run on my machine.
participants (8):

- Andy Georges
- Austin Seipp
- David Terei
- Johan Tibell
- Nicolas Frisby
- Simon Marlow
- Simon Peyton-Jones
- Tim Watson