Benchmarking harnesses for a more modern nofib?

Hi all,

Is anyone currently working on, or interested in helping with, a new benchmark suite for Haskell? Perhaps packaging up existing apps and app benchmarks into a new benchmark suite that gives a broad picture of Haskell application performance today?

Background: We run nofib, and we run the shootout benchmarks. But when we want to evaluate basic changes to GHC optimizations or data representation, these really don't give us a full picture of whether a change is beneficial.

A few years ago, fibon (https://hackage.haskell.org/package/fibon) tried to gather some Hackage benchmarks. This may work even better with Stackage, where there are 180 benchmark suites among the 1770 packages currently.

Also, these days companies are building substantial apps in Haskell. Which substantial apps could or should go into a benchmark suite? I see Warp and other web server benchmarks (http://www.infoq.com/news/2015/04/web-frameworks-benchmark-2015) all over the web. But is there a harness that can time some of this code while running inside a single-machine, easy-setup benchmark suite?

Best, -Ryan
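To make the "single-machine, easy-setup harness" question concrete, here is a minimal sketch (purely illustrative, not something from this thread) of one possible shape: start a trivial Warp application in-process and time requests against it with criterion. It assumes the wai, warp, http-types, http-client, and criterion packages; the port, route, and "pong" response are made up for the example.

-- Illustrative only: benchmark an in-process warp server with criterion.
-- Assumes the wai, warp, http-types, http-client, and criterion packages.
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Control.Concurrent (forkIO, threadDelay)
import Criterion.Main (bench, defaultMain, whnfIO)
import Network.HTTP.Client (defaultManagerSettings, httpLbs, newManager, parseRequest, responseStatus)
import Network.HTTP.Types (status200)
import Network.Wai (responseLBS)
import qualified Network.Wai.Handler.Warp as Warp

main :: IO ()
main = do
  -- Start a trivial "pong" application on a made-up local port.
  _ <- forkIO $ Warp.run 8080 $ \_req respond ->
         respond (responseLBS status200 [] "pong")
  threadDelay 200000  -- crude wait for the server to come up
  mgr <- newManager defaultManagerSettings
  req <- parseRequest "http://127.0.0.1:8080/"
  defaultMain
    [ bench "warp GET /" $ whnfIO (responseStatus <$> httpLbs req mgr) ]

A real harness would of course run representative applications and request mixes, but the same in-process pattern keeps the setup to a single machine and a single executable.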

Is anyone currently working on, or interested in helping with, a new benchmark suite for Haskell? Perhaps packaging up existing apps and app benchmarks into a new benchmark suite that gives a broad picture of Haskell application performance today?
I would love to see this done. nofib is showing its age.
An incentive is this: we benchmark GHC against nofib pretty regularly, and pay attention to regressions. If your program is in the benchmark suite, it’s more likely that its performance will be good and stay good.
The tension is that, to be usable, it must be possible to actually run the benchmark suite, on a variety of platforms, without consuming too much time.
· nofib has zero package dependencies. Adding some dependencies is fine, but adding zillions is not. Often they can be cut down, because some of the dependencies are related to incidental features of the benchmark that can be stubbed out.
· More seriously, for figures to be comparable we have to compare the same code. So any package dependencies must be hard dependencies on particular versions. And as GHC moves on, those packages may require (hopefully minor) updates to stay working.
· Test data and test environment can be a challenge, especially for things like web servers. Again, we don't want to force the developer to install too much other stuff.
All that said, it must be possible to do MUCH better than we are doing right now with a 20-year-old suite! Please do join Ryan in working on this.
Simon
From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of Ryan Newton
Sent: 04 April 2016 06:06
To: ghc-devs@haskell.org; Haskell Cafe

Hi, definitely interested. I just talked about this with Richard yesterday, and we quite agree with what you observed. As the maintainer of http://perf.haskell.org/ghc, I'd very much welcome better data!

Note that fibon already has bitrotted, and does not quite work any more. So there is some low hanging fruit in resurrecting that one.

Simon mentioned a few points, such as dependencies. But note that you can relatively easily dump the dependencies' modules into your source repository to both bundle and freeze them. I'd prefer that to (even strict) dependencies on something external, as that lowers the barrier for developers to actually run the benchmarks!

Another important step in that direction would be to define a common output for benchmark suites defined in .cabal files, so it is easier to set up things like http://perf.haskell.org/ghc and http://perf.haskell.org/binary for these projects (one possible shape for such output is sketched after this message).

About the harness: haskell.org is currently paying a student (CCed) to set up a Travis-like infrastructure based on gipeda (the software behind perf.haskell.org) that would allow library authors to very simply get continuous benchmark measurements. Let's see what comes out of that!

Greetings, Joachim

On Monday, 04.04.2016, at 01:06 -0400, Ryan Newton wrote:
Hi all,
Is anyone currently working on, or interested in helping with, a new benchmark suite for Haskell? Perhaps packaging up existing apps and app benchmarks into a new benchmark suite that gives a broad picture of Haskell application performance today?
Background: We run nofib, and we run the shootout benchmarks. But when we want to evaluate basic changes to GHC optimizations or data representation, these really don't give us a full picture of whether a change is beneficial.
A few years ago, fibon tried to gather some Hackage benchmarks. This may work even better with Stackage, where there are 180 benchmark suites among the 1770 packages currently.
Also, these days companies are building substantial apps in Haskell. Which substantial apps could or should go into a benchmark suite? I see Warp and other web server benchmarks all over the web. But is there a harness that can time some of this code while running inside a single-machine, easy-setup benchmark suite?
Best, -Ryan
--
Joachim “nomeata” Breitner
mail@joachim-breitner.de • https://www.joachim-breitner.de/
XMPP: nomeata@joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F
Debian Developer: nomeata@debian.org
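A purely illustrative aside on the "common output" idea above: criterion-based suites can already emit a machine-readable CSV report, either via the --csv command-line flag or programmatically through criterion's Config record, and an external harness could standardize on collecting that file. The benchmark body below is made up for illustration; only the csvFile plumbing is the point.

-- Sketch only: a criterion suite that always writes a CSV report,
-- which an external harness (such as a gipeda driver) could parse uniformly.
module Main (main) where

import Criterion.Main (bench, defaultConfig, defaultMainWith, nf)
import Criterion.Types (Config (..))

main :: IO ()
main = defaultMainWith
  defaultConfig { csvFile = Just "bench-results.csv" }
  [ bench "sum [1..10000]" $ nf (\n -> sum [1 .. n :: Int]) 10000 ]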

Great! I'm glad to hear folks are interested.

It sounds like there is a need for a better low-dependency benchmark suite. I was just grepping through nofib looking for things that are *missing* and I realized there are no uses of atomicModifyIORef, for example (a small illustrative sketch of such a benchmark appears after this message).

What we're working on at Indiana right this second is not quite this effort, but is a separate, complementary effort to gather as much data as possible from a large swath of packages (high dependency count).

Note that fibon already has bitrotted, and does not quite work any more. So there is some low hanging fruit in resurrecting that one.
Agreed. Though I see that nofib already contains some of them. Even though stack + GHC head loses many of stack's benefits, I think that stack and cabal freeze should make it easier to keep things running for the long term than it was with fibon (which bitrotted quickly).
Another important step in that direction would be to define a common output for benchmark suites defined in .cabal files, so it is easier to set up things like http://perf.haskell.org/ghc and http://perf.haskell.org/binary for these projects.
Yes, exitcode-stdio-1.0 is useful for testing but not so much for benchmarking. To attempt to harvest Stackage benchmarks we were going to just assume things are criterion and catch errors as we go. Should we go further and aim to standardize a new value for "type:" within benchmark suites?

About the harness: haskell.org is currently paying a student (CCed) to set up a Travis-like infrastructure based on gipeda (the software behind perf.haskell.org) that would allow library authors to very simply get continuous benchmark measurements. Let's see what comes out of that!
What's the infrastructure that currently gathers the data for perf.haskell.org? Is there a repo you can point to? (Since gipeda itself is just the presentation layer, and something else must be running things & gathering data.) Cheers, -Ryan
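As the sketch referred to above (purely illustrative, not a proposed nofib entry): a minimal criterion microbenchmark exercising atomicModifyIORef could look like the following. The strict variant atomicModifyIORef' is used so the measurement is not dominated by thunk build-up, and the counter workload and iteration count are made up.

-- Illustrative microbenchmark for atomicModifyIORef', the kind of
-- primitive nofib currently does not exercise. Workload is made up.
module Main (main) where

import Control.Monad (replicateM_)
import Criterion.Main (bench, defaultMain, whnfIO)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

bumpCounter :: Int -> IO Int
bumpCounter n = do
  ref <- newIORef (0 :: Int)
  replicateM_ n (atomicModifyIORef' ref (\x -> (x + 1, ())))
  readIORef ref

main :: IO ()
main = defaultMain
  [ bench "atomicModifyIORef' x 100000" $ whnfIO (bumpCounter 100000) ]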

Hi Ryan,

On Monday, 04.04.2016, at 16:35 -0400, Ryan Newton wrote:
What's the infrastructure that currently gathers the data for perf.haskell.org? Is there a repo you can point to? (Since gipeda itself is just the presentation layer, and something else must be running things & gathering data.)
The infrastructure is my office desktop computer, which I don't use (as I always use my laptop). There, a script¹ runs which polls the git repository, looks for new revisions, builds them, and pushes the logs to a dedicated repository on GitHub² (which can be a valuable data source on its own!). A cron job on a virtual machine provided to me by haskell.org polls that repository, runs gipeda, and pushes the output onto www.haskell.org, which serves perf.haskell.org. Not the most sophisticated or robust setup, but it works.

Greetings, Joachim

¹ https://github.com/nomeata/codespeed/blob/ghc/tools/ghc/watch.sh (the repository location is a historical artifact; I really should move this into the gipeda repo or somewhere else).

² https://github.com/nomeata/ghc-speed-logs: I suggest not doing a full checkout, but rather doing a bare clone and using commands like "git show" to access the files, as this repository is huge! This is also what gipeda does.

--
Joachim “nomeata” Breitner
mail@joachim-breitner.de • https://www.joachim-breitner.de/
XMPP: nomeata@joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F
Debian Developer: nomeata@debian.org
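For readers who only want the gist of that watch script, here is a drastically simplified sketch of the poll/build/push loop described above. It is not Joachim's actual watch.sh: the repository paths and the ./build-and-log step are hypothetical stand-ins, and it assumes git on PATH plus the process package.

-- Sketch only: poll a GHC checkout, build new revisions via a
-- hypothetical ./build-and-log script, and push the resulting logs.
module Main (main) where

import Control.Concurrent (threadDelay)
import Control.Monad (forever, unless)
import System.Process (callCommand, readProcess)

main :: IO ()
main = forever $ do
  callCommand "git -C ghc fetch origin"
  newRevs <- readProcess "git" ["-C", "ghc", "rev-list", "HEAD..origin/master"] ""
  unless (null (lines newRevs)) $ do
    callCommand "git -C ghc merge --ff-only origin/master"
    callCommand "./build-and-log"                    -- hypothetical build + log step
    callCommand "git -C ghc-speed-logs push origin master"
  threadDelay (10 * 60 * 1000 * 1000)                -- poll every ten minutes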

(Sorry for duplicates; I realized I couldn't post to the mailing list without registering.)

Hi Ryan,

I'm the student working on the CI part Joachim mentioned. It's not quite there yet, but the groundwork is done. Basically, I'm writing a daemon that will read a config file for a list of git repositories, maintain periodically updated clones of them, and execute a benchmark script on new (Repository, Commit) pairs, after which gipeda is (re-)run. So that might be exactly what you are looking for.

You can find the current code at https://github.com/sgraf812/feed-gipeda. There are still some rough edges (not on Hackage yet, somewhat laborious setup) and the documentation is somewhat incomplete, but you can always send me a mail if you get stuck while setting it up. Documentation etc. should come within the next week.

I'll CC you on the mail thread where we discuss things, if you don't mind.

Regards, Sebastian

On Monday, 4 April 2016 22:35:56 UTC+2, Ryan Newton wrote:
Great! I'm glad to hear folks are interested.
It sounds like there is a need for a better low-dependency benchmark suite. I was just grepping through nofib looking for things that are *missing* and I realized there are no uses of atomicModifyIORef, for example.
What we're working on at Indiana right this second is not quite this effort, but is a separate, complementary effort to gather as much data as possible from a large swath of packages (high dependency count).
Note that fibon already has bitrotted, and does not quite work any
more. So there is some low hanging fruit in resurrecting that one.
Agreed. Though I see that nofib already contains some of them.
Even though stack + GHC head loses many of stack's benefits, I think that stack and cabal freeze should make it easier to keep things running for the long term than it was with fibon (which bitrotted quickly).
Another important step in that direction would be to define a common output for benchmark suites defined in .cabal files, so it is easier to set up things like http://perf.haskell.org/ghc and http://perf.haskell.org/binary for these projects.
Yes, exitcode-stdio-1.0 is useful for testing but not so much for benchmarking. To attempt to harvest Stackage benchmarks we were going to just assume things are criterion and catch errors as we go. Should we go further and aim to standardize a new value for "type:" within benchmark suites?
About the harness: haskell.org is currently paying a student (CCed) to set up a Travis-like infrastructure based on gipeda (the software behind perf.haskell.org) that would allow library authors to very simply get continuous benchmark measurements. Let's see what comes out of that!
What's the infrastructure that currently gathers the data for perf.haskell.org? Is there a repo you can point to? (Since gipeda itself is just the presentation layer, and something else must be running things & gathering data.)
Cheers, -Ryan

Ryan Newton writes:
Hi all, Is anyone currently working on, or interested in helping with, a new benchmark suite for Haskell? Perhaps packaging up existing apps and app benchmarks into a new benchmark suite that gives a broad picture of Haskell application performance today?
I am very interested. I recently encountered a serious performance regression from 7.8 to 7.10 which seems to be fixed in 8.0. Now it's not clear whether this was a library change or GHC itself. I suspect the latter, but given that the performance is better in 8.0, I was not motivated to confirm this.

I am happy to wrap up my example in whatever format, but note that it does pull in quite a few libraries.

Dominic.

The cause of the initial blowup was adding the vector test suite to the normal cabal file, in both -O0 and -O2 forms. This was done so that fusion-rule-based bugs wouldn't lie in hiding for years.

It sounds like the issues that resulted in the build blowup are fixed in 8.0. My guess is it's a combination of some type class coercion / equality manipulation getting better plus other improvements in 8.0.
On Tuesday, April 5, 2016, Dominic Steinitz wrote:
Ryan Newton writes:
Hi all, Is anyone currently working on, or interested in helping with, a new benchmark suite for Haskell? Perhaps packaging up existing apps and app benchmarks into a new benchmark suite that gives a broad picture of Haskell application performance today?
I am very interested. I recently encountered a serious performance regression from 7.8 to 7.10 which seems to be fixed in 8.0. Now it's not clear whether this was a library change or ghc itself. I suspect the latter but given the performance is better in 8.0, I was not motivated to confirm this.
I am happy to wrap up my example in whatever format but note that it does pull in quite a few libraries.
Dominic.

FYI, moving discussion off the ghc-devs list to avoid spamming it. Check out this mailing list if interested:

https://groups.google.com/forum/#!forum/haskell-benchmark-infrastructure

So far, we're discussing benchmark harnesses and perf dashboards, but I'm sure we'll get to the benchmarks themselves at some point.

On Tue, Apr 5, 2016 at 11:59 AM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
The cause of the initial blowup was adding the vector test suite to the normal cabal file, in both -O0 and -O2 forms. This was done so that fusion-rule-based bugs wouldn't lie in hiding for years.
It sounds like the issues that resulted in the build blowup are fixed in 8.0. My guess is it's a combination of some type class coercion / equality manipulation getting better plus other improvements in 8.0.
On Tuesday, April 5, 2016, Dominic Steinitz wrote:
Ryan Newton writes:
Hi all, Is anyone currently working on, or interested in helping with, a new benchmark suite for Haskell? Perhaps packaging up existing apps and app benchmarks into a new benchmark suite that gives a broad picture of Haskell application performance today?
I am very interested. I recently encountered a serious performance regression from 7.8 to 7.10 which seems to be fixed in 8.0. Now it's not clear whether this was a library change or ghc itself. I suspect the latter but given the performance is better in 8.0, I was not motivated to confirm this.
I am happy to wrap up my example in whatever format but note that it does pull in quite a few libraries.
Dominic.

Ryan Newton writes:
FYI, moving discussion off the ghc-devs list to avoid spamming it. Check out this mailing list if interested:
https://groups.google.com/forum/#!forum/haskell-benchmark-infrastructure
So far, we're discussing benchmark harnesses and perf dashboards, but I'm sure we'll get to the benchmarks themselves at some point.
Thanks for this, Ryan! Attention here is badly needed. After the 8.0.1 release Austin and I will be turning some attention to this area as well. I'm looking forward to future discussions. Cheers, - Ben
participants (7)
- Ben Gamari
- Carter Schonwald
- Dominic Steinitz
- Joachim Breitner
- Ryan Newton
- Sebastian Graf
- Simon Peyton Jones