RE: [Haskell-cafe] proposal: HaBench, a Haskell Benchmark Suite

Hi Simon et al,

I've picked up the HaBench/nofib/nobench issue again, needing a decent set of real applications to do some exploring of what people these days call split-compilation. We have a framework that was able to explore GCC optimisations [1] -- the downside there was the dependency of these optimisations on each other, requiring them to be done in a certain order -- for a multi-objective search space, and we extended this to exploring a JIT compiler [2], for Java in our case, which posed its own problems. Going one step further, we'd like to explore the tradeoffs that can be made when compiling on different levels: source to bytecode (in some sense) and bytecode to native. Given that LLVM is quickly becoming a state-of-the-art framework and with the recent GHC support, we figured that Haskell would be an excellent vehicle to conduct our exploration and research (and the fact that some people at our lab have a soft spot for Haskell helps too). Which brings me back to benchmarks.

Are there any inputs available that allow the real part of the suite to run for a sufficiently long time? We're going to use criterion in any case, given our own expertise with rigorous benchmarking [3,4], but since we've made a case in the past against short running apps on managed runtime systems [5], we'd love to have stuff that runs at least in the order of seconds, while doing useful things. All pointers are much appreciated. Or if any of you out there have (recent) apps with inputs that are open source ... let us know.

-- Andy

[1] COLE: Compiler Optimization Level Exploration, Kenneth Hoste and Lieven Eeckhout, CGO 2008
[2] Automated Just-In-Time Compiler Tuning, Kenneth Hoste, Andy Georges and Lieven Eeckhout, CGO 2010
[3] Statistically Rigorous Java Performance Evaluation, Andy Georges, Dries Buytaert and Lieven Eeckhout, OOPSLA 2007
[4] Java Performance Evaluation through Rigorous Replay Compilation, Andy Georges, Lieven Eeckhout and Dries Buytaert, OOPSLA 2008
[5] How Java Programs Interact with Virtual Machines at the Microarchitectural Level, Lieven Eeckhout, Andy Georges and Koen De Bosschere, OOPSLA 2003
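(For reference, a minimal criterion harness looks roughly like the sketch below; the naive fib is only a placeholder workload, not part of any existing suite.)

    import Criterion.Main (bench, defaultMain, whnf)

    -- Naive Fibonacci, standing in for a real benchmark kernel.
    fib :: Int -> Integer
    fib n
      | n < 2     = fromIntegral n
      | otherwise = fib (n - 1) + fib (n - 2)

    main :: IO ()
    main = defaultMain
      [ bench "fib 30" (whnf fib 30)  -- evaluate the application to WHNF on each iteration
      ]

criterion repeatedly runs each benchmarkable and does its own statistical analysis, which is why the underlying workloads still need to be large enough for long-running behaviour (GC pressure, cache effects) to show up at all.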

On 6/24/10 4:24 PM, Andy Georges wrote:
Or if any of you out there have (recent) apps with inputs that are open source ... let us know.
Hi Andy.. you could run the hledger benchmarks, roughly like so:

  $ cabal install tabular
  $ darcs get --lazy http://joyful.com/repos/hledger
  $ cd hledger
  $ make quickbench

and edit bench.tests for different/slower/faster tests.

On 25/06/2010 00:24, Andy Georges wrote:
I've picked up the HaBench/nofib/nobench issue again, needing a decent set of real applications to do some exploring of what people these days call split-compilation. We have a framework that was able to explore GCC optimisations [1] -- the downside there was the dependency of these optimisations on each other, requiring them to be done in a certain order -- for a multi-objective search space, and we extended this to exploring a JIT compiler [2], for Java in our case, which posed its own problems. Going one step further, we'd like to explore the tradeoffs that can be made when compiling on different levels: source to bytecode (in some sense) and bytecode to native. Given that LLVM is quickly becoming a state-of-the-art framework and with the recent GHC support, we figured that Haskell would be an excellent vehicle to conduct our exploration and research (and the fact that some people at our lab have a soft spot for Haskell helps too). Which brings me back to benchmarks.
Are there any inputs available that allow the real part of the suite to run for a sufficiently long time? We're going to use criterion in any case given our own expertise with rigorous benchmarking [3,4], but since we've made a case in the past against short running apps on managed runtime systems [5], we'd love to have stuff that runs at least in the order of seconds, while doing useful things. All pointers are much appreciated.
The short answer is no, although some of the benchmarks have tunable input sizes (mainly the spectral ones) and you can 'make mode=slow' to run those with larger inputs.

More generally, the nofib suite really needs an overhaul or replacement. Unfortunately it's a tiresome job and nobody really wants to do it. There have been various abortive efforts, including nobench and HaBench. Meanwhile we in the GHC camp continue to use nofib, mainly because we have some tool infrastructure set up to digest the results (nofib-analyse). Unfortunately nofib has steadily degraded in usefulness over time due to both faster processors and improvements in GHC, such that most of the programs now run for less than 0.1s and are ignored by the tools when calculating averages over the suite.

We have a need not just for plain Haskell benchmarks, but benchmarks that test

 - GHC extensions, so we can catch regressions
 - parallelism (see nofib/parallel)
 - concurrency (see nofib/smp)
 - the garbage collector (see nofib/gc)

I tend to like quantity over quality: it's very common to get just one benchmark in the whole suite that shows a regression or exercises a particular corner of the compiler or runtime. We should only keep benchmarks that have a tunable input size, however.

Criterion works best on programs that run for short periods of time, because it runs the benchmark at least 100 times, whereas for exercising the GC we really need programs that run for several seconds. I'm not sure how best to resolve this conflict.

Meanwhile, I've been collecting pointers to interesting programs that cross my radar, in anticipation of waking up with an unexpectedly free week in which to pull together a benchmark suite... clearly overoptimistic! But I'll happily pass these pointers on to anyone with the inclination to do it.

Cheers,
Simon
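(To illustrate the "tunable input size" point: a nofib-style benchmark typically takes its problem size from the command line, so the same program can serve as both a fast and a slow variant. The program below is just a sketch of that pattern -- the workload and the default size are made up, and it is not an existing nofib entry.)

    import System.Environment (getArgs)
    import Data.List (foldl')

    -- A toy workload whose running time scales with the problem size n.
    sumPrimes :: Int -> Int
    sumPrimes n = foldl' (+) 0 [ p | p <- [2 .. n], isPrime p ]
      where
        isPrime k = all (\d -> k `mod` d /= 0)
                        [2 .. floor (sqrt (fromIntegral k :: Double))]

    main :: IO ()
    main = do
      args <- getArgs
      -- problem size from the command line, e.g. 200000 (fast) vs 2000000 (slow)
      let n = case args of
                (a:_) -> read a
                []    -> 200000
      print (sumPrimes n)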

Hi Simon et al,

On Jun 25, 2010, at 14:39, Simon Marlow wrote:
On 25/06/2010 00:24, Andy Georges wrote:
<snip> Are there any inputs available that allow the real part of the suite to run for a sufficiently long time? We're going to use criterion in any case given our own expertise with rigorous benchmarking [3,4], but since we've made a case in the past against short running apps on managed runtime systems [5], we'd love to have stuff that runs at least in the order of seconds, while doing useful things. All pointers are much appreciated.
The short answer is no, although some of the benchmarks have tunable input sizes (mainly the spectral ones) and you can 'make mode=slow' to run those with larger inputs.
More generally, the nofib suite really needs an overhaul or replacement. Unfortunately it's a tiresome job and nobody really wants to do it. There have been various abortive efforts, including nobench and HaBench. Meanwhile we in the GHC camp continue to use nofib, mainly because we have some tool infrastructure set up to digest the results (nofib-analyse). Unfortunately nofib has steadily degraded in usefulness over time due to both faster processors and improvements in GHC, such that most of the programs now run for less than 0.1s and are ignored by the tools when calculating averages over the suite.
Right. I have the distinct feeling this is a major gap in the Haskell world. SPEC evolved over time to include larger benchmarks that still exercise the various parts of the hardware, such that the benchmarks do not suddenly achieve a large improvement on a new architecture/implementation simply because, e.g., a larger cache means the working sets remain in the cache for the entire execution. The Haskell community has nothing that remotely resembles a decent suite. You could do experiments and show that over 10K iterations the average execution time per iteration goes from 500ms to 450ms, but what does this really mean?
We have a need not just for plain Haskell benchmarks, but benchmarks that test
 - GHC extensions, so we can catch regressions
 - parallelism (see nofib/parallel)
 - concurrency (see nofib/smp)
 - the garbage collector (see nofib/gc)
I tend to like quantity over quality: it's very common to get just one benchmark in the whole suite that shows a regression or exercises a particular corner of the compiler or runtime. We should only keep benchmarks that have a tunable input size, however.
I would suggest that the first category might be made up of microbenchmarks, as I do not think it is really needed for performance per se. However, the other categories really need long-running benchmarks that use (preferably) heaps of RAM, even when they're well tuned.
Criterion works best on programs that run for short periods of time, because it runs the benchmark at least 100 times, whereas for exercising the GC we really need programs that run for several seconds. I'm not sure how best to resolve this conflict.
I'm not sure about this. Given the fact that there's quite some non-determinism in modern CPUs, and that computer systems seem to behave chaotically [1], I definitely see the need to employ Criterion for longer-running applications as well. It might not need 100 executions, or multiple iterations per execution (incidentally, can those iterations be said to be independent?), but somewhere around 20 to 30 seems to be a minimum.
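(To make the "somewhere around 20 to 30 executions" point concrete: the methodology of [3] comes down to reporting a confidence interval over the per-run times rather than a single average. A small, self-contained sketch of that calculation -- the function name and the sample timings are made up for illustration:)

    -- Mean and 95% confidence interval over a list of per-run execution times,
    -- in the spirit of the statistically rigorous methodology of [3].
    -- The 1.96 factor assumes enough runs (roughly 30) for a normal approximation.
    confidenceInterval95 :: [Double] -> (Double, Double, Double)
    confidenceInterval95 times = (mean, mean - margin, mean + margin)
      where
        n      = fromIntegral (length times)
        mean   = sum times / n
        var    = sum [ (t - mean) ^ (2 :: Int) | t <- times ] / (n - 1)  -- sample variance
        margin = 1.96 * sqrt (var / n)

    main :: IO ()
    main = print (confidenceInterval95 [1.02, 0.98, 1.05, 0.97, 1.01, 1.03])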
Meanwhile, I've been collecting pointers to interesting programs that cross my radar, in anticipation of waking up with an unexpectedly free week in which to pull together a benchmark suite... clearly overoptimistic! But I'll happily pass these pointers on to anyone with the inclination to do it.
I'm definitely interested. If I want to make a strong case for my current research, I really need benchmarks that can be used. Additionally, coming up with a good suite and characterising it can easily result in a decent paper that is certain to be cited numerous times. I think it would have to be a group/community effort though. I've looked through the apps on the Haskell wiki pages, but there's not much usable there, imho. I'd like to illustrate this with the DaCapo benchmark suite [2,3] example. It took a while, but now everybody in the Java camp is (or should be) using these benchmarks. Saying that we just do not want to do this is simply not a tenable position.

-- Andy

[1] Computer systems are dynamical systems, Todd Mytkowicz, Amer Diwan and Elizabeth Bradley, Chaos 19, 033124 (2009); doi:10.1063/1.3187791
[2] The DaCapo benchmarks: Java benchmarking development and analysis, Stephen Blackburn et al., OOPSLA 2006
[3] Wake up and smell the coffee: evaluation methodology for the 21st century, Stephen Blackburn et al., CACM 2008

On 25/06/2010 14:01, Andy Georges wrote:
Right. I have the distinct feeling this is a major gap in the Haskell world. SPEC evolved over time to include larger benchmarks that still exercise the various parts of the hardware, such that the benchmarks do not suddenly achieve a large improvement on a new architecture/implementation simply because, e.g., a larger cache means the working sets remain in the cache for the entire execution. The Haskell community has nothing that remotely resembles a decent suite. You could do experiments and show that over 10K iterations the average execution time per iteration goes from 500ms to 450ms, but what does this really mean?
We have a need not just for plain Haskell benchmarks, but benchmarks that test
 - GHC extensions, so we can catch regressions
 - parallelism (see nofib/parallel)
 - concurrency (see nofib/smp)
 - the garbage collector (see nofib/gc)
I tend to like quantity over quality: it's very common to get just one benchmark in the whole suite that shows a regression or exercises a particular corner of the compiler or runtime. We should only keep benchmarks that have a tunable input size, however.
I would suggest that the first category might be made up of microbenchmarks, as I do not think it is really needed for performance per se. However, the other categories really need long-running benchmarks that use (preferably) heaps of RAM, even when they're well tuned.
The categories you mention aren't necessarily distinct: we have several microbenchmarks that run for a long time and use a lot of heap. For testing the GC, as with other parts of the system, we need both microbenchmarks and larger programs.

Different people want different things from a benchmark suite: if you're demonstrating the efficacy of an optimisation or a language implementation, then you want just the "real" benchmarks, whereas if you're a compiler developer you probably want the microbenchmarks too, because investigating their performance tends to be more tractable, and the hope is that if you optimise all the microbenchmarks then the real programs will take care of themselves (it doesn't always work like that, but it's a good way to start).

So I still very much like the approach taken by the venerable nofib suite where it includes not only the "real" programs, but also the microbenchmarks and the small programs; you don't have to use these in published results, but they're invaluable to us compiler developers, and having a shared framework for all the benchmarks makes things a lot easier.

If we made it *really easy* for people to submit their own programs (e.g. using 'darcs send') then we might get a lot of contributions, from which we could cherry-pick for the "real" benchmark suite, while keeping most/all of the submissions for the "full" suite. Similarly, we should make it really easy for people to run the benchmark suite on their own machines and compilers - make the tools cabal-installable, with easy ways to generate results.
I'm definitely interested. If I want to make a strong case for my current research, I really need benchmarks that can be used. Additionally, coming up with a good suite and characterising it can easily result in a decent paper that is certain to be cited numerous times. I think it would have to be a group/community effort though. I've looked through the apps on the Haskell wiki pages, but there's not much usable there, imho. I'd like to illustrate this with the DaCapo benchmark suite [2,3] example. It took a while, but now everybody in the Java camp is (or should be) using these benchmarks. Saying that we just do not want to do this is simply not a tenable position.
Oh, don't get me wrong - we absolutely do want to do this, it's just difficult to get motivated to actually do it. It's great that you're interested; I'll help any way that I can, and I'll start by digging up some suggestions for benchmarks.

Cheers,
Simon

I'm delighted that you are interested in this benchmarking stuff. Much needed. Thank you!

| So I still very much like the approach taken by the venerable nofib
| suite where it includes not only the "real" programs, but also the
| microbenchmarks and the small programs; you don't have to use these in
| published results, but they're invaluable to us compiler developers, and
| having a shared framework for all the benchmarks makes things a lot easier.

Yes yes. It's *essential* to retain the micro-benchmarks. They often show up in high relief a performance regression that would be hidden or much less prominent in a big program. The three-way split imaginary/spectral/real has served us well. Let's keep it!

Simon

poor man's benchmark :)
http://github.com/nfjinjing/bench-euler
multi-core aware; run it with 'bench-euler +RTS -N2' (where 2 means 2 cores) and
watch your cpu fry :)
On Fri, Jun 25, 2010 at 7:24 AM, Andy Georges wrote:
<snip>
-- jinjing
participants (5)

 - Andy Georges
 - Jinjing Wang
 - Simon Marlow
 - Simon Michael
 - Simon Peyton-Jones