
On 25/06/2010 00:24, Andy Georges wrote:
I've picked up the HaBench/nofib/nobench issue again, as we need a decent set of real applications to explore what people these days call split compilation. We have a framework that can explore GCC optimisations [1] over a multi-objective search space -- the downside there being that the optimisations depend on each other and must be applied in a certain order -- and we extended it to explore a JIT compiler [2], for Java in our case, which posed its own problems. Going one step further, we'd like to explore the trade-offs that can be made when compiling at different levels: source to bytecode (in some sense) and bytecode to native. Given that LLVM is quickly becoming a state-of-the-art framework, and with the recent GHC support for it, we figured that Haskell would be an excellent vehicle for our exploration and research (the fact that some people at our lab have a soft spot for Haskell helps too). Which brings me back to benchmarks.
Are there any inputs available that allow the real part of the suite to run for a sufficiently long time? We're going to use criterion in any case, given our own expertise with rigorous benchmarking [3,4], but since we've argued in the past against short-running apps on managed runtime systems [5], we'd love to have programs that run at least on the order of seconds while doing useful things. All pointers are much appreciated.
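For a sense of what such a criterion harness looks like, here is a minimal sketch; the fib workload is a hypothetical stand-in, not one of the suite's programs. criterion takes care of running each benchmark repeatedly and analysing the timings:

  -- Minimal criterion sketch; `fib` is a hypothetical stand-in workload.
  import Criterion.Main

  fib :: Int -> Integer
  fib n = if n < 2 then fromIntegral n else fib (n - 1) + fib (n - 2)

  main :: IO ()
  main = defaultMain
    [ bgroup "fib"
        [ bench "20" (whnf fib 20)  -- whnf evaluates the result to weak head normal form
        , bench "30" (whnf fib 30)
        ]
    ]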
Or if any of you out there have (recent) apps with inputs that are open source ... let us know.

-- Andy

The short answer is no, although some of the benchmarks have tunable input sizes (mainly the spectral ones) and you can 'make mode=slow' to run those with larger inputs.

More generally, the nofib suite really needs an overhaul or replacement. Unfortunately it's a tiresome job and nobody really wants to do it. There have been various abortive efforts, including nobench and HaBench. Meanwhile, we in the GHC camp continue to use nofib, mainly because we have some tool infrastructure set up to digest the results (nofib-analyse). Unfortunately, nofib has steadily degraded in usefulness over time, due both to faster processors and to improvements in GHC, such that most of the programs now run for less than 0.1s and are ignored by the tools when calculating averages over the suite.

We have a need not just for plain Haskell benchmarks, but for benchmarks that test:

 - GHC extensions, so we can catch regressions
 - parallelism (see nofib/parallel)
 - concurrency (see nofib/smp)
 - the garbage collector (see nofib/gc)

I tend to like quantity over quality: it's very common for just one benchmark in the whole suite to show a regression or exercise a particular corner of the compiler or runtime. We should only keep benchmarks that have a tunable input size, however.

Criterion works best on programs that run for short periods of time, because it runs each benchmark at least 100 times, whereas for exercising the GC we really need programs that run for several seconds. I'm not sure how best to resolve this conflict.

Meanwhile, I've been collecting pointers to interesting programs that cross my radar, in anticipation of waking up with an unexpectedly free week in which to pull together a benchmark suite... clearly overoptimistic! But I'll happily pass these pointers on to anyone with the inclination to do it.

Cheers,
Simon
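A benchmark with the tunable input size described above can be as small as the following sketch, which reads the size from the command line (the work function is a hypothetical stand-in): small sizes suit criterion's repeated runs, while large sizes give the multi-second executions that GC benchmarks need.

  -- Sketch: input size comes from the command line, so one program covers
  -- both short criterion runs and multi-second GC-stressing runs.
  import System.Environment (getArgs)

  -- Hypothetical workload that allocates in proportion to its argument.
  work :: Int -> Int
  work n = sum [ x * x | x <- [1 .. n] ]

  main :: IO ()
  main = do
    args <- getArgs
    let n = case args of
              (a:_) -> read a
              []    -> 1000000   -- default size if none given
    print (work n)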
[1] Kenneth Hoste and Lieven Eeckhout. COLE: Compiler Optimization Level Exploration. CGO 2008.
[2] Kenneth Hoste, Andy Georges, and Lieven Eeckhout. Automated Just-In-Time Compiler Tuning. CGO 2010.
[3] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically Rigorous Java Performance Evaluation. OOPSLA 2007.
[4] Andy Georges, Lieven Eeckhout, and Dries Buytaert. Java Performance Evaluation through Rigorous Replay Compilation. OOPSLA 2008.
[5] Lieven Eeckhout, Andy Georges, and Koen De Bosschere. How Java Programs Interact with Virtual Machines at the Microarchitectural Level. OOPSLA 2003.