proposal: HaBench, a Haskell Benchmark Suite

Hello,

Following up on the threads on haskell and haskell-cafe, I'd like to gather ideas, comments, and suggestions for a standardized Haskell Benchmark Suite.

The idea is to gather a bunch of programs written in Haskell which are representative of the Haskell community (i.e. apps, libraries, ...). Following the example of SPEC (apart from the fact that the SPEC benchmarks aren't available for free), we would like to build a database containing performance measurements for the various benchmarks in the suite. Users should be able to submit their own results. This will hopefully stimulate people to take performance into account when writing a Haskell program or library, and will also serve as a valuable tool for further optimizing both applications written in Haskell and the various Haskell compilers out there (GHC, jhc, nhc, ...).

This thread is meant to gather people's thoughts on the subject. Which programs should we consider for the first version of the Haskell benchmark suite? How should we standardize them, and make them produce reliable performance measurements? Should we only use hardware performance counters, or also do more thorough analysis such as data locality studies? Are there any papers available on this subject? (I know about the paper being written for ICFP as we speak, which uses PAPI as a tool.)

I have created a HaskellWiki page (http://www.haskell.org/haskellwiki/HaBench) in order to centralize ideas and suggestions. Feel free to add anything, and if you're willing to contribute to this project (in any way), add your name to the wiki for future reference.

greetings,

Kenneth (a.k.a. boegel)

--
Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital. (Aaron Levenstein)

Kenneth Hoste
ELIS - Ghent University
kenneth.hoste@elis.ugent.be
http://www.elis.ugent.be/~kehoste
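To make the question of reliable measurement concrete, here is a minimal sketch of a timing harness using only the base library; timeIt is a hypothetical name, and a real suite would also want repeated runs, warm-up, and hardware counters as discussed above.

import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- Time the evaluation of a value to weak head normal form, in seconds.
-- getCPUTime reports picoseconds of CPU time.
timeIt :: a -> IO Double
timeIt x = do
  start <- getCPUTime
  _     <- evaluate x
  end   <- getCPUTime
  return (fromIntegral (end - start) / 1.0e12)

main :: IO ()
main = do
  t <- timeIt (sum [1 .. 10000000 :: Int])
  putStrLn ("elapsed CPU time: " ++ show t ++ "s")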

| Following up on the threads on haskell and haskell-cafe, I'd like to
| gather ideas, comments and suggestions for a standardized Haskell
| Benchmark Suite.

Great idea. Maybe this can subsume nofib. I recommend reading the nofib paper though: http://citeseer.ist.psu.edu/partain93nofib.html

Simon

Kenneth Hoste wrote:
The idea is to gather a bunch of programs written in Haskell which are representative of the Haskell community (i.e. apps, libraries, ...).
A while ago I tried to write a Haskell version of John Harrop's ray-tracer benchmark (http://www.ffconsultancy.com/free/ray_tracer/languages.html), but the performance was not very good (the OCaml version I based it on was at least 10x faster). I would be happy to contribute my code to the benchmark suite if you are interested. Perhaps someone can point out obvious speed-ups that I missed while trying to improve the performance.

--
Alan Falloon

On Fri, Jan 26, 2007 at 10:17:28AM -0500, Al Falloon wrote:
Kenneth Hoste wrote:
The idea is to gather a bunch of programs written in Haskell which are representative of the Haskell community (i.e. apps, libraries, ...).
A while ago I tried to write a Haskell version of John Harrop's ray-tracer benchmark (http://www.ffconsultancy.com/free/ray_tracer/languages.html), but the performance was not very good (the OCaml version I based it on was at least 10x faster).
I would be happy to contribute my code to the benchmark suite if you are interested. Perhaps someone can point out obvious speed-ups that I missed while trying to improve the performance.
I would think that what we'd want to benchmark would be clean, optimized, actually-used code, i.e. things like Data.ByteString, so that we could see how compilers differed on important code, or how the code generated on different architectures differed. E.g. if jhc beats ghc on amd64, the ghc developers would probably be very curious as to why, and how to fix it.

Code that's not been properly optimized with respect to strictness, etc., would fail to focus the tests on important optimizations of the compiler. But of course, the benchmark code should also be clean, since we want to ensure that our compilers are good enough that we can write useful, beautiful code that is also fast. Just my $0.02.

--
David Roundy
http://www.darcs.net
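As a concrete illustration of the kind of kernel David describes, here is a hedged sketch of a benchmark that uses Data.ByteString to count the lines of a large input; the file name is a placeholder, not part of any existing suite.

import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  -- "big-input.txt" is a placeholder; a real suite would ship a
  -- standardized input set.
  contents <- B.readFile "big-input.txt"
  print (B.count '\n' contents)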

David Roundy wrote:
On Fri, Jan 26, 2007 at 10:17:28AM -0500, Al Falloon wrote:
Kenneth Hoste wrote:
The idea is to gather a bunch of programs written in Haskell which are representative of the Haskell community (i.e. apps, libraries, ...).
A while ago I tried to write a Haskell version of John Harrop's ray-tracer benchmark (http://www.ffconsultancy.com/free/ray_tracer/languages.html), but the performance was not very good (the OCaml version I based it on was at least 10x faster).
I would be happy to contribute my code to the benchmark suite if you are interested. Perhaps someone can point out obvious speed-ups that I missed while trying to improve the performance.
I would think that what we'd want to benchmark would be clean, optimized, actually-used code, i.e. things like Data.ByteString, so that we could see how compilers differed on important code, or how the code generated on different architectures differed. E.g. if jhc beats ghc on amd64, the ghc developers would probably be very curious as to why, and how to fix it.
Code that's not been properly optimized with respect to strictness, etc, would fail to focus the tests on important optimizations of the compiler. But of course, the benchmark code should also be clean, since we want to ensure that our compilers are good enough that we can write useful beautiful code that is also fast.
I tried to optimize it, but I couldn't approach the speed of the OCaml version. I followed the performance tuning advice from the wiki, and had even resorted to writing the inner-loop calculations using all unboxed doubles, but without significant improvement. This is exactly the kind of code that I write most often, and I would love to see improvements in the optimizations for this kind of numerically intensive code (especially without having to resort to compiler-specific unboxed representations).

I agree that common libraries like ByteString need to be well represented, but the original request additionally included programs that are representative of applications. A ray-tracer (even with a fixed scene and only one type of scene primitive) is a fairly nice approximation of a real numerical batch-oriented application while still being small enough to understand and modify. I expect that's why John chose it as his benchmark application in the first place.

I think it would also be a good idea to include SPJ's web server (I think it's from an STM paper). A lot of the people outside the Haskell community will be able to relate better to metrics like pages/second.
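The portable alternative to the compiler-specific unboxed representations Al alludes to is strict, UNPACKed fields plus bang patterns in the inner loop. A minimal sketch with illustrative names (Vec, dot, sumDots), not Al's actual code:

{-# LANGUAGE BangPatterns #-}

-- Strict, UNPACKed fields let GHC store the components as raw doubles.
data Vec = Vec {-# UNPACK #-} !Double
               {-# UNPACK #-} !Double
               {-# UNPACK #-} !Double

dot :: Vec -> Vec -> Double
dot (Vec x1 y1 z1) (Vec x2 y2 z2) = x1*x2 + y1*y2 + z1*z2

-- A strict accumulator keeps the inner loop from building thunks.
sumDots :: [(Vec, Vec)] -> Double
sumDots = go 0
  where
    go !acc []           = acc
    go !acc ((a, b) : r) = go (acc + dot a b) r

With -O2 this can typically run on unboxed doubles throughout, without any compiler-specific source syntax.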

Al Falloon wrote:
David Roundy wrote:
On Fri, Jan 26, 2007 at 10:17:28AM -0500, Al Falloon wrote:
Kenneth Hoste wrote:
The idea is to gather a bunch of programs written in Haskell which are representative of the Haskell community (i.e. apps, libraries, ...).
A while ago I tried to write a Haskell version of John Harrop's ray-tracer benchmark (http://www.ffconsultancy.com/free/ray_tracer/languages.html), but the performance was not very good (the OCaml version I based it on was at least 10x faster).
I would be happy to contribute my code to the benchmark suite if you are interested. Perhaps someone can point out obvious speed-ups that I missed while trying to improve the performance.
I would think that what we'd want to benchmark would be clean, optimized, actually-used code, i.e. things like Data.ByteString, so that we could see how compilers differed on important code, or how the code generated on different architectures differed. E.g. if jhc beats ghc on amd64, the ghc developers would probably be very curious as to why, and how to fix it.
Code that's not been properly optimized with respect to strictness, etc, would fail to focus the tests on important optimizations of the compiler. But of course, the benchmark code should also be clean, since we want to ensure that our compilers are good enough that we can write useful beautiful code that is also fast.
I tried to optimize it, but I couldn't approach the speed of the OCaml version. I followed the performance tuning advice from the Wiki, and had even resorted to writing the inner loop calculations using all unboxed doubles, but without significant improvements. This is exactly the kind of code that I write most often, and I would love to see improvements in the optimizations for this kind of numerically intensive code (especially without having to resort to compiler-specific unboxed representations).
I agree that common libraries like ByteString need to be well represented, but the original request additionally included programs that are representative of applications. A ray-tracer (even with a fixed scene and only one type of scene primitive) is a fairly nice approximation of a real numerical batch-oriented application while still being small enough to understand and modify. I expect that's why John chose it as his benchmark application in the first place.
Writing numeric code that processes Doubles is hard to optimize. See the shootout example for the n-body problem:

http://haskell.org/haskellwiki/Shootout/Nbody
http://shootout.alioth.debian.org/debian/benchmark.php?test=nbody&lang=all

Some of the best work on making Haskell perform wonderfully at numerics is summarized at http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell and in the status report http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html

On Fri, Jan 26, 2007 at 05:25:13PM +0000, Chris Kuklewicz wrote:
I agree that common libraries like ByteString need to be well represented, but the original request additionally included programs that are representative of applications. A ray-tracer (even with a fixed scene and only one type of scene primitive) is a fairly nice approximation of a real numerical batch-oriented application while still being small enough to understand and modify. I expect that's why John chose it as his benchmark application in the first place.
Writing numeric code that processes Doubles is hard to optimize. See the shootout example for the n-body problem:
I agree that numerics and Doubles are very important, but am of the opinion that we'll be better off if we (try to?) restrict ourselves to code that is really used by someone who really cares about performance enough to optimize it. Mostly because if the benchmark suite is going to serve as a useful guide for compiler optimization writers, it needs to reflect the performance of code that at least *someone* cares about.

Artificial benchmarks all too often can be improved by artificial optimizations that only benefit artificial benchmarks, e.g. the famous specmark log(sqrt(x)) optimization (which is equal to 0.5*log(x), but no decent programmer would ever write that code, and it's a special case the compiler shouldn't bother looking for).

--
David Roundy
http://www.darcs.net
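For reference, the identity behind that anecdote is log (sqrt x) == 0.5 * log x for x > 0, which a few lines of Haskell can sanity-check (allowing for floating-point rounding); the name check is ours, not from any suite:

-- log (sqrt x) equals 0.5 * log x for positive x, up to rounding.
check :: Double -> Bool
check x = abs (log (sqrt x) - 0.5 * log x) < 1e-12

main :: IO ()
main = print (all check [0.1, 1, 2, 10, 1e6])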

Hello David,

Saturday, January 27, 2007, 8:48:39 PM, you wrote:
I agree that numerics and Doubles are very important, but am of the opinion that we'll be better off if we (try to?) restrict ourselves to code that is really used by someone who really cares about performance enough to optimize it.
He should write imperative code, then. Last February we discussed why GHC doesn't generate good code for low-level imperative Haskell programs; you can see this discussion in the GHC mailing list archives. The problem has been well known since GHC's birth, and GHC HQ plan to fix it by writing their own low-level optimizer. Of course, it's hard to approach gcc, but results close to OCaml may be accomplished; the only problem is the lack of spare developer hands :)

--
Best regards,
Bulat
mailto:Bulat.Ziganshin@gmail.com

On Wed, Jan 31, 2007 at 02:58:20AM +0300, Bulat Ziganshin wrote:
Hello David,
Saturday, January 27, 2007, 8:48:39 PM, you wrote:
I agree that numerics and Doubles are very important, but am of the opinion that we'll be better off if we (try to?) restrict ourselves to code that is really used by someone who really cares about performance enough to optimize it.
He should write imperative code, then. Last February we discussed why GHC doesn't generate good code for low-level imperative Haskell programs; you can see this discussion in the GHC mailing list archives. The problem has been well known since GHC's birth, and GHC HQ plan to fix it by writing their own low-level optimizer. Of course, it's hard to approach gcc, but results close to OCaml may be accomplished; the only problem is the lack of spare developer hands :)
It all depends on what that someone is doing. If he's doing linear algebra, for example, he shouldn't be writing low-level code for anything but O(N) operations in any language, because that's stupid. And factors of two in the speed of O(N) operations don't much matter. But it's still nice to see that Haskell implementations can do a decent job on O(N) applications, and there are certainly cases where a clean, optimized Haskell library should be able to outperform a clean, optimized C++ library (if you can agree with me that expression templates are dirty), due to the possibility of array fusion.

--
David Roundy
Department of Physics
Oregon State University
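A minimal sketch of the fusion argument, on plain lists rather than any particular array library; axpy is an illustrative name, and whether fusion actually fires depends on GHC's rewrite rules and -O2:

-- sum consumes what zipWith produces; GHC's foldr/build fusion can
-- compile this to a single loop with no intermediate list.
axpy :: Double -> [Double] -> [Double] -> Double
axpy a xs ys = sum (zipWith (\x y -> a * x + y) xs ys)

main :: IO ()
main = print (axpy 2 [1 .. 1000000] [1 .. 1000000])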

At Fri, 26 Jan 2007 07:23:26 -0800, David Roundy wrote:
I would think that what we'd want to benchmark would be clean, optimized actually-used code.
Maybe there should be two versions of each benchmark:

1) a clean, simple, idiomatic version, aka the code we would like to write if performance was not an issue
2) a super-hand-optimized version

This would hopefully show which benchmarks have the biggest room for improvement. And it would provide information to the people working on the compilers as to what optimizations are actually needed for that test case.

j.
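To make the pairing concrete on a tiny kernel (summing squares), here is a hedged sketch; both function names are hypothetical and the kernel is only illustrative:

{-# LANGUAGE BangPatterns #-}

-- (1) the idiomatic version: the code we would like to write
sumSquares :: Int -> Int
sumSquares n = sum [ i * i | i <- [1 .. n] ]

-- (2) a hand-optimized version: an explicit strict accumulator loop
sumSquares' :: Int -> Int
sumSquares' n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i * i) (i + 1)

The gap between the two, per compiler, is exactly the room for improvement Jeremy mentions.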

Hello David,

Friday, January 26, 2007, 6:23:26 PM, you wrote:
performance was not very good (the OCaml version I based it on was at least 10x faster).
I would think that what we'd want to benchmark would be clean, optimized, actually-used code, i.e. things like Data.ByteString, so that we could see
So you propose to benchmark only low-level imperative code? :)

(Correct me if I'm wrong, but everything fast I've seen, including FPS, is just imperative code.)

--
Best regards,
Bulat
mailto:Bulat.Ziganshin@gmail.com

On Wed, Jan 31, 2007 at 01:56:32AM +0300, Bulat Ziganshin wrote:
Hello David,
Friday, January 26, 2007, 6:23:26 PM, you wrote:
performance was not very good (the OCaml version I based it on was at least 10x faster).
I would think that what we'd want to benchmark would be clean, optimized, actually-used code, i.e. things like Data.ByteString, so that we could see
so you propose to benchmark only low-level imperative code? :)
(correct me if i'm wrong, but everything fast i've seen, including FPS, is just imperative code)
I guess it depends what you mean by imperative, and I'll admit I haven't looked at Data.ByteString's internals recently, but most of the code in FastPackedString (its predecessor) is just using a ForeignPtr as an array, which isn't imperative at all. Certainly it (almost?) never uses a mutable variable. So what do you mean by "imperative"? In any case, I meant code that *uses* Data.ByteString, which is certainly purely functional.

--
David Roundy
Department of Physics
Oregon State University

Hi,
Following up and the threads on haskell and haskell-cafe, I'd like to gather ideas, comments and suggestions for a standarized Haskell Benchmark Suite.
The idea is to gather a bunch of programs written in Haskell, and which are representative for the Haskell community (i.e. apps, libraries, ...). Following the example of SPEC (besides the fact that the SPEC benchmarks aren't available for free), we would like to build a database containing performance measurements for the various benchmarks in the suite. Users should be able to submit their results. This will hopefully stimulate people to take performance into account when writing a Haskell program/library, and will also serve as a valuable tool for further optimizing both applications written in Haskell and the various Haskell compilers out there (GHC, jhc, nhc, ...).
This thread is meant to gather peoples thought on this subject. Which programs should we consider for the first version of the Haskell benchmark suite? How should we standarize them, and make them produce reliable performance measurement? Should we only use hardware performance counters, or also do more thorough analysis such as data locality studies, ... Are there any papers available on this subject (I know about the paper which is being written as we speak ICFP, which uses PAPI as a tool).
I think that we should have, as David Roundy pointed out, a restriction to code that is actually used frequently. However, I think we should make a distinction between micro-benchmarks, which test some specific item, and real-life benchmarks. When using micro-benchmarks, the wrong conclusions may be drawn, because e.g. code or data can be completely cached, there are no TLB misses after startup, etc. I think that if somebody is interested in knowing how Haskell performs, and whether he should use it for his development, it is nice to know that e.g. Data.ByteString performs as well as C, but it would be even nicer to see that large, real-life apps can reach that same performance.

There is more to the Haskell runtime than simply executing application code, and these things should also be taken into account. Also, I think that having several compilers for the benchmark set is a good idea, because, afaik, they can provide different runtime systems as well. We know that in Java the VM can have a significant impact on behaviour on the microprocessor; I think that Haskell may have similar issues. Also, similar to SPEC CPU, it would be nice to have input sets for each benchmark that gets included in the set. Furthermore, I think that we should provide a rigorous analysis of the benchmarks on as many platforms as is feasible. See e.g. the analysis done for the DaCapo Java benchmark suite, published at OOPSLA 2006.

--
Andy

On Jan 28, 2007, at 8:51 AM, Andy Georges wrote:
it is nice to know that e.g. Data.ByteString performs as well as C, but it would be even nicer to see that large, real-life apps can reach that same performance.
What about using darcs as a benchmark? I heard people say it's slow. The undercurrent is that it's slow because it's written in Haskell. -- http://wagerlabs.com/

Hi
What about using darcs as a benchmark? I heard people say it's slow. The undercurrent is that it's slow because it's written in Haskell.
It's slow because some algorithms are O(stupid value). Some operations (I've been told) would take hundreds of years to terminate. That has nothing to do with Haskell. Also, it's not written in Haskell, it's written in GHC-Haskell. This benchmark suite should try, where possible, to be a Haskell benchmark.

Thanks

Neil

On 28 Jan 2007, at 12:57, Joel Reymont wrote:
On Jan 28, 2007, at 8:51 AM, Andy Georges wrote:
it is nice to know that e.g. Data.ByteString performs as well as C, but it would be even nicer to see that large, real-life apps can reach that same performance.
What about using darcs as a benchmark? I heard people say it's slow. The undercurrent is that it's slow because it's written in Haskell.
I have pondered that. What would the input set be? And how would we repeatedly run the benchmark? Should we just have a recording phase? Or a diff phase? It seems difficult to have a version control system as a benchmark.

--
Andy

On Sun, Jan 28, 2007 at 10:36:50PM +0100, Andy Georges wrote:
On 28 Jan 2007, at 12:57, Joel Reymont wrote:
On Jan 28, 2007, at 8:51 AM, Andy Georges wrote:
it is nice to know that e.g. Data.ByteString performs as well as C, but it would be even nicer to see that large, real-life apps can reach that same performance.
What about using darcs as a benchmark? I heard people say it's slow. The undercurrent is that it's slow because it's written in Haskell.
I have pondered about that. What would the input set be? And how to repeatedly run the benchmark? Should we just have a recording phase? Or a diff phase? It seems difficult to have a VC system as a benchmark.
We darcs folk would love to have a darcs benchmark, and Jason has even put some work into a simulator driver (which would call various darcs commands to build up a repository). But I don't think this would be useful as a Haskell benchmark. We'd love to have automatic performance regression testing, though!

Better (for Haskell benchmarking) would be to take a small (but relevant) part of darcs, and benchmark that. Which is effectively what a Data.ByteString benchmark could do. Darcs spends vast amounts of time breaking huge files into lines (or it's been doing that for me recently) and running its LCS algorithm on those lists of lines. This wouldn't be a bad benchmark: just a huge LCS job. You could algorithmically generate two huge strings and then compute their LCS. Actually, darcs no longer uses a true LCS (neither does diff), so you might try our LCS substitute.

--
David Roundy
http://www.darcs.net
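A sketch of the generated-input LCS job David describes, using the classic O(n*m) row-by-row dynamic program rather than darcs's actual LCS substitute; the inputs and sizes are arbitrary:

import Data.List (foldl')

-- Length of the longest common subsequence, computed one row at a time.
-- Each finished row is forced (via sum) so thunks don't pile up.
lcsLength :: Eq a => [a] -> [a] -> Int
lcsLength xs ys = last (foldl' step row0 xs)
  where
    row0 = replicate (length ys + 1) 0
    step prev x = let row = scanl pick 0 (zip3 ys prev (tail prev))
                  in sum row `seq` row
      where
        pick left (y, diag, up)
          | x == y    = diag + 1
          | otherwise = max left up

main :: IO ()
main = do
  -- two algorithmically generated inputs
  let a = take 2000 (cycle "abcab")
      b = take 2000 (cycle "acbba")
  print (lcsLength a b)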

| I think that we should have, as David Roundy pointed out, a
| restriction to code that is actually used frequently. However, I
| think we should make a distinction between micro-benchmarks, that
| test some specific item, and real-life benchmarks.

As many of you will know, the nofib benchmark suite made just such a distinction from the beginning. In fact, it has 3 sub-suites:

* Imaginary: very tiny programs whose merit is that they sometimes hit missing optimisations very hard indeed. Useful for compiler writers.
* Spectral: these are often called "kernels" and are meant to be typical of the inner loops of some real programs.
* Real: these are unadulterated real applications, of various sizes.

We found these categories to be useful and robust, and I think they'd be useful for the new suite. In particular, the imaginary suite is useless for (say) choosing a compiler, but fantastic for exposing particular weak spots. But if the imaginary programs were mixed with the real ones, the whole thing would lose credibility.

You can find more info here: http://research.microsoft.com/~simonpj/Papers/nofib.ps

Simon

On 29/01/07, Simon Peyton-Jones wrote:
We found these categories to be useful and robust, and I think they'd be useful for the new suite. In particular, the imaginary suite is useless for (say) choosing a compiler, but fantastic for exposing particular weak spots. But if the imaginary programs were mixed with the real ones, the whole thing would lose credibility.
One thing I'd like to see, but I have no idea if it would be practical, is an attempt to distinguish between "idiomatic" and "highly optimised" benchmarks. Given that optimising Haskell code is a complex problem - and often unfamiliar to programmers coming from other backgrounds - this would be useful for isolating how well a compiler does at dealing with code which is *not* written purely for speed. As I say, though, this may be impossible to achieve in practice (although something may be possible at the "imaginary program" level). Paul.
participants (11)

- Al Falloon
- Andy Georges
- Bulat Ziganshin
- Chris Kuklewicz
- David Roundy
- Jeremy Shaw
- Joel Reymont
- Kenneth Hoste
- Neil Mitchell
- Paul Moore
- Simon Peyton-Jones