How to correctly benchmark code with Criterion?

Dear list,

during the past few days I have spent a lot of time trying to figure out how to write Criterion benchmarks so that results don't get skewed by lazy evaluation. I want to benchmark different versions of an algorithm doing numerical computations on a vector. For that I need to create an input vector containing a few thousand elements. I decided to create random data, but that really doesn't matter - I could just as well have used infinite lists instead of random ones.

My problem is that I am not certain whether I am creating my benchmark correctly. I wrote a function that creates data like this:

  dataBuild :: RandomGen g => g -> ([Double], [Double])
  dataBuild gen = (take 6 $ randoms gen, take 2048 $ randoms gen)

And I create the benchmark like this:

  bench "Lists" $ nf L.benchThisFunction (L.dataBuild gen)

The question is how to generate data so that its evaluation won't be included in the benchmark. I already asked this question on StackOverflow ( http://stackoverflow.com/questions/12896235/how-to-create-data-for-criterion... ) and the answer was to use evaluate + force. After spending a day testing this approach I came to the conclusion that it does not seem to influence the results of a benchmark in any way (I did stuff like unsafePerformIO + delayThread). On the other hand, I looked into the sources of criterion and saw that the benchmarked code is run like this:

  evaluate (rnf (f x))

I am a Haskell newbie and perhaps don't interpret this correctly, but to me it looks as though criterion does not evaluate the possibly unevaluated parameter x before running the benchmark, but instead evaluates the final result. Can someone explain how exactly this works, and how I should write my benchmarks so that the results are correct?

Janek

Hi Janek, On 18/10/12 10:23, Janek S. wrote:
during the past few days I have spent a lot of time trying to figure out how to write Criterion benchmarks so that results don't get skewed by lazy evaluation. I want to benchmark different versions of an algorithm doing numerical computations on a vector. For that I need to create an input vector containing a few thousand elements. I decided to create random data, but that really doesn't matter - I could just as well have used infinite lists instead of random ones.
[snip]
The question is how to generate data so that its evaluation won't be included in the benchmark.
Something like this might work; I'm not sure what the canonical way is.

---8<---
main = do
  ...
  let input = L.dataBuild gen
  evaluate (rnf input)
  defaultMain
    ...
    bench "Lists" $ nf L.benchThisFunction input
    ...
---8<---

I did use something like this in practice here: https://gitorious.org/bitwise/bitwise/blobs/master/extra/benchmark.hs#line15...

Thanks,

Claude
-- http://mathr.co.uk
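To make the pattern concrete, here is a self-contained sketch that runs with just base and deepseq. The deterministic data and the summing function are stand-ins for L.dataBuild and L.benchThisFunction (which are not shown in the thread); in the real program the final print would be Criterion's defaultMain [bench "Lists" $ nf benchThisFunction input].

```haskell
import Control.DeepSeq (rnf)
import Control.Exception (evaluate)
import Data.List (foldl')

-- Stand-in for L.dataBuild: deterministic data of the same shape.
dataBuild :: ([Double], [Double])
dataBuild = (map sin [1 .. 6], map cos [1 .. 2048])

-- Stand-in for L.benchThisFunction: some numeric work over the input.
benchThisFunction :: ([Double], [Double]) -> Double
benchThisFunction (xs, ys) = foldl' (+) 0 xs + foldl' (+) 0 ys

main :: IO ()
main = do
  let input = dataBuild
  -- Force the input to normal form up front, so that building it
  -- is not part of what the benchmark later measures.
  evaluate (rnf input)
  print (benchThisFunction input)
  -- with Criterion: defaultMain [bench "Lists" $ nf benchThisFunction input]
```
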

I don't know if you have already read them,
but Tibell's slides on High Performance Haskell are pretty good:
http://www.slideshare.net/tibbe/highperformance-haskell
There is a section at the end where he runs several tests using Criterion.
HTH,
A.

Something like this might work, not sure what the canonical way is. (...)
This is basically the same as the answer I was given on SO. My concerns about this solution are:

- rnf requires its parameter to belong to the NFData type class. This is not the case for some data structures, like Repa arrays.
- evaluate only evaluates its argument to WHNF - is this enough? If I have a tuple containing two lists, won't this only evaluate the tuple constructor and leave the lists as thunks? This is actually the case in my code.

As I said previously, it seems that Criterion somehow evaluates the data so that the time needed for its creation is not included in the benchmark. I modified my dataBuild function to look like this:

  dataBuild gen = unsafePerformIO $ do
      let x = (take 6 $ randoms gen, take 2048 $ randoms gen)
      delayThread 1000000
      return x

When I ran the benchmark, criterion estimated the time needed to complete it at over 100 seconds (which means that delayThread worked and was used as a basis for the estimate), but the benchmark finished much faster and there was no difference in the final result compared to the normal dataBuild function. This suggests that once the data was created and used for estimation, the dataBuild function was not used again. The main question is: is this observation correct? In this question on SO: http://stackoverflow.com/questions/6637968/how-to-use-criterion-to-measure-p... one of the answers says that there is no automatic memoization, while it looks like the values of dataBuild are in fact memoized. I have a feeling that I am misunderstanding something.
I don't know if you have already read them, but Tibell's slides on High Performance Haskell are pretty good:
http://www.slideshare.net/tibbe/highperformance-haskell
There is a section at the end where he runs several tests using Criterion.

I skimmed the slides, and slide 59 seems to show that my concerns regarding WHNF might be true.
Janek

On 18 October 2012 13:15, Janek S.
Something like this might work, not sure what the canonical way is. (...)
This is basically the same as the answer I was given on SO. My concerns about this solution are:
- rnf requires its parameter to belong to the NFData type class. This is not the case for some data structures, like Repa arrays.
For unboxed arrays of primitive types WHNF = NF. That is, once the array is constructed all its elements will be in WHNF.
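This is easy to observe with the unboxed arrays from the array package (a base-boot-library sketch; the arrays here are illustrative): merely evaluating an unboxed array to WHNF builds it, which forces every element, so a bottom element surfaces immediately.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.Array.Unboxed (UArray, listArray)

-- For an unboxed array, reaching WHNF constructs the whole array,
-- which forces every element: WHNF and NF coincide.
good, bad :: UArray Int Double
good = listArray (0, 2) [1, 2, 3]
bad  = listArray (0, 2) [1, error "unevaluated element", 3]

main :: IO ()
main = do
  g <- try (evaluate good) :: IO (Either SomeException (UArray Int Double))
  b <- try (evaluate bad)  :: IO (Either SomeException (UArray Int Double))
  putStrLn (either (const "good: threw?!")
                   (const "good: WHNF ok, all elements forced") g)
  putStrLn (either (const "bad: WHNF already forced the bottom element")
                   (const "bad: survived?!") b)
```
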
- evaluate only evaluates its argument to WHNF - is this enough? If I have a tuple containing two lists, won't this only evaluate the tuple constructor and leave the lists as thunks? This is actually the case in my code.
That is why you use "rnf" from the NFData type class. You use "evaluate" to kick-start rnf which then goes ahead and evaluates everything (assuming the NFData instance has been defined correctly.)
As I said previously, it seems that Criterion somehow evaluates the data so that the time needed for its creation is not included in the benchmark. I modified my dataBuild function to look like this:
  dataBuild gen = unsafePerformIO $ do
      let x = (take 6 $ randoms gen, take 2048 $ randoms gen)
      delayThread 1000000
      return x
When I ran the benchmark, criterion estimated the time needed to complete it at over 100 seconds (which means that delayThread worked and was used as a basis for the estimate), but the benchmark finished much faster and there was no difference in the final result compared to the normal dataBuild function. This suggests that once the data was created and used for estimation, the dataBuild function was not used again. The main question is: is this observation correct? In this question on SO: http://stackoverflow.com/questions/6637968/how-to-use-criterion-to-measure-p... one of the answers says that there is no automatic memoization, while it looks like the values of dataBuild are in fact memoized. I have a feeling that I am misunderstanding something.
If you bind an expression to a variable and then reuse that variable, the expression is only evaluated once. That is, in "let x = expr in ..." the expression is only evaluated once. However, if you have "f y = let x = expr in ..." then the expression is evaluated once per function call.
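The sharing of a let binding can be made visible with a counter. This sketch (names and the unsafePerformIO counter are purely for observation, not a benchmarking technique) shows that a let-bound expression used twice is still evaluated only once:

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- Counts how many times 'expensive' is actually evaluated.
counter :: IORef Int
counter = unsafePerformIO (newIORef 0)
{-# NOINLINE counter #-}

expensive :: Int -> Int
expensive n = unsafePerformIO $ do
  atomicModifyIORef' counter (\c -> (c + 1, ()))
  return (n * n)
{-# NOINLINE expensive #-}

main :: IO ()
main = do
  let x = expensive 7   -- bound to a variable once
  print (x + x)         -- forces x once; the second use reuses the value
  evals <- readIORef counter
  print evals           -- 1: "let x = expr" evaluates expr only once
```

Run without optimization (e.g. runghc), this prints 98 and then 1: the thunk bound to x is updated in place after its first use, exactly the memoization-like behaviour observed in the thread.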
I don't know if you have already read them, but Tibell's slides on High Performance Haskell are pretty good:
http://www.slideshare.net/tibbe/highperformance-haskell
There is a section at the end where he runs several tests using Criterion.

I skimmed the slides, and slide 59 seems to show that my concerns regarding WHNF might be true.
It's usually safe if you benchmark a function. However, you most likely want the result to be in normal form. The "nf" function does this for you. So, if your benchmark function has type "f :: X -> ([Double], Double)", your benchmark will be:

  bench "f" (nf f input)

The first run will evaluate the input (and discard the runtime) and all subsequent runs will evaluate the result to normal form. For repa you can use deepSeqArray [1] if your array is not unboxed:

  bench "f'" (whnf (deepSeqArray . f) input)

[1]: http://hackage.haskell.org/packages/archive/repa/3.2.2.2/doc/html/Data-Array...
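The same composition trick works for any result type with an NFData instance: benchmarking whnf (force . f) input behaves like nf f input, because forcing the composed result to WHNF forces it all the way down. A minimal sketch (f here is a made-up function with the type from the text; force from Control.DeepSeq plays the role deepSeqArray plays for repa):

```haskell
import Control.DeepSeq (NFData, force)

-- A made-up example function of the type discussed above.
f :: Int -> ([Double], Double)
f n = (map fromIntegral [1 .. n], fromIntegral n / 2)

-- whnf (forced f) input then behaves like nf f input:
-- WHNF of the composed result implies full evaluation.
forced :: NFData b => (a -> b) -> a -> b
forced g = force . g

main :: IO ()
main = print (forced f 4)  -- ([1.0,2.0,3.0,4.0],2.0)
```
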

Thank you very much Thomas. This is the kind of explanation I needed!

Janek

On Thu, Oct 18, 2012 at 4:23 AM, Janek S.
[snip]
bench "Lists" $ nf L.benchThisFunction (L.dataBuild gen)
The argument value will be evaluated by the first run of the benchmark, and then laziness will keep the value around for the next few hundred runs that the "bench" function performs. So the evaluation will be included in the benchmark, but if "bench" is doing enough trials it will be statistical noise.

Antoine

So the evaluation will be included in the benchmark, but if "bench" is doing enough trials it will be statistical noise.

When I intentionally delayed my dataBuild function (using delayThread 1000000), the estimated time of the benchmark was incorrect, but when I got the final results all runs were below 50 ms, which means that the initial run that took 1 second was discarded. So it seems to me that the first evaluation is discarded. It would be good if someone could definitively confirm that.
Janek

On Thu, Oct 18, 2012 at 9:06 AM, Janek S.
[snip] It would be good if someone could definitively confirm that.
You could email Bryan, the author of criterion.

Yes, Criterion always discards the time of the first evaluation.
-- Push the envelope. Watch it bend.
participants (5)
- Alfredo Di Napoli
- Antoine Latter
- Claude Heiland-Allen
- Janek S.
- Thomas Schilling