
I'm not sure what changed, but after rerunning it I get the expected results:
anatolys-MacBook:rbm anatolyy$ dist/build/proto/proto +RTS -N2
benchmarking P
time 1.791 s (1.443 s .. 2.304 s)
0.991 R² (0.974 R² .. 1.000 R²)
mean 1.803 s (1.750 s .. 1.855 s)
std dev 90.06 ms (0.0 s .. 90.90 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking S
time 3.225 s (2.685 s .. 3.837 s)
0.996 R² (0.985 R² .. 1.000 R²)
mean 3.033 s (2.857 s .. 3.142 s)
std dev 165.0 ms (0.0 s .. 188.7 ms)
variance introduced by outliers: 19% (moderately inflated)
perf log written to dist/perf-mmult.html
anatolys-MacBook:rbm anatolyy$ dist/build/proto/proto +RTS -N4
benchmarking P
time 1.851 s (1.326 s .. 2.316 s)
0.990 R² (0.964 R² .. 1.000 R²)
mean 1.784 s (1.693 s .. 1.901 s)
std dev 106.3 ms (0.0 s .. 119.8 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking S
time 3.329 s (3.041 s .. 3.944 s)
0.996 R² (0.993 R² .. 1.000 R²)
mean 3.173 s (3.100 s .. 3.244 s)
std dev 119.6 ms (0.0 s .. 121.9 ms)
variance introduced by outliers: 19% (moderately inflated)
perf log written to dist/perf-mmult.html
anatolys-MacBook:rbm anatolyy$ dist/build/proto/proto +RTS -N
benchmarking P
time 1.717 s (1.654 s .. 1.830 s)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.717 s (1.701 s .. 1.728 s)
std dev 16.64 ms (0.0 s .. 19.20 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking S
time 3.127 s (3.079 s .. 3.222 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.105 s (3.094 s .. 3.116 s)
std dev 18.12 ms (543.9 as .. 18.50 ms)
variance introduced by outliers: 19% (moderately inflated)
perf log written to dist/perf-mmult.html
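
(The benchmark itself is in the gist linked further down the thread. For readers without it, a minimal criterion harness of the same general shape, using mmultP/mmultS from repa-algorithms, would look roughly like the sketch below; the sizes, seeds, and benchmark names here are assumptions, not the gist's actual code.)

    import Criterion.Main
    import Data.Array.Repa                      (Array, DIM2, U, Z (..), (:.) (..))
    import Data.Array.Repa.Algorithms.Matrix    (mmultP, mmultS)
    import Data.Array.Repa.Algorithms.Randomish (randomishDoubleArray)

    -- Build with -threaded -rtsopts -O2 so the +RTS -N<n> flags above are accepted.
    main :: IO ()
    main = do
      let sh = Z :. 1024 :. 1024 :: DIM2   -- 1024x1024 inputs; the gist may differ
          a  = randomishDoubleArray sh 0 1 42 :: Array U DIM2 Double
          b  = randomishDoubleArray sh 0 1 43 :: Array U DIM2 Double
      defaultMain
        [ bench "P" (whnfIO (mmultP a b))  -- parallel product (computeP inside)
        , bench "S" (whnf (mmultS a) b)    -- sequential product (computeS inside)
        ]
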
On Thu, Jan 14, 2016 at 11:22 AM, Thomas Miedema wrote:
To avoid any confusion, this was a reply to the following email:
On Fri, Mar 13, 2015 at 6:23 PM, Anatoly Yakovenko wrote:
https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
So on my MacBook I am seeing results with -N4 that are basically no better than the sequential computation for the matrix multiply algorithm. Any idea why?
Thanks, Anatoly
On Thu, Jan 14, 2016 at 8:19 PM, Thomas Miedema wrote:
Anatoly: I also ran your benchmark and cannot reproduce your findings.
Note that GHC does not make effective use of hyperthreads (https://ghc.haskell.org/trac/ghc/ticket/9221#comment:12), so don't use -N4 when you have only a dual-core machine. Maybe that's why you were getting bad results? I also notice a `NaN` in one of your timing results; I don't know how that is possible, or whether it affected your results. Could you try running your benchmark again, but this time with -N2?
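
A quick way to see what the RTS actually has to work with (bare -N sets the number of capabilities from the OS-reported processor count, which includes hyperthreads) is something like the snippet below; this is just an illustration, not part of the original benchmark:

    import GHC.Conc (getNumCapabilities, getNumProcessors)

    main :: IO ()
    main = do
      caps  <- getNumCapabilities  -- whatever +RTS -N<n> (or bare -N) selected
      procs <- getNumProcessors    -- logical CPUs reported by the OS, hyperthreads included
      putStrLn $ "capabilities: " ++ show caps ++ ", logical CPUs: " ++ show procs

On a dual-core MacBook with hyperthreading this reports 4 logical CPUs, which is how a bare -N ends up behaving like -N4 on that machine.
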
On Sat, Mar 14, 2015 at 5:21 PM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
dense matrix product is not an algorithm that makes sense in repa's execution model,
Matrix multiplication is the first example in the first repa paper: http://benl.ouroborus.net/papers/repa/repa-icfp2010.pdf. Look at figures 2 and 7.
"we measured very good absolute speedup, ×7.2 for 8 cores, on multicore hardware"
Doing a quick experiment with 2 threads (my laptop doesn't have more cores):
$ cabal install repa-examples # I did not bother with `-fllvm` ...
$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204
elapsedTimeMS = 6491
$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204 +RTS -N2
elapsedTimeMS = 3393
This is with GHC 7.10.3 and repa-3.4.0.1 (and dependencies from http://www.stackage.org/snapshot/lts-3.22).