
To avoid any confusion, this was a reply to the following email:
On Fri, Mar 13, 2015 at 6:23 PM, Anatoly Yakovenko wrote:
https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
So I am basically seeing results with -N4 that are no better than sequential computation on my MacBook for the matrix multiply algorithm. Any idea why?
Thanks, Anatoly
On Thu, Jan 14, 2016 at 8:19 PM, Thomas Miedema wrote:
Anatoly: I also ran your benchmark, and cannot reproduce your findings.
Note that GHC does not make effective use of hyperthreads (https://ghc.haskell.org/trac/ghc/ticket/9221#comment:12). So don't use -N4 when you have only a dual-core machine. Maybe that's why you were getting bad results? I also noticed a `NaN` in one of your timing results. I don't know how that is possible, or whether it affected your results. Could you try running your benchmark again, this time with -N2?
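As a sanity check before comparing timings, it can help to confirm what the RTS was actually given. This is a minimal sketch (not part of the original benchmark) using `GHC.Conc` from base: `getNumCapabilities` reports the effective -N setting, and `getNumProcessors` reports how many processors the machine exposes (which counts hyperthreads, so it can exceed the number of physical cores).

```haskell
-- Sketch: print the RTS capability count (-N) next to the machine's
-- reported processor count, to spot oversubscription like -N4 on a
-- dual-core laptop. Compile with -threaded.
import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  caps  <- getNumCapabilities   -- what +RTS -N gave us
  procs <- getNumProcessors     -- logical CPUs, including hyperthreads
  putStrLn $ "capabilities (-N): " ++ show caps
  putStrLn $ "logical processors: " ++ show procs
```

Running the same binary with `+RTS -N2` versus `+RTS -N4` makes it easy to see which configuration each timing came from.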
On Sat, Mar 14, 2015 at 5:21 PM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
dense matrix product is not an algorithm that makes sense in repa's execution model,
Matrix multiplication is the first example in the first repa paper: http://benl.ouroborus.net/papers/repa/repa-icfp2010.pdf. Look at figures 2 and 7.
"we measured very good absolute speedup, ×7.2 for 8 cores, on multicore hardware"
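For reference, a parallel matrix product in repa is only a few lines. This is a minimal sketch, assuming the repa and repa-algorithms packages (`mmultP` from `Data.Array.Repa.Algorithms.Matrix` is the library's parallel multiply); it is not the exact code from the paper or from repa-examples.

```haskell
-- Sketch: multiply a small dense matrix by itself in parallel with repa.
-- Assumes the repa and repa-algorithms packages. Compile with -threaded
-- and run with e.g. `+RTS -N2` to use two capabilities.
import Data.Array.Repa                   (Array, DIM2, U, Z (..), (:.) (..))
import qualified Data.Array.Repa         as R
import Data.Array.Repa.Algorithms.Matrix (mmultP)

main :: IO ()
main = do
  let n = 256
      a = R.fromListUnboxed (Z :. n :. n)
            [fromIntegral i | i <- [1 .. n * n]] :: Array U DIM2 Double
  c <- mmultP a a   -- parallel blocked multiply across RTS capabilities
  s <- R.sumAllP c  -- parallel reduction; forces the result
  print s
```

The `P`-suffixed functions run in a monad precisely so that repa can schedule the work across capabilities, which is why the -N setting matters for this benchmark.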
Doing a quick experiment with 2 threads (my laptop doesn't have more cores):
$ cabal install repa-examples # I did not bother with `-fllvm` ...
$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204
elapsedTimeMS = 6491
$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204 +RTS -N2
elapsedTimeMS = 3393
This is with GHC 7.10.3 and repa-3.4.0.1 (with dependencies from http://www.stackage.org/snapshot/lts-3.22).