To avoid any confusion, this was a reply to the following email:

On Fri, Mar 13, 2015 at 6:23 PM, Anatoly Yakovenko <aeyakovenko@gmail.com> wrote:
https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8


so I am seeing results with -N4 that are basically no better than
sequential computation on my MacBook for the matrix multiply
algorithm.  Any idea why?

Thanks,
Anatoly

On Thu, Jan 14, 2016 at 8:19 PM, Thomas Miedema <thomasmiedema@gmail.com> wrote:
Anatoly: I also ran your benchmark, and cannot reproduce your findings.

Note that GHC does not make effective use of hyperthreads (https://ghc.haskell.org/trac/ghc/ticket/9221#comment:12), so don't use -N4 when you have only a dual-core machine. Maybe that's why you were getting bad results? I also noticed a `NaN` in one of your timing results; I don't know how that is possible, or whether it affected your results. Could you try running your benchmark again, but this time with -N2?
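One way to sanity-check the -N setting from inside a program is to compare the RTS capability count with what the machine reports. A minimal sketch using only base (note that getNumProcessors counts logical CPUs, so hyperthreads are included in that number):

```haskell
import Control.Concurrent (getNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Compiled with -threaded and run with e.g. +RTS -N2, this prints
-- how many capabilities the RTS was given versus how many logical
-- CPUs the OS reports (hyperthreads included).
main :: IO ()
main = do
  caps  <- getNumCapabilities
  procs <- getNumProcessors
  putStrLn $ "RTS capabilities: " ++ show caps
          ++ ", logical CPUs: " ++ show procs
```

On a dual-core machine with hyperthreading, logical CPUs will report 4, but per the ticket above you still want -N2.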

On Sat, Mar 14, 2015 at 5:21 PM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
dense matrix product is not an algorithm that makes sense in repa's execution model.

Matrix multiplication is the first example in the first repa paper: http://benl.ouroborus.net/papers/repa/repa-icfp2010.pdf. Look at figures 2 and 7.

    "we measured very good absolute speedup, ×7.2 for 8 cores, on multicore hardware"
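For reference, the algorithm under discussion is the textbook triple product, c[i][j] = sum over k of a[i][k] * b[k][j]. A minimal sequential sketch in plain Haskell (list-based and purely illustrative; the repa version in the paper works over parallel unboxed arrays instead):

```haskell
import Data.List (transpose)

-- Textbook dense matrix product: each result cell is the dot product
-- of a row of `a` with a column of `b`. O(n^3) for n-by-n inputs.
mmult :: [[Double]] -> [[Double]] -> [[Double]]
mmult a b = [[sum (zipWith (*) row col) | col <- transpose b] | row <- a]

main :: IO ()
main = print (mmult [[1,2],[3,4]] [[5,6],[7,8]])  -- [[19.0,22.0],[43.0,50.0]]
```

Each of the n^2 dot products is independent of the others, which is exactly the data parallelism repa exploits.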

Doing a quick experiment with 2 threads (my laptop doesn't have more cores):

$ cabal install repa-examples    # I did not bother with `-fllvm`
...

$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204
elapsedTimeMS   = 6491

$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204 +RTS -N2
elapsedTimeMS   = 3393

That is a ~1.9x speedup on two threads. This is with GHC 7.10.3 and repa-3.4.0.1 (and dependencies from http://www.stackage.org/snapshot/lts-3.22).