dense matrix product is not an algorithm that fits repa's execution model:
in a square matrix multiply of two N x N matrices, each result entry depends on 2N values in total (a row of N values from one input and a column of N values from the other).
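
to make the dependency count concrete, here is a minimal sketch in plain C (not repa code). the names `matmul_entry` and `matmul_naive` are made up for illustration, and the matrices are assumed to be square, row-major arrays of doubles. each call to `matmul_entry` reads N values from a row of `a` and N values from a column of `b`.

```c
#include <stddef.h>

/* Compute one entry c[i][j] of the product of two n x n row-major matrices.
   The loop reads n values from row i of a and n values from column j of b,
   so the entry depends on 2n input values in total. */
double matmul_entry(const double *a, const double *b, size_t n,
                    size_t i, size_t j)
{
    double acc = 0.0;
    for (size_t k = 0; k < n; ++k) {
        acc += a[i * n + k] * b[k * n + j];
    }
    return acc;
}

/* Naive product: every output entry is computed independently of the others. */
void matmul_naive(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            c[i * n + j] = matmul_entry(a, b, n, i, j);
        }
    }
}
```

computing every entry independently like this is easy to parallelize, but neighbouring entries re-read the same rows and columns without sharing any of that work.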
even then, that's actually the wrong way to parallelize dense matrix product! it's worth reading the papers about GotoBLAS and the more recent BLIS project. a high-performance dense matrix multiply winds up needing some nested array parallelism with mutable updates to share sub-computations efficiently!
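
as a rough illustration of what "mutable updates" means here, below is a simplified cache-blocking sketch in plain C. this is not the actual GotoBLAS or BLIS algorithm (those also pack panels of the inputs and use hand-tuned micro-kernels); the function name `matmul_blocked` and the tile size `BS` are hypothetical, and the matrices are assumed to be square, row-major arrays of doubles with `c` zeroed by the caller.

```c
#include <stddef.h>

#define BS 2 /* hypothetical tile size; real libraries tune this per cache level */

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Accumulate c += a * b for n x n row-major matrices, one tile at a time.
   The caller must zero c beforehand. */
void matmul_blocked(const double *a, const double *b, double *c, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BS) {
        for (size_t kk = 0; kk < n; kk += BS) {
            for (size_t jj = 0; jj < n; jj += BS) {
                size_t i_end = min_sz(ii + BS, n);
                size_t k_end = min_sz(kk + BS, n);
                size_t j_end = min_sz(jj + BS, n);
                /* Multiply one tile of a by one tile of b and add the
                   partial result into the matching tile of c in place. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

the point of the sketch: each tile of `c` is updated in place several times (once per `kk` step), and each loaded tile of `a` and `b` is reused across many output entries. different `ii` tiles write to disjoint rows of `c`, so that outer loop is the one you could run in parallel, while the `kk` accumulation into a given tile stays a sequence of mutable updates.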