
I've been trying to get some speed out of the accelerate library today. What I want to implement is something as simple as a matrix multiply, and I'd like it to be fast and memory efficient. Given the equation C = AB, where A is n×r, B is r×m and C is n×m, it seems reasonable to allocate three arrays on the GPU with n*r, r*m and n*m elements respectively. Does anyone know how to achieve this with accelerate?

My first thought was to use the `generate` function to create the new C array, but I couldn't wrap my head around all the fancy type features that pop up when you want to return an array C whose dimensions depend on the dimensions of its inputs, A and B. I've searched around a bit and found this example implementation [1], but it is just as slow as a simple sequential algorithm in C.

I would be very thankful for any advice on working with accelerate! Here's a snippet of what I have tried to write. There are several errors in it; maybe I'm approaching the problem from the wrong angle.

```haskell
matMul' arr brr =
  let dotProd shp =
        let (Z :. rowsA :. _) = unlift (shape arr) :: (Z :. Exp Int :. Exp Int)
            (Z :. _ :. colsB) = unlift (shape brr) :: (Z :. Exp Int :. Exp Int)
            (Z :. i :. j)     = unlift shp         :: (Z :. Exp Int :. Exp Int)
            rs = lift (Z :. All :.) (unlift i)
            cs = (lift (Z :.) (unlift j)) (:. All)
        in the $ A.fold1All (+)
               $ A.zipWith (+) (flatten (slice arr rs)) (flatten (slice brr cs))
  in A.generate (lift (Z :. rowsA :. colsB)) dotProd
```

[1]: http://www.mail-archive.com/haskell-cafe@haskell.org/msg102782.html
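A likely source of the trouble with `generate`: the function passed to `generate` must produce an `Exp`, and calling `the $ A.fold1All ...` inside it nests a parallel `Acc` computation inside a scalar one, which, as far as I know, Accelerate does not support. One commonly suggested alternative, sketched here under the assumption of accelerate 1.x, avoids `generate` entirely: replicate A along a new column axis and the transpose of B along a new row axis so that both become n×m×r arrays, multiply them elementwise, and `fold` away the innermost (length r) dimension. The output shape is still derived from the input shapes via `unlift (shape ...)`, so C comes out n×m without extra type machinery. The name `matMul` and the test matrices are illustrative, and the reference interpreter (`Data.Array.Accelerate.Interpreter.run`) is used only so the example runs anywhere; for GPU speed you would swap in a GPU backend's `run` (for example the one from accelerate-llvm-ptx).

```haskell
{-# LANGUAGE TypeOperators #-}
module Main (main) where

import Data.Array.Accelerate (Acc, All (..), Exp, Matrix, Z (..), (:.) (..))
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.Interpreter as I

-- C = A * B for A :: n x r and B :: r x m, giving C :: n x m.
-- The output shape is computed from the input shapes inside the Acc program.
matMul :: Acc (Matrix Double) -> Acc (Matrix Double) -> Acc (Matrix Double)
matMul arr brr =
  -- Sum over the innermost (length r) dimension of the n x m x r products.
  A.fold (+) 0 (A.zipWith (*) arrRepl brrRepl)
  where
    -- n (rows of A) and m (columns of B) as Exp Int values.
    Z :. rowsA :. _ = A.unlift (A.shape arr) :: Z :. Exp Int :. Exp Int
    Z :. _ :. colsB = A.unlift (A.shape brr) :: Z :. Exp Int :. Exp Int

    -- arrRepl at (i, j, k) is A at (i, k): each row of A copied across m columns.
    arrRepl = A.replicate (A.lift (Z :. All :. colsB :. All)) arr
    -- brrRepl at (i, j, k) is B at (k, j): B^T (m x r) copied across n rows.
    brrRepl = A.replicate (A.lift (Z :. rowsA :. All :. All)) (A.transpose brr)

main :: IO ()
main = do
  let a = A.fromList (Z :. 2 :. 3) [1 .. 6] :: Matrix Double  -- [[1,2,3],[4,5,6]]
      b = A.fromList (Z :. 3 :. 2) [7 .. 12] :: Matrix Double -- [[7,8],[9,10],[11,12]]
      c = I.run (matMul (A.use a) (A.use b))
  print (A.toList c) -- [58.0,64.0,139.0,154.0]
```

The n×m×r intermediate is only conceptual if Accelerate's array fusion merges the `replicate` and `zipWith` producers into the `fold`, which is the kind of producer/consumer pattern fusion is meant to handle; whether this reaches the three-array memory footprint and the speed you are after depends on the backend and is worth measuring.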