
I suspect that it is your initialization that is the difference. For one thing, you've initialized the arrays to different values, and in your C code you've fused what are two separate loops in your Haskell code. So you've not only given the C compiler an easier loop to run (since you're initializing the array to a constant rather than to a sequence of numbers), but you've also manually optimized that initialization. In fact, this fusion could be precisely the factor of two. Why not see what happens in Haskell if you create just one vector and dot it with itself? (of course, that'll also make the blas call faster, so you'll need to be careful in your interpretation of your results.)
The difference cant be in the initialization. I am calling the dot
product a million times, the malloc and init in both cases are
insignificant. Also, "fusing" the two loops in C probably wont help,
if anything having each loop run separate is likely to be faster and
result in less cache misses.
In this case, i am using vectors of size 10 only, and calling the loop
10 million times, haskell is far far slower, or 35 times. That's
pretty crappy.
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
let size = 10
let times = 10*1000*1000
v1::IOVector Int Double <- newListVector size $ replicate size 0.1
v2::IOVector Int Double <- newListVector size $ replicate size 0.1
sum <- foldM (\ ii zz -> do
rv <- v1 `getDot` v2
return $ zz + rv
) 0.0 [0..times]
print $ sum
$ ghc --make htestdot.hs
$ time ./htestdot
1.00000001e7
real 0m17.328s
user 0m17.320s
sys 0m0.010
$ cat testdot.c
#include