Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

28 Jun 2008

      ...
I suspect that it is your initialization that is the difference.  For
one thing, you've initialized the arrays to different values, and in
your C code you've fused what are two separate loops in your Haskell
code.  So you've not only given the C compiler an easier loop to run
(since you're initializing the array to a constant rather than to a
sequence of numbers), but you've also manually optimized that
initialization.  In fact, this fusion could be precisely the factor of
two.  Why not see what happens in Haskell if you create just one
vector and dot it with itself? (of course, that'll also make the blas
call faster, so you'll need to be careful in your interpretation of
your results.)

The difference can't be in the initialization.  I am calling the dot
product a million times, so the malloc and init in both cases are
insignificant.  Also, "fusing" the two loops in C probably won't help;
if anything, running each loop separately is likely to be faster and
result in fewer cache misses.
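
To be concrete about what "fusing" means here, a minimal sketch of the
two forms of initialization.  The function names init_separate and
init_fused are made up for illustration and appear in neither test
program:

```c
#include <stddef.h>

/* Two separate loops: each array gets its own pass. */
void init_separate(double *v1, double *v2, size_t n, double value) {
    for (size_t i = 0; i < n; ++i) {
        v1[i] = value;
    }
    for (size_t i = 0; i < n; ++i) {
        v2[i] = value;
    }
}

/* One fused loop: both arrays are written in the same pass. */
void init_fused(double *v1, double *v2, size_t n, double value) {
    for (size_t i = 0; i < n; ++i) {
        v1[i] = value;
        v2[i] = value;
    }
}
```

Both leave the arrays in the same state.  Either one runs once over 10
elements, so next to the repeated dot products its cost should not
matter.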

In this case, I am using vectors of size 10 only and calling the loop
10 million times.  Haskell is far, far slower: about 35 times
(17.3s vs 0.49s).  That's pretty crappy.

$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   sum <- foldM (\ ii zz -> do
      rv <- v1 `getDot` v2
      return $ zz + rv
      ) 0.0 [0..times]
   print $ sum

$ ghc --make htestdot.hs
$ time ./htestdot
1.00000001e7

real    0m17.328s
user    0m17.320s
sys     0m0.010s

$ cat testdot.c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main() {
   int size = 10;
   int times = 10*1000*1000;
   int ii = 0;
   double dd = 0.0;
   double* v1 = malloc(sizeof(double) * (size));
   double* v2 = malloc(sizeof(double) * (size));
   for(ii = 0; ii < size; ++ii) {
      v1[ii] = 0.1;
   }
   for(ii = 0; ii < size; ++ii) {
      v2[ii] = 0.1;
   }
   for(ii = 0; ii < times; ++ii) {
      dd += cblas_ddot(size, v1, 1, v2, 1);
   }
   free(v1);
   free(v2);
   printf("%f\n", dd);
   return 0;
}

$ gcc -O2 testdot.c -lcblas -o testdot
$ time ./testdot
999999.999839

real    0m0.491s
user    0m0.480s
sys     0m0.020s
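
As a sanity check on the expected value (not on the timing), here is a
plain C loop that needs no BLAS.  The names ref_dot and repeated_dot
are hypothetical; ref_dot only stands in for cblas_ddot with unit
stride:

```c
#include <stddef.h>

/* Plain dot product with unit stride, standing in for cblas_ddot. */
double ref_dot(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}

/* Sum of `times` identical dot products, like the loop in testdot.c. */
double repeated_dot(const double *x, const double *y, size_t n, long times) {
    double total = 0.0;
    for (long t = 0; t < times; ++t) {
        total += ref_dot(x, y, n);
    }
    return total;
}
```

With 10 elements of 0.1 each dot product is about 0.1, so 10 million of
them should sum to roughly 1e6, which agrees with the C output.  The
Haskell run printed 1.00000001e7, about ten times that, so the two
programs may not be accumulating the same quantity.  One possible
explanation: foldM passes the accumulator as the first argument of the
lambda, so in `\ ii zz` the name zz is bound to the list element rather
than to the running sum.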

Just to make sure that the fold isn't causing the slowdown, I reverted
the Haskell program to use mapM_; I still got almost the same
performance:

$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   mapM_ (\ ii -> do v1 `getDot` v2) [0..times]

$ ghc --make htestdot

$ time ./htestdot

real    0m15.660s
user    0m15.630s
sys     0m0.030s

This is what the profiler has to say:

 $ cat htestdot.prof
        Fri Jun 27 18:06 2008 Time and Allocation Profiling Report  (Final)

           htestdot +RTS -p -RTS

        total time  =       22.00 secs   (1100 ticks @ 20 ms)
        total alloc = 3,320,010,716 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

main                           Main                 100.0  100.0

                                           individual    inherited
COST CENTRE  MODULE       no.  entries   %time %alloc   %time %alloc

MAIN         MAIN           1        0     0.0    0.0   100.0  100.0
 main        Main         222        1    93.6   88.0    93.6   88.0
 CAF         Main         216        5     0.0    0.0     6.4   12.0
  main       Main         223        0     6.4   12.0     6.4   12.0
 CAF         GHC.Handle   168        3     0.0    0.0     0.0    0.0

Anatoly Yakovenko