Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

28 Jun 2008

      so I had a look at the code. The loops are all fine. replicateM_ isn't 
a problem, but getDot is decidedly non trivial. Lots of pattern matching
on different vector forms, and to top it off ffi calls.

With some inlining in the blas library I was able to cut a few seconds
off the running time, but getDot looks to be fundamentally a bit
complicated in the current implementation.

I wonder if you'll get different results with hmatrix?

Anyway, this is a library issue. Better take it up with Patrick.
Pass on to the library author the C code, the Haskell you think should
be compiled identically.

-- Don

aeyakovenko:
...
i get the same crappy performance with:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   replicateM_ times $ v1 `getDot` v2
On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel  wrote:
...
On Friday 27 June 2008, Anatoly Yakovenko wrote:
...
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   sum <- foldM (\ ii zz -> do
      rv <- v1 `getDot` v2
      return $ zz + rv
      ) 0.0 [0..times]
   print $ sum
Hackage is down for the time being, so I can't install blas and look at the
core for your program. However, there are still some reasons why this code
would be slow.
For instance, a brief experiment seems to indicate that foldM is not a good
consumer in the foldr/build sense, so no deforestation occurs. Your program
is iterating over a 10-million element lazy list. That's going to add
overhead. I wrote a simple test program which just adds 0.1 in each
iteration:
---- snip ----
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import Control.Monad
main = do
 let times = 10*1000*1000
 sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times]
--  sum <- foo 0 times 0.0
 print $ sum
foo :: Int -> Int -> Double -> IO Double
foo k m !zz
 | k <= m     = foo (k+1) m (zz + 0.1)
 | otherwise = return zz
---- snip ----
With foldM, it takes 2.5 seconds on my machine. If you comment that line, and
use foo instead, it takes around .1 seconds. So that's a factor of what, 250?
That loop allows for a lot more unboxing, which allows much better code to be
generated.
When Hackage comes back online, I'll take a look at your code, and see if I
can make it run faster, but you might want to try it yourself in the time
being. Strictifying the addition of the accumulator is probably a good idea,
for instance.
Cheers,
-- Dan
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

Don Stewart