
so I had a look at the code. The loops are all fine. replicateM_ isn't a problem, but getDot is decidedly non trivial. Lots of pattern matching on different vector forms, and to top it off ffi calls. With some inlining in the blas library I was able to cut a few seconds off the running time, but getDot looks to be fundamentally a bit complicated in the current implementation. I wonder if you'll get different results with hmatrix? Anyway, this is a library issue. Better take it up with Patrick. Pass on to the library author the C code, the Haskell you think should be compiled identically. -- Don aeyakovenko:
i get the same crappy performance with:
$ cat htestdot.hs {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields -fglasgow-exts -fbang-patterns -lcblas#-} module Main where
import Data.Vector.Dense.IO import Control.Monad
main = do let size = 10 let times = 10*1000*1000 v1::IOVector Int Double <- newListVector size $ replicate size 0.1 v2::IOVector Int Double <- newListVector size $ replicate size 0.1 replicateM_ times $ v1 `getDot` v2
On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel
wrote: On Friday 27 June 2008, Anatoly Yakovenko wrote:
$ cat htestdot.hs {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields -fglasgow-exts -fbang-patterns -lcblas#-} module Main where
import Data.Vector.Dense.IO import Control.Monad
main = do let size = 10 let times = 10*1000*1000 v1::IOVector Int Double <- newListVector size $ replicate size 0.1 v2::IOVector Int Double <- newListVector size $ replicate size 0.1 sum <- foldM (\ ii zz -> do rv <- v1 `getDot` v2 return $ zz + rv ) 0.0 [0..times] print $ sum
Hackage is down for the time being, so I can't install blas and look at the core for your program. However, there are still some reasons why this code would be slow.
For instance, a brief experiment seems to indicate that foldM is not a good consumer in the foldr/build sense, so no deforestation occurs. Your program is iterating over a 10-million element lazy list. That's going to add overhead. I wrote a simple test program which just adds 0.1 in each iteration:
---- snip ----
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import Control.Monad
main = do let times = 10*1000*1000 sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times] -- sum <- foo 0 times 0.0 print $ sum
foo :: Int -> Int -> Double -> IO Double foo k m !zz | k <= m = foo (k+1) m (zz + 0.1) | otherwise = return zz
---- snip ----
With foldM, it takes 2.5 seconds on my machine. If you comment that line, and use foo instead, it takes around .1 seconds. So that's a factor of what, 250? That loop allows for a lot more unboxing, which allows much better code to be generated.
When Hackage comes back online, I'll take a look at your code, and see if I can make it run faster, but you might want to try it yourself in the time being. Strictifying the addition of the accumulator is probably a good idea, for instance.
Cheers, -- Dan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe