
here is the C:

#include <cblas.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size*size; ++ii) {
        double _dd = cblas_ddot(0, v1, size, v2, size);
    }
    free(v1);
    free(v2);
}

On Tue, Jun 17, 2008 at 9:00 PM, Anatoly Yakovenko wrote:
here is the C:
#include <cblas.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size*size; ++ii) {
        double _dd = cblas_ddot(0, v1, size, v2, size);
    }
    free(v1);
    free(v2);
}
Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-)

#include <cblas.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size*size; ++ii) {
        double _dd = cblas_ddot(0, v1, size, v2, size);
    }
    free(v1);
    free(v2);
}
Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-)
C doesn't work like that :). functions always get called. but i did
find a problem with my C code, i am incorrectly calling the dot
product function:

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double dd = 0.0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size; ++ii) {
        v1[ii] = 0.1;
        v2[ii] = 0.1;
    }
    for(ii = 0; ii < size*size; ++ii) {
        dd += cblas_ddot(size, v1, 0, v2, 0);
    }
    free(v1);
    free(v2);
    printf("%f\n", dd);
    return 0;
}

time ./testdot
10737418.240187

real    0m2.200s
user    0m2.190s
sys     0m0.010s

So C is about twice as fast. I can live with that.

On Wed, Jun 18, 2008 at 9:16 AM, Anatoly Yakovenko wrote:
C doesn't work like that :)
Yes it can. You would have to check the disassembly to be sure, but C compilers can, and do, perform dead code elimination.

AGL

--
Adam Langley  agl@imperialviolet.org  http://www.imperialviolet.org

Anatoly Yakovenko wrote:
#include <cblas.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size*size; ++ii) {
        double _dd = cblas_ddot(0, v1, size, v2, size);
    }
    free(v1);
    free(v2);
}

Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-)
C doesn't work like that :).
C compilers can do what they like ;)

GCC in particular is pretty good at removing dead code, including entire loops. However it shouldn't eliminate the call to cblas_ddot unless it thinks cblas_ddot has no side effects at all, which would be surprising unless it's inlined somehow.

Jules

On Wed, Jun 18, 2008 at 06:03:42PM +0100, Jules Bean wrote:
Anatoly Yakovenko wrote:
#include <cblas.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size*size; ++ii) {
        double _dd = cblas_ddot(0, v1, size, v2, size);
    }
    free(v1);
    free(v2);
}

Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-)
C doesn't work like that :).
C compilers can do what they like ;)
GCC in particular is pretty good at removing dead code, including entire loops. However it shouldn't eliminate the call to cblas_ddot unless it thinks cblas_ddot has no side effects at all, which would be surprising unless it's inlined somehow.
Or unless it's been annotated as pure, which it should be. David

On Wed, Jun 18, 2008 at 09:16:24AM -0700, Anatoly Yakovenko wrote:
#include <cblas.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size*size; ++ii) {
        double _dd = cblas_ddot(0, v1, size, v2, size);
    }
    free(v1);
    free(v2);
}
Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-)
C doesn't work like that :). functions always get called. but i did find a problem with my C code, i am incorrectly calling the dot product function:
See a recent article in lwn on pure and const functions to see how gcc is able to perform dead code elimination and CSE, provided it's given annotations on the relevant functions. I'd certainly hope that your blas library is properly annotated!
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int size = 1024;
    int ii = 0;
    double dd = 0.0;
    double* v1 = malloc(sizeof(double) * (size));
    double* v2 = malloc(sizeof(double) * (size));
    for(ii = 0; ii < size; ++ii) {
        v1[ii] = 0.1;
        v2[ii] = 0.1;
    }
    for(ii = 0; ii < size*size; ++ii) {
        dd += cblas_ddot(size, v1, 0, v2, 0);
    }
    free(v1);
    free(v2);
    printf("%f\n", dd);
    return 0;
}
time ./testdot
10737418.240187

real    0m2.200s
user    0m2.190s
sys     0m0.010s
So C is about twice as fast. I can live with that.
I suspect that it is your initialization that is the difference. For one thing, you've initialized the arrays to different values, and in your C code you've fused what are two separate loops in your Haskell code. So you've not only given the C compiler an easier loop to run (since you're initializing the array to a constant rather than to a sequence of numbers), but you've also manually optimized that initialization. In fact, this fusion could be precisely the factor of two.

Why not see what happens in Haskell if you create just one vector and dot it with itself? (of course, that'll also make the blas call faster, so you'll need to be careful in your interpretation of your results.)

David
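To make that experiment concrete, here is a minimal sketch of the single-vector variant. It reuses the newListVector and getDot calls from the blas package's Data.Vector.Dense.IO interface that appear in the benchmarks later in the thread; the 0.1 fill value and the sizes are placeholders, and the program is untested. As noted above, the single getDot call will also get faster, so the absolute times need careful interpretation.

module Main where

import Data.Vector.Dense.IO
import Control.Monad (foldM)

main :: IO ()
main = do
    let size  = 1024 :: Int
        times = size * size
    -- one vector, dotted with itself, so the Haskell initialization
    -- cost matches the constant-fill initialization in the C version
    v <- newListVector size (replicate size 0.1) :: IO (IOVector Int Double)
    dd <- foldM (\acc _ -> do r <- v `getDot` v
                              -- force the accumulator so the sum is not built lazily
                              return $! acc + r)
                (0.0 :: Double) [1 .. times]
    print dd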

I suspect that it is your initialization that is the difference. For one thing, you've initialized the arrays to different values, and in your C code you've fused what are two separate loops in your Haskell code. So you've not only given the C compiler an easier loop to run (since you're initializing the array to a constant rather than to a sequence of numbers), but you've also manually optimized that initialization. In fact, this fusion could be precisely the factor of two.

Why not see what happens in Haskell if you create just one vector and dot it with itself? (of course, that'll also make the blas call faster, so you'll need to be careful in your interpretation of your results.)
The difference can't be in the initialization. I am calling the dot
product a million times, the malloc and init in both cases are
insignificant. Also, "fusing" the two loops in C probably won't help;
if anything, having each loop run separately is likely to be faster and
result in fewer cache misses.

In this case, i am using vectors of size 10 only, and calling the loop
10 million times; haskell is far, far slower, about 35 times. That's
pretty crappy.
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
    let size = 10
    let times = 10*1000*1000
    v1::IOVector Int Double <- newListVector size $ replicate size 0.1
    v2::IOVector Int Double <- newListVector size $ replicate size 0.1
    sum <- foldM (\ ii zz -> do
                rv <- v1 `getDot` v2
                return $ zz + rv
            ) 0.0 [0..times]
    print $ sum
$ ghc --make htestdot.hs
$ time ./htestdot
1.00000001e7
real 0m17.328s
user 0m17.320s
sys 0m0.010s
$ cat testdot.c
#include

On Friday 27 June 2008, Anatoly Yakovenko wrote:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
    -fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
    let size = 10
    let times = 10*1000*1000
    v1::IOVector Int Double <- newListVector size $ replicate size 0.1
    v2::IOVector Int Double <- newListVector size $ replicate size 0.1
    sum <- foldM (\ ii zz -> do
                rv <- v1 `getDot` v2
                return $ zz + rv
            ) 0.0 [0..times]
    print $ sum
Hackage is down for the time being, so I can't install blas and look at the core for your program. However, there are still some reasons why this code would be slow.

For instance, a brief experiment seems to indicate that foldM is not a good consumer in the foldr/build sense, so no deforestation occurs. Your program is iterating over a 10-million element lazy list. That's going to add overhead. I wrote a simple test program which just adds 0.1 in each iteration:

---- snip ----

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Control.Monad

main = do
    let times = 10*1000*1000
    sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times]
    -- sum <- foo 0 times 0.0
    print $ sum

foo :: Int -> Int -> Double -> IO Double
foo k m !zz | k <= m    = foo (k+1) m (zz + 0.1)
            | otherwise = return zz

---- snip ----

With foldM, it takes 2.5 seconds on my machine. If you comment that line, and use foo instead, it takes around .1 seconds. So that's a factor of what, 250? That loop allows for a lot more unboxing, which allows much better code to be generated.

When Hackage comes back online, I'll take a look at your code, and see if I can make it run faster, but you might want to try it yourself in the meantime. Strictifying the addition of the accumulator is probably a good idea, for instance.

Cheers,
-- Dan

i get the same crappy performance with:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
    let size = 10
    let times = 10*1000*1000
    v1::IOVector Int Double <- newListVector size $ replicate size 0.1
    v2::IOVector Int Double <- newListVector size $ replicate size 0.1
    replicateM_ times $ v1 `getDot` v2
On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel wrote:
On Friday 27 June 2008, Anatoly Yakovenko wrote:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
    -fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
    let size = 10
    let times = 10*1000*1000
    v1::IOVector Int Double <- newListVector size $ replicate size 0.1
    v2::IOVector Int Double <- newListVector size $ replicate size 0.1
    sum <- foldM (\ ii zz -> do
                rv <- v1 `getDot` v2
                return $ zz + rv
            ) 0.0 [0..times]
    print $ sum
Hackage is down for the time being, so I can't install blas and look at the core for your program. However, there are still some reasons why this code would be slow.
For instance, a brief experiment seems to indicate that foldM is not a good consumer in the foldr/build sense, so no deforestation occurs. Your program is iterating over a 10-million element lazy list. That's going to add overhead. I wrote a simple test program which just adds 0.1 in each iteration:
---- snip ----
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import Control.Monad
main = do
    let times = 10*1000*1000
    sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times]
    -- sum <- foo 0 times 0.0
    print $ sum

foo :: Int -> Int -> Double -> IO Double
foo k m !zz | k <= m    = foo (k+1) m (zz + 0.1)
            | otherwise = return zz
---- snip ----
With foldM, it takes 2.5 seconds on my machine. If you comment that line, and use foo instead, it takes around .1 seconds. So that's a factor of what, 250? That loop allows for a lot more unboxing, which allows much better code to be generated.
When Hackage comes back online, I'll take a look at your code, and see if I can make it run faster, but you might want to try it yourself in the meantime. Strictifying the addition of the accumulator is probably a good idea, for instance.
Cheers, -- Dan

aeyakovenko:
i get the same crappy performance with:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
    -fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
    let size = 10
    let times = 10*1000*1000
    v1::IOVector Int Double <- newListVector size $ replicate size 0.1
    v2::IOVector Int Double <- newListVector size $ replicate size 0.1
    replicateM_ times $ v1 `getDot` v2
replicateM_ is using a list underneath for control as well,

    replicateM n x = sequence (replicate n x)

Try writing a simple recursive loop, as Dan suggested. No list node forcing overhead, so in a very tight loop you'll just want the index in a register. See here for more examples of tight register loops,

    http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16#fast

In general, if you're chasing C performance for a loop, your best bet is to write a loop first. Then later see if you can get the same kind of code from higher order, lazy, monadic functions.

-- Don
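As a concrete illustration of that advice, here is a minimal sketch of the same benchmark with a hand-written tail-recursive loop and a strict accumulator, in the style of Dan's foo. It assumes the same newListVector/getDot interface from the blas package used above, and it is untested:

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Vector.Dense.IO

main :: IO ()
main = do
    let size  = 10
        times = 10 * 1000 * 1000 :: Int
    v1 <- newListVector size (replicate size 0.1) :: IO (IOVector Int Double)
    v2 <- newListVector size (replicate size 0.1) :: IO (IOVector Int Double)
    -- explicit loop: index and accumulator are strict, no intermediate list
    let go !k !acc
            | k >= times = return acc
            | otherwise  = do
                r <- v1 `getDot` v2
                go (k + 1) (acc + r)
    dd <- go 0 (0.0 :: Double)
    print dd

Whether this matches the C numbers still depends on what getDot itself does per call, which is the library issue discussed in the next message.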

so I had a look at the code. The loops are all fine. replicateM_ isn't a problem, but getDot is decidedly non trivial. Lots of pattern matching on different vector forms, and to top it off ffi calls.

With some inlining in the blas library I was able to cut a few seconds off the running time, but getDot looks to be fundamentally a bit complicated in the current implementation. I wonder if you'll get different results with hmatrix?

Anyway, this is a library issue. Better take it up with Patrick. Pass on to the library author the C code, the Haskell you think should be compiled identically.

-- Don

aeyakovenko:
i get the same crappy performance with:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
    -fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
    let size = 10
    let times = 10*1000*1000
    v1::IOVector Int Double <- newListVector size $ replicate size 0.1
    v2::IOVector Int Double <- newListVector size $ replicate size 0.1
    replicateM_ times $ v1 `getDot` v2
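One way to see how much of the remaining gap is the library's dispatch layer rather than the foreign call itself is to bind cblas_ddot directly. The sketch below assumes the standard CBLAS prototype, double cblas_ddot(int n, const double *x, int incx, const double *y, int incy), has to be linked against a CBLAS implementation, and is untested:

{-# LANGUAGE ForeignFunctionInterface, BangPatterns #-}
module Main where

import Foreign
import Foreign.C.Types

-- assumed standard CBLAS signature; 'unsafe' skips the safe-call
-- overhead, which is fine for a short, non-blocking C function
foreign import ccall unsafe "cblas_ddot"
    c_ddot :: CInt -> Ptr CDouble -> CInt -> Ptr CDouble -> CInt -> IO CDouble

main :: IO ()
main = do
    let size  = 10
        times = 10 * 1000 * 1000 :: Int
    allocaArray size $ \p1 ->
      allocaArray size $ \p2 -> do
        pokeArray p1 (replicate size 0.1)
        pokeArray p2 (replicate size 0.1)
        -- tight loop over the raw foreign call, strict index and accumulator
        let go !k !acc
                | k >= times = return acc
                | otherwise  = do
                    r <- c_ddot (fromIntegral size) p1 1 p2 1
                    go (k + 1) (acc + realToFrac r)
        print =<< go 0 (0.0 :: Double)

If this loop alone gets close to the C time, the difference really is in getDot's dispatch rather than in GHC's loop code.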

On 19 Jun 2008, at 4:16 am, Anatoly Yakovenko wrote:
C doesn't work like that :). functions always get called.
Not true. A C compiler must produce the same *effect* as if the function had been called, but if by some means the compiler knows that the function has no effect, it is entitled to skip the call.

In particular, the C compiler I normally use offers these pragmas, amongst others:

    #pragma does_not_write_global_data (funcname [, funcname])
    #pragma no_side_effect (funcname [, funcname])

So with a declaration like

    extern double cblas_ddot(
        int, double const *, int, double const *, int);
    #pragma no_side_effect (cblas_ddot)
the compiler would be completely within its rights to discard any call to cblas_ddot() whose result was not used. (As it happens, it didn't, but it would have been allowed to.)

If using gcc,

    extern double cblas_ddot( ... as before ...) __attribute__ ((const));

seems to have the same effect, certainly the test case I tried did in fact completely eliminate a call to cblas_ddot() when so declared.

Since the malloc() results pointed to uninitialised memory, the C compiler was entitled to do anything it pleased anyway.
participants (8)

- Adam Langley
- Anatoly Yakovenko
- Bryan O'Sullivan
- Dan Doel
- David Roundy
- Don Stewart
- Jules Bean
- Richard A. O'Keefe