Performance question

haskell＠kudling.de

26 Feb 2009 26 Feb '09

9:53 a.m.

Hi, i have compared a C++ implementation with a Haskell implementation of the Monte Carlo Pi approximation: http://lennart.kudling.de/haskellPi/ The Haskell version is 100 times slower and i wonder whether i do something obvious wrong. Profiling says that the majority of the time is spend in "main". But i have no idea where. Can someone give me a hint? Thanks, Lenny individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 1 0 0.0 0.0 100.0 100.0 main Main 254 1 88.1 90.8 100.0 100.0 monteCarloPi Main 255 1 0.6 1.1 11.9 9.2 pairs Main 257 10000000 0.7 1.4 0.7 1.4 countHits Main 256 10000001 4.2 2.9 10.6 6.7 accumulateHit Main 258 27852236 3.0 2.3 6.4 3.8 isInCircle Main 259 30000000 3.3 1.5 3.3 1.5 CAF:lit_r1A7 Main 248 1 0.0 0.0 0.0 0.0 isInCircle Main 260 0 0.0 0.0 0.0 0.0

Show replies by date

Thomas van Noort

26 Feb 26 Feb

10 a.m.

First thing I noticed, how about removing the sqrt in isInCircle: isInCircle :: (Floating a, Ord a) => (a,a) -> Bool isInCircle (x,y) = x*x + y*y <= 1.0 But you can remove sqrt from the C++ implementation as well, so it only improves the relative performance if the C++ implementation of sqrt is worse than its Haskell counterpart. Regards, Thomas haskell@kudling.de wrote:

...

Hi,

i have compared a C++ implementation with a Haskell implementation of the Monte Carlo Pi approximation:

http://lennart.kudling.de/haskellPi/

The Haskell version is 100 times slower and i wonder whether i do something obvious wrong.

Profiling says that the majority of the time is spend in "main". But i have no idea where.

Can someone give me a hint?

Thanks, Lenny

individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 main Main 254 1 88.1 90.8 100.0 100.0 monteCarloPi Main 255 1 0.6 1.1 11.9 9.2 pairs Main 257 10000000 0.7 1.4 0.7 1.4 countHits Main 256 10000001 4.2 2.9 10.6 6.7 accumulateHit Main 258 27852236 3.0 2.3 6.4 3.8 isInCircle Main 259 30000000 3.3 1.5 3.3 1.5 CAF:lit_r1A7 Main 248 1 0.0 0.0 0.0 0.0 isInCircle Main 260 0 0.0 0.0 0.0 0.0 _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Thomas van Noort

10:02 a.m.

...

But you can remove sqrt from the C++ implementation as well, so it only improves the relative performance if the C++ implementation of sqrt is worse than its Haskell counterpart.

Oops, of course I mean, you only improve if Haskell's implementation is worse than C++'s implementation :)

Ben Lippmeier

10:09 a.m.

Yep, this program will crawl. You can get reasonable numeric performance out of GHC, but you need to work at it. There is some advice in the GHC manual at http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html . The first thing I would do is replace your isInCircle :: (Floating a, Ord a) => (a,a) -> Bool with isInCircle :: (Double, Double) -> Bool Ben. On 26/02/2009, at 8:53 PM, haskell@kudling.de wrote:

...

Hi,

i have compared a C++ implementation with a Haskell implementation of the Monte Carlo Pi approximation:

http://lennart.kudling.de/haskellPi/

The Haskell version is 100 times slower and i wonder whether i do something obvious wrong.

Profiling says that the majority of the time is spend in "main". But i have no idea where.

Can someone give me a hint?

Thanks, Lenny

Ketil Malde

10:25 a.m.

haskell@kudling.de writes:

...

Profiling says that the majority of the time is spend in "main". But i have no idea where. Can someone give me a hint?

Yes. Lots of them, but somehow, I suspect nobody tried your code.

...

COST CENTRE MODULE no. entries %time %alloc %time %alloc

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 main Main 254 1 88.1 90.8 100.0 100.0 monteCarloPi Main 255 1 0.6 1.1 11.9 9.2 pairs Main 257 10000000 0.7 1.4 0.7 1.4 countHits Main 256 10000001 4.2 2.9 10.6 6.7 accumulateHit Main 258 27852236 3.0 2.3 6.4 3.8 isInCircle Main 259 30000000 3.3 1.5 3.3 1.5 CAF:lit_r1A7 Main 248 1 0.0 0.0 0.0 0.0 isInCircle Main 260 0 0.0 0.0 0.0 0.0

Thomas van Noort:

...

First thing I noticed, how about removing the sqrt in isInCircle:

I did this. The result was the same - the sqrt doesn't matter, and perhaps it even gets optimized away?

...

The first thing I would do is replace your isInCircle :: (Floating a, Ord a) => (a,a) -> Bool with isInCircle :: (Double, Double) -> Bool

Then I did that. The result was still the same - I bet GHC is smart enought to specialize this on its own. The intresting thing about your profile is that all the time is spent in 'main'. Now, why would that be? I refactored a bit, and in partiuclar wrote this function: mkRandoms :: IO [Double] mkRandoms = do randomNumberGenerator <- getStdGen return $ randoms randomNumberGenerator Here's the new profile (10^6 iterations): COST CENTRE MODULE %time %alloc mkRandoms Main 96.1 96.0 accumulateHit Main 1.6 1.7 pairs Main 1.2 1.3 monteCarloPi Main 0.2 1.0 So it seems we're just tremendously lousy at generating random Doubles. Eliminating this bottleneck, the remaining 3.9% would put it at about 2x the C++ performance. -k -- If I haven't seen further, it is by standing in the footprints of giants

Achim Schneider

6:37 p.m.

Ketil Malde wrote:

...

So it seems we're just tremendously lousy at generating random Doubles.

We had this a while ago, and Don was kind enough to post some bindings to dSFMT: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/mersenne-random-p... which could have a purer interface, but then the upstream code could be less blatantly impure. Anyway, that's about as fast as you can get. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Benedikt Huber

10:42 a.m.

haskell@kudling.de schrieb:

...

Hi,

i have compared a C++ implementation with a Haskell implementation of the Monte Carlo Pi approximation:

http://lennart.kudling.de/haskellPi/

The Haskell version is 100 times slower and i wonder whether i do something obvious wrong.

Hi, Nice benchmark, but I think the culprit is the random number generator. For simplicity, here is a 1-1 translation of the C++ code:

...

main = do args <- getArgs let samples = read $ head args randomNumbers <- genRandoms print (monteCarlo samples randomNumbers) genRandoms :: IO [Double] genRandoms = liftM randoms getStdGen monteCarlo :: Int -> [Double] -> Double monteCarlo samples = go 0 samples where go cnt 0 _ = (4.0 :: Double) * (fromIntegral cnt) / (fromIntegral > samples) go cnt k (x:y:rs) | sqrt (x*x + y*y) <= 1.0 = go (cnt+1) (k-1) rs | otherwise = go cnt (k-1) rs

The profiling shows that most time is spend in genRandoms: genRandoms Main 199 1 96.5 98.3 96.5 98.3 When replacing genRandoms by (repeat 0.5), the haskell code is only a little bit slower than the C++ code *with* random number generation. Of course, the random number generation in C++ takes time too:

...

montecarlo(0) $ pprof --text ./monte_carlo /tmp/profile Total: 6691 samples 3928 58.7% 58.7% 5965 89.1% _main 2134 31.9% 90.6% 2134 31.9% _do_rand 435 6.5% 97.1% 435 6.5% _rand 194 2.9% 100.0% 194 2.9% 0x00003072 0 0.0% 100.0% 6691 100.0% __mh_execute_header 0 0.0% 100.0% 6691 100.0% start

But the factor 100 is clearly due to genRandoms. cheers, benedikt

...

Profiling says that the majority of the time is spend in "main". But i have no idea where.

Can someone give me a hint?

Thanks, Lenny

individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 main Main 254 1 88.1 90.8 100.0 100.0 monteCarloPi Main 255 1 0.6 1.1 11.9 9.2 pairs Main 257 10000000 0.7 1.4 0.7 1.4 countHits Main 256 10000001 4.2 2.9 10.6 6.7 accumulateHit Main 258 27852236 3.0 2.3 6.4 3.8 isInCircle Main 259 30000000 3.3 1.5 3.3 1.5 CAF:lit_r1A7 Main 248 1 0.0 0.0 0.0 0.0 isInCircle Main 260 0 0.0 0.0 0.0 0.0

Roel van Dijk

11 a.m.

I replaced the standard random number generated with the one from mersenne-random. On my system this makes the resulting program about 14 times faster than the original. I also made a change to accumulateHit because it doesn't need to count to total. That is already known. {-# LANGUAGE BangPatterns #-} import System( getArgs ) import Data.List( foldl' ) import System.Random.Mersenne pairs :: [a] -> [(a,a)] pairs [] = [] pairs (x:[]) = [] pairs (x:y:rest) = (x, y) : pairs rest isInCircle :: (Double, Double) -> Bool isInCircle (x,y) = sqrt (x*x + y*y) <= 1.0 accumulateHit :: Int -> (Double, Double) -> Int accumulateHit (!hits) pair | isInCircle pair = hits + 1 | otherwise = hits countHits :: [(Double, Double)] -> Int countHits ps = foldl' accumulateHit 0 ps monteCarloPi :: Int -> [(Double, Double)] -> Double monteCarloPi n xs = 4.0 * fromIntegral hits / fromIntegral n where hits = countHits $ take n xs main = do args <- getArgs let samples = read $ head args randomNumberGenerator <- getStdGen randomNumbers <- randoms randomNumberGenerator let res = monteCarloPi samples $ pairs randomNumbers putStrLn $ show $ res

Bayley, Alistair

11:35 a.m.

New subject: seg-fault in mersenne-random with SSE2 (was Performance question)

...

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Roel van Dijk

I replaced the standard random number generated with the one from mersenne-random. On my system this makes the resulting program about 14 times faster than the original. I also made a change to accumulateHit because it doesn't need to count to total. That is already known.

I tried this too, but got a seg fault (!), so I stripped it back to a small test program. This is with mersenne-random, setup configured with -fuse_sse2: module Main (main) where import System.Random.Mersenne import System (getArgs) rands :: Int -> IO [Double] rands samples = do rng <- getStdGen rns <- randoms rng return (take (2*samples) rns) main = do args <- getArgs let samples = read (head args) randomNumbers <- rands samples print randomNumbers On WinXp with ghc-6.10.1 it always seg faults. If I rebuild without -fuse-sse2 then it's happy. I'm assuming that my machine has SSE2; it's a little old, but it's a P4 (3GHz), so I think it should. Is there an easy way to tell? Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Don Stewart

4:24 p.m.

New subject: seg-fault in mersenne-random with SSE2 (was Performance question)

Alistair.Bayley:

...

...
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Roel van Dijk

I replaced the standard random number generated with the one from mersenne-random. On my system this makes the resulting program about 14 times faster than the original. I also made a change to accumulateHit because it doesn't need to count to total. That is already known.

I tried this too, but got a seg fault (!), so I stripped it back to a small test program. This is with mersenne-random, setup configured with -fuse_sse2:

This in the past has always meant: wrong architecture (or GCC can't handle sse2 on your system) -- Don

Peter Verswyvelen

4:48 p.m.

New subject: seg-fault in mersenne-random with SSE2 (was Performance question)

Might also mean bad alignment of data structures, maybe when marshalling data between the GHC runtime and the C implementation? If I recall correctly SSE2 requires 16-byte alignment. On Thu, Feb 26, 2009 at 5:24 PM, Don Stewart wrote:

...

Alistair.Bayley:

...
...
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Roel van Dijk

I replaced the standard random number generated with the one from mersenne-random. On my system this makes the resulting program about 14 times faster than the original. I also made a change to accumulateHit because it doesn't need to count to total. That is already known.

I tried this too, but got a seg fault (!), so I stripped it back to a small test program. This is with mersenne-random, setup configured with -fuse_sse2:

This in the past has always meant: wrong architecture (or GCC can't handle sse2 on your system)

-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Don Stewart

5:23 p.m.

vandijk.roel:

...

I replaced the standard random number generated with the one from mersenne-random. On my system this makes the resulting program about 14 times faster than the original. I also made a change to accumulateHit because it doesn't need to count to total. That is already known.

{-# LANGUAGE BangPatterns #-}

import System( getArgs ) import Data.List( foldl' )

import System.Random.Mersenne

pairs :: [a] -> [(a,a)] pairs [] = [] pairs (x:[]) = [] pairs (x:y:rest) = (x, y) : pairs rest

isInCircle :: (Double, Double) -> Bool isInCircle (x,y) = sqrt (x*x + y*y) <= 1.0

accumulateHit :: Int -> (Double, Double) -> Int accumulateHit (!hits) pair | isInCircle pair = hits + 1 | otherwise = hits

countHits :: [(Double, Double)] -> Int countHits ps = foldl' accumulateHit 0 ps

monteCarloPi :: Int -> [(Double, Double)] -> Double monteCarloPi n xs = 4.0 * fromIntegral hits / fromIntegral n where hits = countHits $ take n xs

main = do args <- getArgs let samples = read $ head args

randomNumberGenerator <- getStdGen randomNumbers <- randoms randomNumberGenerator

let res = monteCarloPi samples $ pairs randomNumbers putStrLn $ show $ res

But note the lazy list of Double pairs, so the inner loop still looks like this though: $wlgo :: Int# -> [(Double, Double)] -> Int $wlgo = \ (ww_s1pv :: Int#) (w_s1px :: [(Double, Double)]) -> case w_s1px of wild_aTl { [] -> I# ww_s1pv; : x_aTp xs_aTq -> case x_aTp of wild1_B1 { (x1_ak3, y_ak5) -> case x1_ak3 of wild2_aX8 { D# x2_aXa -> case y_ak5 of wild3_XYs { D# x3_XYx -> case <=## (sqrtDouble# (+## (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx))) 1.0 of wild4_X1D { False -> $wlgo ww_s1pv xs_aTq; True -> $wlgo (+# ww_s1pv 1) xs_aTq } while we want to keep everything in registers with something like: Int# -> Double# -> Double# -> Int# So we'll be paying a penalty to force the next elem of the list (instead of just calling the Double generator). This definitely has an impact on performance. $ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields $ time ./B 10000000 3.1407688 ./B 10000000 2.41s user 0.01s system 99% cpu 2.415 total Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give us a decent lower bound. {-# LANGUAGE BangPatterns #-} import System.Environment import System.Random.Mersenne isInCircle :: Double -> Double -> Bool isInCircle x y = sqrt (x*x + y*y) <= 1.0 countHits :: Int -> IO Int countHits lim = do g <- newMTGen Nothing let go :: Int -> Int -> IO Int go !throws !hits | throws >= lim = return hits | otherwise = do x <- random g -- use mersenne-random-pure64 to stay pure! y <- random g if isInCircle x y then go (throws+1) (hits+1) else go (throws+1) hits go 0 0 monteCarloPi :: Int -> IO Double monteCarloPi n = do hits <- countHits n return $ 4.0 * fromIntegral hits / fromIntegral n main = do [n] <- getArgs res <- monteCarloPi (read n) print res And now the inner loop looks like: $wa_s1yW :: Int# -> Int# -> State# RealWorld -> (# State# RealWorld, Int #) Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad. But at least does away with heap allocs in the inner loop! How does it go: $ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields $ time ./A 10000000 3.1412564 ./A 10000000 0.81s user 0.00s system 99% cpu 0.818 total Ok. So 3 times faster. Now the goal is to recover the high level version. We have many tools to employ: switching to mersenne-random-pure64 might help here. And seeing if you can fuse filling a uvector with randoms, and folding over it... t -- Don

Roel van Dijk

7:44 p.m.

On Thu, Feb 26, 2009 at 6:23 PM, Don Stewart wrote:

...

But note the lazy list of Double pairs, so the inner loop still looks like this though:

$wlgo :: Int# -> [(Double, Double)] -> Int

$wlgo = \ (ww_s1pv :: Int#) (w_s1px :: [(Double, Double)]) -> case w_s1px of wild_aTl { [] -> I# ww_s1pv; : x_aTp xs_aTq -> case x_aTp of wild1_B1 { (x1_ak3, y_ak5) -> case x1_ak3 of wild2_aX8 { D# x2_aXa -> case y_ak5 of wild3_XYs { D# x3_XYx -> case <=## (sqrtDouble# (+## (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx))) 1.0 of wild4_X1D { False -> $wlgo ww_s1pv xs_aTq; True -> $wlgo (+# ww_s1pv 1) xs_aTq }

while we want to keep everything in registers with something like:

Int# -> Double# -> Double# -> Int#

So we'll be paying a penalty to force the next elem of the list (instead of just calling the Double generator). This definitely has an impact on performance.

$ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields

$ time ./B 10000000 3.1407688 ./B 10000000 2.41s user 0.01s system 99% cpu 2.415 total

Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give us a decent lower bound.

{-# LANGUAGE BangPatterns #-}

import System.Environment import System.Random.Mersenne

isInCircle :: Double -> Double -> Bool isInCircle x y = sqrt (x*x + y*y) <= 1.0

countHits :: Int -> IO Int countHits lim = do g <- newMTGen Nothing let go :: Int -> Int -> IO Int go !throws !hits | throws >= lim = return hits | otherwise = do x <- random g -- use mersenne-random-pure64 to stay pure! y <- random g if isInCircle x y then go (throws+1) (hits+1) else go (throws+1) hits go 0 0

monteCarloPi :: Int -> IO Double monteCarloPi n = do hits <- countHits n return $ 4.0 * fromIntegral hits / fromIntegral n

main = do [n] <- getArgs res <- monteCarloPi (read n) print res

And now the inner loop looks like:

$wa_s1yW :: Int# -> Int# -> State# RealWorld -> (# State# RealWorld, Int #)

Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad. But at least does away with heap allocs in the inner loop!

How does it go:

$ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields

$ time ./A 10000000 3.1412564 ./A 10000000 0.81s user 0.00s system 99% cpu 0.818 total

Ok. So 3 times faster. Now the goal is to recover the high level version. We have many tools to employ: switching to mersenne-random-pure64 might help here. And seeing if you can fuse filling a uvector with randoms, and folding over it... t

-- Don

Very nice! I also wrote a naive version which used uvector but it was about twice as slow as the original Haskell version. I wanted to write "lengthU . filterU isInCircle" because that clearly expresses the algorithm. Sadly I was at work and didn't have time for profile the program to see what was wrong. Still, I couldn't resist having a go at the problem :-)

5980

Age (days ago)

5980

Last active (days ago)

List overview

Download

12 comments

10 participants

participants (10)

Achim Schneider
Bayley, Alistair
Ben Lippmeier
Benedikt Huber
Don Stewart
haskell＠kudling.de
Ketil Malde
Peter Verswyvelen
Roel van Dijk
Thomas van Noort