
On Thu, Feb 26, 2009 at 6:23 PM, Don Stewart
But note the lazy list of Double pairs, so the inner loop still looks like this though:
$wlgo :: Int# -> [(Double, Double)] -> Int
$wlgo = \ (ww_s1pv :: Int#) (w_s1px :: [(Double, Double)]) -> case w_s1px of wild_aTl { [] -> I# ww_s1pv; : x_aTp xs_aTq -> case x_aTp of wild1_B1 { (x1_ak3, y_ak5) -> case x1_ak3 of wild2_aX8 { D# x2_aXa -> case y_ak5 of wild3_XYs { D# x3_XYx -> case <=## (sqrtDouble# (+## (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx))) 1.0 of wild4_X1D { False -> $wlgo ww_s1pv xs_aTq; True -> $wlgo (+# ww_s1pv 1) xs_aTq }
while we want to keep everything in registers with something like:
Int# -> Double# -> Double# -> Int#
So we'll be paying a penalty to force the next elem of the list (instead of just calling the Double generator). This definitely has an impact on performance.
$ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
$ time ./B 10000000 3.1407688 ./B 10000000 2.41s user 0.01s system 99% cpu 2.415 total
Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give us a decent lower bound.
{-# LANGUAGE BangPatterns #-}
import System.Environment import System.Random.Mersenne
isInCircle :: Double -> Double -> Bool isInCircle x y = sqrt (x*x + y*y) <= 1.0
countHits :: Int -> IO Int countHits lim = do g <- newMTGen Nothing let go :: Int -> Int -> IO Int go !throws !hits | throws >= lim = return hits | otherwise = do x <- random g -- use mersenne-random-pure64 to stay pure! y <- random g if isInCircle x y then go (throws+1) (hits+1) else go (throws+1) hits go 0 0
monteCarloPi :: Int -> IO Double monteCarloPi n = do hits <- countHits n return $ 4.0 * fromIntegral hits / fromIntegral n
main = do [n] <- getArgs res <- monteCarloPi (read n) print res
And now the inner loop looks like:
$wa_s1yW :: Int# -> Int# -> State# RealWorld -> (# State# RealWorld, Int #)
Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad. But at least does away with heap allocs in the inner loop!
How does it go:
$ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
$ time ./A 10000000 3.1412564 ./A 10000000 0.81s user 0.00s system 99% cpu 0.818 total
Ok. So 3 times faster. Now the goal is to recover the high level version. We have many tools to employ: switching to mersenne-random-pure64 might help here. And seeing if you can fuse filling a uvector with randoms, and folding over it... t
-- Don
Very nice! I also wrote a naive version which used uvector but it was about twice as slow as the original Haskell version. I wanted to write "lengthU . filterU isInCircle" because that clearly expresses the algorithm. Sadly I was at work and didn't have time for profile the program to see what was wrong. Still, I couldn't resist having a go at the problem :-)