Re: [Haskell-cafe] Performance question

26 Feb 2009


      On Thu, Feb 26, 2009 at 6:23 PM, Don Stewart  wrote:
...
But note the lazy list of Double pairs, so the inner loop still looks like this though:
   $wlgo :: Int# -> [(Double, Double)] -> Int
   $wlgo =
     \ (ww_s1pv :: Int#)
       (w_s1px :: [(Double, Double)]) ->
       case w_s1px of wild_aTl {
         [] -> I# ww_s1pv;
         : x_aTp xs_aTq ->
           case x_aTp of wild1_B1 { (x1_ak3, y_ak5) ->
           case x1_ak3 of wild2_aX8 { D# x2_aXa ->
           case y_ak5 of wild3_XYs { D# x3_XYx ->
           case <=##
                  (sqrtDouble#
                     (+##
                        (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx)))
                  1.0
           of wild4_X1D {
             False -> $wlgo ww_s1pv xs_aTq;
             True -> $wlgo (+# ww_s1pv 1) xs_aTq
           }
while we want to keep everything in registers with something like:
   Int# -> Double# -> Double# -> Int#
So we'll be paying a penalty to force the next elem of the list (instead of
just calling the Double generator).  This definitely has an impact on performance.
   $ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
   $ time ./B 10000000
   3.1407688
   ./B 10000000  2.41s user 0.01s system 99% cpu 2.415 total
Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give
us a decent lower bound.
   {-# LANGUAGE BangPatterns #-}
   import System.Environment
   import System.Random.Mersenne
   isInCircle :: Double -> Double -> Bool
   isInCircle x y = sqrt (x*x + y*y) <= 1.0
   countHits :: Int -> IO Int
   countHits lim = do
       g <- newMTGen Nothing
       let go :: Int -> Int -> IO Int
           go !throws !hits
               | throws >= lim  = return hits
               | otherwise = do
                   x <- random g   -- use mersenne-random-pure64 to stay pure!
                   y <- random g
                   if isInCircle x y
                       then go (throws+1) (hits+1)
                       else go (throws+1) hits
       go 0 0
   monteCarloPi :: Int -> IO Double
   monteCarloPi n = do
       hits <- countHits n
       return $ 4.0 * fromIntegral hits / fromIntegral n
   main = do
       [n] <- getArgs
       res <- monteCarloPi (read n)
       print res
And now the inner loop looks like:
     $wa_s1yW :: Int#
                 -> Int#
                 -> State# RealWorld
                 -> (# State# RealWorld, Int #)
Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad.
But at least does away with heap allocs in the inner loop!
How does it go:
   $ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
   $ time ./A 10000000
   3.1412564
   ./A 10000000  0.81s user 0.00s system 99% cpu 0.818 total
Ok. So 3 times faster. Now the goal is to recover the high level version.
We have many tools to employ: switching to mersenne-random-pure64 might help
here. And seeing if you can fuse filling a uvector with randoms, and folding
over it... t
-- Don
Very nice! I also wrote a naive version which used uvector but it was
about twice as slow as the original Haskell version. I wanted to write
"lengthU . filterU isInCircle" because that clearly expresses the
algorithm. Sadly I was at work and didn't have time for profile the
program to see what was wrong. Still, I couldn't resist having a go at
the problem :-)

Re: [Haskell-cafe] Performance question

Roel van Dijk