How to improve speed? (MersenneTwister is several times slower than C version)

Hi all, On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 10000000th rnd num the output is (code shown below): $ time ./testMTla Testing Mersenne Twister. Result is [3063349438] real 0m4.925s user 0m4.856s I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below): $ time ./testMT Testing Mersenne Twister. 3063349438 real 0m3.032s user 0m3.004s The original C-version (modified so that only the last number is shown) gives typically $ time ./mt19937ar outputs of genrand_int32() 3063349438 real 0m0.624s user 0m0.616s Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case. I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else? Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot! br, Isto ----------------------------- testMTla.hs (MersenneTwister, see HaWiki) module Main where -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMTla import MersenneTwister main = do putStrLn "Testing Mersenne Twister." let mt = mersenneTwister 100 w = take 1 (drop 9999999 mt) -- w = take 1 (drop 99 mt) putStrLn $ "Result is " ++ (show w) ----------------------------- ----------------------------- testMT.hs module Main where -- Compile eg with -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMT import Mersenne genRNums32 :: MT32 -> Int -> IO (MT32) genRNums32 mt nCnt = gRN mt nCnt where gRN :: MT32 -> Int -> IO (MT32) gRN mt nCnt | mt `seq` nCnt `seq` False = undefined gRN mt 1 = do (r,mt') <- next32 mt putStrLn $ (show r) return mt' gRN mt nCnt = do (r,mt') <- next32 mt gRN mt' $! (nCnt-1) main = do putStrLn "Testing Mersenne Twister." mt32 <- initialiseGenerator32 100 genRNums32 mt32 10000000 ----------------------------- ----------------------------- Mersenne.hs (sorry for linewraps) module Mersenne where import Data.Bits import Data.Word import Data.Array.Base import Data.Array.MArray import Data.Array.IO -- import System.Random data MT32 = MT32 (IOUArray Int Word32) Int data MT64 = MT64 (IOUArray Int Word64) Int last32bitsof :: Word32 -> Word32 last32bitsof a = a .&. 0xffffffff -- == (2^32-1) lm32 = 0x7fffffff :: Word32 um32 = 0x80000000 :: Word32 mA32 = 0x9908b0df :: Word32 -- == 2567483615 -- Array of length 624. initialiseGenerator32 :: Int -> IO MT32 initialiseGenerator32 seed = do let s = last32bitsof (fromIntegral seed)::Word32 mt <- newArray (0,623) (0::Word32) unsafeWrite mt 0 s iG mt s 1 mt' <- generateNumbers32 mt return (MT32 mt' 0) where iG :: (IOUArray Int Word32) -> Word32 -> Int -> IO (IOUArray Int Word32) iG mt lastNro n | n == 624 = return mt | otherwise = do let n1 = lastNro `xor` (shiftR lastNro 30) new = (1812433253 * n1 + (fromIntegral n)::Word32) unsafeWrite mt n new iG mt new (n+1) generateNumbers32 :: (IOUArray Int Word32) -> IO (IOUArray Int Word32) generateNumbers32 mt = gLoop 0 mt where gLoop :: Int -> (IOUArray Int Word32) -> IO (IOUArray Int Word32) gLoop i mt | i==623 = do wL <- unsafeRead mt 623 w0 <- unsafeRead mt 0 w396 <- unsafeRead mt 396 let y = (wL .&. um32) .|. (w0 .&. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor` (shiftR y 1)) else unsafeWrite mt 623 (w396 `xor` (shiftR y 1) `xor` mA32) return mt | otherwise = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+397) `mod` 624) let y = (wi .&. um32) .|. (wi1 .&. lm32) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA32) gLoop (i+1) mt next32 :: MT32 -> IO (Word32, MT32) next32 (MT32 mt i) | i >= 624 = do mt' <- generateNumbers32 mt let m = MT32 mt' (i `mod` 624) (w,m') <- next32 m return (w,m') | otherwise = do y <- unsafeRead mt i let y1 = y `xor` (shiftR y 11) y2 = y1 `xor` ((shiftL y1 7 ) .&. 0x9d2c5680) -- == 2636928640 y3 = y2 `xor` ((shiftL y2 15) .&. 0xefc60000) -- == 4022730752 y4 = y3 `xor` (shiftR y3 18) return $ (y4, MT32 mt (i+1)) mA64 = 0xB5026F5AA96619E9 :: Word64 um64 = 0xFFFFFFFF80000000 :: Word64 lm64 = 0x7FFFFFFF :: Word64 initialiseGenerator64 :: Int -> IO (MT64) initialiseGenerator64 seed = do let s = (fromIntegral seed)::Word64 mt <- newArray (0,311) (0::Word64) unsafeWrite mt 0 s iG mt s 1 generateNumbers64 mt return (MT64 mt 0) where iG :: (IOUArray Int Word64) -> Word64 -> Int -> IO (IOUArray Int Word64) iG mt lN i | mt `seq` lN `seq` i `seq` False = undefined iG mt lastNro 312 = return mt iG mt lastNro n = do let n1 = lastNro `xor` (shiftR lastNro 62) new = (6364136223846793005 * n1 + (fromIntegral n)::Word64) unsafeWrite mt n new iG mt new $! (n+1) generateNumbers64 :: (IOUArray Int Word64) -> IO () generateNumbers64 mt = gLoop 0 where gLoop :: Int -> IO () gLoop i | i `seq` False = undefined gLoop 311 = do wL <- unsafeRead mt 311 w0 <- unsafeRead mt 0 w155 <- unsafeRead mt 155 let y = (wL .&. um64) .|. (w0 .&. lm64) :: Word64 if even y then unsafeWrite mt 311 (w155 `xor` (shiftR y 1)) else unsafeWrite mt 311 (w155 `xor` (shiftR y 1) `xor` mA64) return () gLoop i = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+156) `mod` 312) let y = (wi .&. um64) .|. (wi1 .&. lm64) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA64) gLoop $! (i+1) next64 :: MT64 -> IO (Word64, MT64) next64 (MT64 mt 312) = do generateNumbers64 mt let m = MT64 mt 0 (w,m') <- next64 m return (w,m') next64 (MT64 mt i) = do y <- unsafeRead mt i let y1 = y `xor` ((shiftR y 29) .&. 0x5555555555555555) y2 = y1 `xor` ((shiftL y1 17) .&. 0x71D67FFFEDA60000) y3 = y2 `xor` ((shiftL y2 37) .&. 0xFFF7EEE000000000) y4 = y3 `xor` (shiftR y3 43) return $! (y4, MT64 mt (i+1))

Now, this will be hard to get close the the highly tuned C. Possibly its doable. The main tricks are documented here: http://haskell.org/haskellwiki/Performance/GHC#Unboxed_types Inspecting the Core to ensure the math is being inlined and unboxed will be the most crucial issue, I'd imagine. Then again, an FFI binding to mersenne.c is also a good idea :) -- Don isto.aho:
Hi all,
On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 10000000th rnd num the output is (code shown below):
$ time ./testMTla Testing Mersenne Twister. Result is [3063349438]
real 0m4.925s user 0m4.856s
I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below):
$ time ./testMT Testing Mersenne Twister. 3063349438
real 0m3.032s user 0m3.004s
The original C-version (modified so that only the last number is shown) gives typically
$ time ./mt19937ar outputs of genrand_int32() 3063349438
real 0m0.624s user 0m0.616s
Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case.
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else?
Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot!
br, Isto
----------------------------- testMTla.hs (MersenneTwister, see HaWiki) module Main where
-- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMTla
import MersenneTwister
main = do putStrLn "Testing Mersenne Twister." let mt = mersenneTwister 100 w = take 1 (drop 9999999 mt) -- w = take 1 (drop 99 mt) putStrLn $ "Result is " ++ (show w) -----------------------------
----------------------------- testMT.hs module Main where
-- Compile eg with -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMT
import Mersenne
genRNums32 :: MT32 -> Int -> IO (MT32) genRNums32 mt nCnt = gRN mt nCnt where gRN :: MT32 -> Int -> IO (MT32) gRN mt nCnt | mt `seq` nCnt `seq` False = undefined gRN mt 1 = do (r,mt') <- next32 mt putStrLn $ (show r) return mt' gRN mt nCnt = do (r,mt') <- next32 mt gRN mt' $! (nCnt-1)
main = do putStrLn "Testing Mersenne Twister." mt32 <- initialiseGenerator32 100 genRNums32 mt32 10000000 -----------------------------
----------------------------- Mersenne.hs (sorry for linewraps) module Mersenne where
import Data.Bits import Data.Word import Data.Array.Base import Data.Array.MArray import Data.Array.IO -- import System.Random
data MT32 = MT32 (IOUArray Int Word32) Int data MT64 = MT64 (IOUArray Int Word64) Int
last32bitsof :: Word32 -> Word32 last32bitsof a = a .&. 0xffffffff -- == (2^32-1)
lm32 = 0x7fffffff :: Word32 um32 = 0x80000000 :: Word32 mA32 = 0x9908b0df :: Word32 -- == 2567483615
-- Array of length 624. initialiseGenerator32 :: Int -> IO MT32 initialiseGenerator32 seed = do let s = last32bitsof (fromIntegral seed)::Word32 mt <- newArray (0,623) (0::Word32) unsafeWrite mt 0 s iG mt s 1 mt' <- generateNumbers32 mt return (MT32 mt' 0) where iG :: (IOUArray Int Word32) -> Word32 -> Int -> IO (IOUArray Int Word32) iG mt lastNro n | n == 624 = return mt | otherwise = do let n1 = lastNro `xor` (shiftR lastNro 30) new = (1812433253 * n1 + (fromIntegral n)::Word32) unsafeWrite mt n new iG mt new (n+1)
generateNumbers32 :: (IOUArray Int Word32) -> IO (IOUArray Int Word32) generateNumbers32 mt = gLoop 0 mt where gLoop :: Int -> (IOUArray Int Word32) -> IO (IOUArray Int Word32) gLoop i mt | i==623 = do wL <- unsafeRead mt 623 w0 <- unsafeRead mt 0 w396 <- unsafeRead mt 396 let y = (wL .&. um32) .|. (w0 .&. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor` (shiftR y 1)) else unsafeWrite mt 623 (w396 `xor` (shiftR y 1) `xor` mA32) return mt | otherwise = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+397) `mod` 624) let y = (wi .&. um32) .|. (wi1 .&. lm32) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA32) gLoop (i+1) mt
next32 :: MT32 -> IO (Word32, MT32) next32 (MT32 mt i) | i >= 624 = do mt' <- generateNumbers32 mt let m = MT32 mt' (i `mod` 624) (w,m') <- next32 m return (w,m') | otherwise = do y <- unsafeRead mt i let y1 = y `xor` (shiftR y 11) y2 = y1 `xor` ((shiftL y1 7 ) .&. 0x9d2c5680) -- == 2636928640 y3 = y2 `xor` ((shiftL y2 15) .&. 0xefc60000) -- == 4022730752 y4 = y3 `xor` (shiftR y3 18) return $ (y4, MT32 mt (i+1))
mA64 = 0xB5026F5AA96619E9 :: Word64 um64 = 0xFFFFFFFF80000000 :: Word64 lm64 = 0x7FFFFFFF :: Word64
initialiseGenerator64 :: Int -> IO (MT64) initialiseGenerator64 seed = do let s = (fromIntegral seed)::Word64 mt <- newArray (0,311) (0::Word64) unsafeWrite mt 0 s iG mt s 1 generateNumbers64 mt return (MT64 mt 0) where iG :: (IOUArray Int Word64) -> Word64 -> Int -> IO (IOUArray Int Word64) iG mt lN i | mt `seq` lN `seq` i `seq` False = undefined iG mt lastNro 312 = return mt iG mt lastNro n = do let n1 = lastNro `xor` (shiftR lastNro 62) new = (6364136223846793005 * n1 + (fromIntegral n)::Word64) unsafeWrite mt n new iG mt new $! (n+1)
generateNumbers64 :: (IOUArray Int Word64) -> IO () generateNumbers64 mt = gLoop 0 where gLoop :: Int -> IO () gLoop i | i `seq` False = undefined gLoop 311 = do wL <- unsafeRead mt 311 w0 <- unsafeRead mt 0 w155 <- unsafeRead mt 155 let y = (wL .&. um64) .|. (w0 .&. lm64) :: Word64 if even y then unsafeWrite mt 311 (w155 `xor` (shiftR y 1)) else unsafeWrite mt 311 (w155 `xor` (shiftR y 1) `xor` mA64) return () gLoop i = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+156) `mod` 312) let y = (wi .&. um64) .|. (wi1 .&. lm64) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA64) gLoop $! (i+1)
next64 :: MT64 -> IO (Word64, MT64) next64 (MT64 mt 312) = do generateNumbers64 mt let m = MT64 mt 0 (w,m') <- next64 m return (w,m') next64 (MT64 mt i) = do y <- unsafeRead mt i let y1 = y `xor` ((shiftR y 29) .&. 0x5555555555555555) y2 = y1 `xor` ((shiftL y1 17) .&. 0x71D67FFFEDA60000) y3 = y2 `xor` ((shiftL y2 37) .&. 0xFFF7EEE000000000) y4 = y3 `xor` (shiftR y3 43) return $! (y4, MT64 mt (i+1))
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

A big problem with the Mersenne Twister is the shifts. As has been noted elsewhere, ghc doesn't do such a great job on those. -- Lennart On Nov 1, 2006, at 20:17 , Donald Bruce Stewart wrote:
Now, this will be hard to get close the the highly tuned C. Possibly its doable.
The main tricks are documented here: http://haskell.org/haskellwiki/Performance/GHC#Unboxed_types
Inspecting the Core to ensure the math is being inlined and unboxed will be the most crucial issue, I'd imagine.
Then again, an FFI binding to mersenne.c is also a good idea :)
-- Don
isto.aho:
Hi all,
On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 10000000th rnd num the output is (code shown below):
$ time ./testMTla Testing Mersenne Twister. Result is [3063349438]
real 0m4.925s user 0m4.856s
I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below):
$ time ./testMT Testing Mersenne Twister. 3063349438
real 0m3.032s user 0m3.004s
The original C-version (modified so that only the last number is shown) gives typically
$ time ./mt19937ar outputs of genrand_int32() 3063349438
real 0m0.624s user 0m0.616s
Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case.
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else?
Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot!
br, Isto
----------------------------- testMTla.hs (MersenneTwister, see HaWiki) module Main where
-- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMTla
import MersenneTwister
main = do putStrLn "Testing Mersenne Twister." let mt = mersenneTwister 100 w = take 1 (drop 9999999 mt) -- w = take 1 (drop 99 mt) putStrLn $ "Result is " ++ (show w) -----------------------------
----------------------------- testMT.hs module Main where
-- Compile eg with -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMT
import Mersenne
genRNums32 :: MT32 -> Int -> IO (MT32) genRNums32 mt nCnt = gRN mt nCnt where gRN :: MT32 -> Int -> IO (MT32) gRN mt nCnt | mt `seq` nCnt `seq` False = undefined gRN mt 1 = do (r,mt') <- next32 mt putStrLn $ (show r) return mt' gRN mt nCnt = do (r,mt') <- next32 mt gRN mt' $! (nCnt-1)
main = do putStrLn "Testing Mersenne Twister." mt32 <- initialiseGenerator32 100 genRNums32 mt32 10000000 -----------------------------
----------------------------- Mersenne.hs (sorry for linewraps) module Mersenne where
import Data.Bits import Data.Word import Data.Array.Base import Data.Array.MArray import Data.Array.IO -- import System.Random
data MT32 = MT32 (IOUArray Int Word32) Int data MT64 = MT64 (IOUArray Int Word64) Int
last32bitsof :: Word32 -> Word32 last32bitsof a = a .&. 0xffffffff -- == (2^32-1)
lm32 = 0x7fffffff :: Word32 um32 = 0x80000000 :: Word32 mA32 = 0x9908b0df :: Word32 -- == 2567483615
-- Array of length 624. initialiseGenerator32 :: Int -> IO MT32 initialiseGenerator32 seed = do let s = last32bitsof (fromIntegral seed)::Word32 mt <- newArray (0,623) (0::Word32) unsafeWrite mt 0 s iG mt s 1 mt' <- generateNumbers32 mt return (MT32 mt' 0) where iG :: (IOUArray Int Word32) -> Word32 -> Int -> IO (IOUArray Int Word32) iG mt lastNro n | n == 624 = return mt | otherwise = do let n1 = lastNro `xor` (shiftR lastNro 30) new = (1812433253 * n1 + (fromIntegral n)::Word32) unsafeWrite mt n new iG mt new (n+1)
generateNumbers32 :: (IOUArray Int Word32) -> IO (IOUArray Int Word32) generateNumbers32 mt = gLoop 0 mt where gLoop :: Int -> (IOUArray Int Word32) -> IO (IOUArray Int Word32) gLoop i mt | i==623 = do wL <- unsafeRead mt 623 w0 <- unsafeRead mt 0 w396 <- unsafeRead mt 396 let y = (wL .&. um32) .|. (w0 .&. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor` (shiftR y 1)) else unsafeWrite mt 623 (w396 `xor` (shiftR y 1) `xor` mA32) return mt | otherwise = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+397) `mod` 624) let y = (wi .&. um32) .|. (wi1 .&. lm32) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA32) gLoop (i+1) mt
next32 :: MT32 -> IO (Word32, MT32) next32 (MT32 mt i) | i >= 624 = do mt' <- generateNumbers32 mt let m = MT32 mt' (i `mod` 624) (w,m') <- next32 m return (w,m') | otherwise = do y <- unsafeRead mt i let y1 = y `xor` (shiftR y 11) y2 = y1 `xor` ((shiftL y1 7 ) .&. 0x9d2c5680) -- == 2636928640 y3 = y2 `xor` ((shiftL y2 15) .&. 0xefc60000) -- == 4022730752 y4 = y3 `xor` (shiftR y3 18) return $ (y4, MT32 mt (i+1))
mA64 = 0xB5026F5AA96619E9 :: Word64 um64 = 0xFFFFFFFF80000000 :: Word64 lm64 = 0x7FFFFFFF :: Word64
initialiseGenerator64 :: Int -> IO (MT64) initialiseGenerator64 seed = do let s = (fromIntegral seed)::Word64 mt <- newArray (0,311) (0::Word64) unsafeWrite mt 0 s iG mt s 1 generateNumbers64 mt return (MT64 mt 0) where iG :: (IOUArray Int Word64) -> Word64 -> Int -> IO (IOUArray Int Word64) iG mt lN i | mt `seq` lN `seq` i `seq` False = undefined iG mt lastNro 312 = return mt iG mt lastNro n = do let n1 = lastNro `xor` (shiftR lastNro 62) new = (6364136223846793005 * n1 + (fromIntegral n)::Word64) unsafeWrite mt n new iG mt new $! (n+1)
generateNumbers64 :: (IOUArray Int Word64) -> IO () generateNumbers64 mt = gLoop 0 where gLoop :: Int -> IO () gLoop i | i `seq` False = undefined gLoop 311 = do wL <- unsafeRead mt 311 w0 <- unsafeRead mt 0 w155 <- unsafeRead mt 155 let y = (wL .&. um64) .|. (w0 .&. lm64) :: Word64 if even y then unsafeWrite mt 311 (w155 `xor` (shiftR y 1)) else unsafeWrite mt 311 (w155 `xor` (shiftR y 1) `xor` mA64) return () gLoop i = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+156) `mod` 312) let y = (wi .&. um64) .|. (wi1 .&. lm64) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA64) gLoop $! (i+1)
next64 :: MT64 -> IO (Word64, MT64) next64 (MT64 mt 312) = do generateNumbers64 mt let m = MT64 mt 0 (w,m') <- next64 m return (w,m') next64 (MT64 mt i) = do y <- unsafeRead mt i let y1 = y `xor` ((shiftR y 29) .&. 0x5555555555555555) y2 = y1 `xor` ((shiftL y1 17) .&. 0x71D67FFFEDA60000) y3 = y2 `xor` ((shiftL y2 37) .&. 0xFFF7EEE000000000) y4 = y3 `xor` (shiftR y3 43) return $! (y4, MT64 mt (i+1))
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 11/2/06, Lennart Augustsson
A big problem with the Mersenne Twister is the shifts. As has been noted elsewhere, ghc doesn't do such a great job on those.
Actually, the shifts are only evaluated once (hurrah for lazy evaluation) and with -funfolding-use-threshold=16 they're all compiled to unchecked primitives (GHC.Prim.uncheckedShiftRL#). -- Cheers, Lemmih

Hello Lennart, Thursday, November 2, 2006, 4:34:04 AM, you wrote:
A big problem with the Mersenne Twister is the shifts. As has been noted elsewhere, ghc doesn't do such a great job on those.
#ifdef __GLASGOW_HASKELL__ (I# a) <<# (I# b) = (I# (a `iShiftL#` b)) (I# a) >># (I# b) = (I# (a `uncheckedIShiftRL#` b)) #else /* ! __GLASGOW_HASKELL__ */ a <<# b = a `shiftL` b a >># b = a `shiftR` b #endif /* ! __GLASGOW_HASKELL__ */ -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 11/1/06, isto
Hi all,
On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 10000000th rnd num the output is (code shown below):
$ time ./testMTla Testing Mersenne Twister. Result is [3063349438]
real 0m4.925s user 0m4.856s
I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below):
$ time ./testMT Testing Mersenne Twister. 3063349438
real 0m3.032s user 0m3.004s
The original C-version (modified so that only the last number is shown) gives typically
$ time ./mt19937ar outputs of genrand_int32() 3063349438
real 0m0.624s user 0m0.616s
Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case.
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else?
Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot!
br, Isto
Greetings, Applying a few optimations can make it about 3x faster. 1. Hoist the array out of your loops. (See generateNumbers32, initialiseGenerator32 and genRNums). 2. Don't create too many new MT32 boxes. Most of the time is spent in 'next32' and changing its type to 'IOUArray Int Word32 -> Int -> IO (Word32, Int)' makes it much faster. 3. Demand more inlining. If you're using GHC, -funfolding-use-threshold=16 will substantially improve the performance. Using 'seq' is generally a bad idea. It can worsen the performance if not used carefully and GHCs strictness analyser is usually good enough. I used the profiler and -ddump-simpl to analyse this program. Donald suggested manual unboxing. However, in this case it won't help much (if at all) since GHC is doing such an excellent job on its own. -- Cheers, Lemmih

The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad? -- Lennart On Nov 1, 2006, at 20:51 , Lemmih wrote:
On 11/1/06, isto
wrote: Hi all,
On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 10000000th rnd num the output is (code shown below):
$ time ./testMTla Testing Mersenne Twister. Result is [3063349438]
real 0m4.925s user 0m4.856s
I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below):
$ time ./testMT Testing Mersenne Twister. 3063349438
real 0m3.032s user 0m3.004s
The original C-version (modified so that only the last number is shown) gives typically
$ time ./mt19937ar outputs of genrand_int32() 3063349438
real 0m0.624s user 0m0.616s
Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case.
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else?
Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot!
br, Isto
Greetings,
Applying a few optimations can make it about 3x faster.
1. Hoist the array out of your loops. (See generateNumbers32, initialiseGenerator32 and genRNums). 2. Don't create too many new MT32 boxes. Most of the time is spent in 'next32' and changing its type to 'IOUArray Int Word32 -> Int -> IO (Word32, Int)' makes it much faster. 3. Demand more inlining. If you're using GHC, -funfolding-use-threshold=16 will substantially improve the performance.
Using 'seq' is generally a bad idea. It can worsen the performance if not used carefully and GHCs strictness analyser is usually good enough. I used the profiler and -ddump-simpl to analyse this program.
Donald suggested manual unboxing. However, in this case it won't help much (if at all) since GHC is doing such an excellent job on its own.
-- Cheers, Lemmih _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi, When writing IO version, I wasn't aware of other twister versions, and the only reason is/was that it was easiest to me and that I knew (believed) that plain lists would have been inefficient. I just wanted to see and learn, how close to C version this can be made. (And still do.) There were some good suggestions on this thread - next I'll try to get grasp on how to apply the suggestions and do something... br, Isto ke, 2006-11-01 kello 22:04 -0500, Lennart Augustsson kirjoitti:
The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad?

Oh, sorry, I thought your version was a rewritten version of mine. :) The names are so similar, after all. On Nov 2, 2006, at 02:26 , isto wrote:
Hi,
When writing IO version, I wasn't aware of other twister versions, and the only reason is/was that it was easiest to me and that I knew (believed) that plain lists would have been inefficient. I just wanted to see and learn, how close to C version this can be made. (And still do.)
There were some good suggestions on this thread - next I'll try to get grasp on how to apply the suggestions and do something...
br, Isto
ke, 2006-11-01 kello 22:04 -0500, Lennart Augustsson kirjoitti:
The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad?

Hi & no problems (I didn't tell it clearly right away), I modified the code along the comments given by Lemmih and things improved a lot. mod-operator can be removed by two loops as in C version, which still further improved the speed. I tried this with the old version and the speed-up was next to nothing. The new version can be found below. Now, the results are comparable to the C version. On the second run it was even better, but usually it is about 0.78 on my machine. $ time ./testMT Testing Mersenne Twister. 3063349438 real 0m0.791s user 0m0.776s sys 0m0.008s $ time ./testMT Testing Mersenne Twister. 3063349438 real 0m0.414s user 0m0.400s Can somebody tell, why this happens? There were similar timings with ST-version and with the old version (so that it sometimes runs almost twice as fast or twice as slow than normally). Is this something system dependent (amd64/ubuntu)? C version seems to be within 10% that is between 0.60 and 0.66. Anyhow, this random generator now seems to randomly outperform this particular C version :) C version is located at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html About Lemmih's proposals. Did I follow them correctly? I'm using ghc and there the inlining demanding seemed to work. With =100 there are more often 0.4 timings than with =16, but also =100 gives 0.78 seconds often (majority of runs). And can something still be done? At this point I would leave FFI and other languages out as one goal here is to learn Haskelling and somehow solutions that require writing code only in one language sounds nicer. br, Isto to, 2006-11-02 kello 08:18 -0500, Lennart Augustsson kirjoitti:
Oh, sorry, I thought your version was a rewritten version of mine. :) The names are so similar, after all.
-------------------------- Mersenne.hs (64bit version removed) module Mersenne where import Data.Bits import Data.Word import Data.Array.Base import Data.Array.MArray import Data.Array.IO data MT32 = MT32 (IOUArray Int Word32) Int last32bitsof :: Word32 -> Word32 last32bitsof a = a .&. 0xffffffff -- == (2^32-1) lm32 = 0x7fffffff :: Word32 um32 = 0x80000000 :: Word32 mA32 = 0x9908b0df :: Word32 -- == 2567483615 -- Array of length 624. initialiseGenerator32 :: Int -> IO MT32 initialiseGenerator32 seed = do let s = last32bitsof (fromIntegral seed)::Word32 mt <- newArray (0,623) (0::Word32) unsafeWrite mt 0 s mtLoop mt s 1 generateNumbers32 mt return (MT32 mt 0) mtLoop :: (IOUArray Int Word32) -> Word32 -> Int -> IO (IOUArray Int Word32) mtLoop mt lastNro n = loop lastNro n where loop :: Word32 -> Int -> IO (IOUArray Int Word32) loop lastNro n | n == 624 = return mt | otherwise = do let n1 = lastNro `xor` (shiftR lastNro 30) new = (1812433253 * n1 + (fromIntegral n)::Word32) unsafeWrite mt n new loop new $! (n+1) generateNumbers32 :: (IOUArray Int Word32) -> IO () generateNumbers32 mt = do gLoop1 0 gLoop2 227 wL <- unsafeRead mt 623 w0 <- unsafeRead mt 0 w396 <- unsafeRead mt 396 let y = (wL .&. um32) .|. (w0 .&. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor` (shiftR y 1)) else unsafeWrite mt 623 (w396 `xor` (shiftR y 1) `xor` mA32) return () where gLoop1 :: Int -> IO () gLoop1 227 = return () gLoop1 i = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) -- w3 <- unsafeRead mt ((i+397) `mod` 624) w3 <- unsafeRead mt (i+397) let y = (wi .&. um32) .|. (wi1 .&. lm32) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA32) gLoop1 $! (i+1) gLoop2 :: Int -> IO () gLoop2 623 = return () gLoop2 i = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) -- w3 <- unsafeRead mt ((i+397) `mod` 624) w3 <- unsafeRead mt (i-227) let y = (wi .&. um32) .|. (wi1 .&. lm32) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA32) gLoop2 $! (i+1) {- gLoop :: Int -> IO () gLoop 623 = do wL <- unsafeRead mt 623 w0 <- unsafeRead mt 0 w396 <- unsafeRead mt 396 let y = (wL .&. um32) .|. (w0 .&. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor` (shiftR y 1)) else unsafeWrite mt 623 (w396 `xor` (shiftR y 1) `xor` mA32) return () gLoop i = do wi <- unsafeRead mt i wi1 <- unsafeRead mt (i+1) w3 <- unsafeRead mt ((i+397) `mod` 624) let y = (wi .&. um32) .|. (wi1 .&. lm32) if even y then unsafeWrite mt i (w3 `xor` (shiftR y 1)) else unsafeWrite mt i (w3 `xor` (shiftR y 1) `xor` mA32) gLoop $! (i+1) -} -- next32 :: MT32 -> IO (Word32, MT32) -- next32 (MT32 mt 624) = do -- next32 (MT32 mt i) = do next32 :: IOUArray Int Word32 -> Int -> IO (Word32, Int) next32 mt 624 = do generateNumbers32 mt -- let m = MT32 mt 0 -- (w,m') <- next32 m -- return (w,m') (w,iN) <- next32 mt 0 return (w,iN) next32 mt i = do y <- unsafeRead mt i let y1 = y `xor` (shiftR y 11) y2 = y1 `xor` ((shiftL y1 7 ) .&. 0x9d2c5680) -- == 2636928640 y3 = y2 `xor` ((shiftL y2 15) .&. 0xefc60000) -- == 4022730752 y4 = y3 `xor` (shiftR y3 18) return $ (y4, (i+1)) -- return $ (y4, MT32 mt (i+1)) -------------------------- -------------------------- testMT.hs module Main where -- Compile eg with -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision -- -funfolding-use-threshold=16 --make testMT import Mersenne import GHC.Prim -- genRNums32 :: MT32 -> Int -> IO Int -- genRNums32 mt nCnt = gRN mt nCnt genRNums32 :: MT32 -> Int -> IO MT32 genRNums32 (MT32 mt pos) nCnt = gRN nCnt pos where gRN :: Int -> Int -> IO MT32 gRN 1 iCurr = do (r,iNew) <- next32 mt iCurr putStrLn $ (show r) return (MT32 mt iNew) gRN nCnt iCurr = do (_,iNew) <- next32 mt iCurr gRN (nCnt-1) $! iNew main = do putStrLn "Testing Mersenne Twister." mt32 <- initialiseGenerator32 100 genRNums32 mt32 10000000 --------------------------

Hello Lennart, Thursday, November 2, 2006, 6:04:39 AM, you wrote:
The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad?
i think no. ST computations internally are really the same as IO ones -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

lemmih:
On 11/1/06, isto
wrote: Hi all,
On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 10000000th rnd num the output is (code shown below):
$ time ./testMTla Testing Mersenne Twister. Result is [3063349438]
real 0m4.925s user 0m4.856s
I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below):
$ time ./testMT Testing Mersenne Twister. 3063349438
real 0m3.032s user 0m3.004s
The original C-version (modified so that only the last number is shown) gives typically
$ time ./mt19937ar outputs of genrand_int32() 3063349438
real 0m0.624s user 0m0.616s
Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case.
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else?
Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot!
br, Isto
Greetings,
Applying a few optimations can make it about 3x faster.
1. Hoist the array out of your loops. (See generateNumbers32, initialiseGenerator32 and genRNums). 2. Don't create too many new MT32 boxes. Most of the time is spent in 'next32' and changing its type to 'IOUArray Int Word32 -> Int -> IO (Word32, Int)' makes it much faster. 3. Demand more inlining. If you're using GHC, -funfolding-use-threshold=16 will substantially improve the performance.
Using 'seq' is generally a bad idea. It can worsen the performance if not used carefully and GHCs strictness analyser is usually good enough. I used the profiler and -ddump-simpl to analyse this program.
Donald suggested manual unboxing. However, in this case it won't help much (if at all) since GHC is doing such an excellent job on its own.
I wasn't suggesting manual unboxing, more that you should carefully inspect the Core, and tune with bang patterns where necessary. -funfolding-use-threshold=16 is a good idea though. or =100 ;) -- Don

Hello isto, Thursday, November 2, 2006, 1:16:55 AM, you wrote:
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and
why you think it's a lazy? :) ST monad is just the same as IO monad internally, only types are different (there is also Lazy.ST monad - this is really lazy) 10-20 times difference is typical for GHC programs. i think that better results are observed only when C version is really bound by memory speed (that is rather typical situation) use JHC if you need to generate really fast code. or use ocaml/clean -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello isto,
Thursday, November 2, 2006, 1:16:55 AM, you wrote:
I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and
why you think it's a lazy? :) ST monad is just the same as IO monad internally, only types are different (there is also Lazy.ST monad - this is really lazy)
10-20 times difference is typical for GHC programs.
! It's really more like 2-4x. Sometimes better than C. Where's this huge figure coming from Bulat? If you have code that behaves like this, you should report it. -- Don

Hello Donald, Thursday, November 2, 2006, 2:21:31 PM, you wrote:
10-20 times difference is typical for GHC programs.
It's really more like 2-4x. Sometimes better than C.
Where's this huge figure coming from Bulat? If you have code that behaves like this, you should report it.
are you analyzed the cases where performance is close or even better? i does. it's either because C version is limited by memory performance or just use different, less efficient algorithm the cases which shows slowness of ghc-generated code is factorial algorithm and the program attached. despite that Haskell code is far uglier, C version outperforms it 20 times. run both programs with arrays of about 10k elements to see the difference: a.out 10000 elements 100000 iterations in February i've written detailed explanation (in ghc-users) of why this comes and made some suggestions on improving it. of course, main problem is ghc's own code generator which is far away from gcc or even ocaml ones -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (5)
-
Bulat Ziganshin
-
dons@cse.unsw.edu.au
-
isto
-
Lemmih
-
Lennart Augustsson