Wow that sped it up 5 times.
I see that boxed Vector is 25% faster than IOArray.
What is the difference and when to use Vector,
when IOArray?
Thanks!

bmaxa@maxa:~/examples$ time ./Cumul +RTS -A1600M
boxed array
last 262486571 seconds 1.196
unboxed array
last 262486571 seconds 0.748
boxed vector
last 262486571 seconds 0.908
unboxed vector
last 262486571 seconds 0.720

real    0m3.805s
user    0m3.428s
sys     0m0.372s


> Date: Sat, 1 Dec 2012 12:20:37 -0500
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> From: dons00@gmail.com
> To: bmaxa@hotmail.com
> CC: haskell-cafe@haskell.org
>
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
>
> Compare the GC stats below, for
>
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
>
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
>
> Now, you can help out the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
> and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
>
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
>
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
>
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
>
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
>
> ---------------------------------
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
>
> main :: IO()
> main = do
>
> -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> -- performGC
> (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
> performGC
>
> test' s a = do
> putStrLn s
> begin <- getCPUTime
> init'' a
> partial_sum' a
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10**12)
> last <- G.read a (n'-1)
> printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 1000 * 1000
>
> init'' !a = init 0 (n'-1)
> where
> init :: Int -> Int -> IO ()
> init !k !n
> | k > n = return ()
> | otherwise = do
> let !x = fromIntegral $ k + k `div` 3
> G.write a k x
> init (k+1) n
>
>
>
> partial_sum' !a = do
> k <- G.read a 0
> ps 1 (n'-1) k
> where
> ps :: Int -> Int -> Int32 -> IO ()
> ps i n s
> | i > n = return ()
> | otherwise = do
> k <- G.read a i
> let !l = fromIntegral $ s + k
> G.write a i l
> ps (i+1) n l
>
>
> ---------------------------------
>
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
> 40,121,448 bytes allocated in the heap
> 88,355,272 bytes copied during GC
> 24,036,456 bytes maximum residency (6 sample(s))
> 380,632 bytes maximum slop
> 54 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 75.2% (75.9% elapsed)
>
> Alloc rate 359,655,602 bytes per MUT second
>
> ./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total
>
>
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
> 4,113,568 bytes allocated in the heap
> 11,288 bytes copied during GC
> 4,003,256 bytes maximum residency (3 sample(s))
> 182,856 bytes maximum slop
> 5 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.3% (1.3% elapsed)
>
> Alloc rate 51,416,660 bytes per MUT second
>
> ./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total
>
>
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
> 40,121,504 bytes allocated in the heap
> 8,032 bytes copied during GC
> 44,704 bytes maximum residency (2 sample(s))
> 20,832 bytes maximum slop
> 59 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.0% (1.0% elapsed)
>
> Productivity 97.4% of total user, 99.6% of total elapsed
>
> ./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total
>
>
>
> ---------------------------------
>
>
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa@hotmail.com> wrote:
> > I have made benchmark test inspired by
> > http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
> >
> > What surprised me is that unboxed array is much faster than boxed array.
> > Actually boxed array performance is on par with standard Haskell list
> > which is very slow.
> > All in all even unboxed array is about 10 times slower than Java version.
> > I don't understand why is even unboxed array so slow.
> > But! unboxed array consumes least amount of RAM.
> > (warning, program consumes more than 3gb of ram)
> >
> > bmaxa@maxa:~/examples$ time ./Cumul
> > boxed array
> > last 262486571 seconds 4.972
> > unboxed array
> > last 262486571 seconds 0.776
> > list
> > last 262486571 seconds 6.812
> >
> > real 0m13.086s
> > user 0m11.996s
> > sys 0m1.080s
> >
> > -------------------------------------------------------------------------
> > {-# LANGUAGE CPP, BangPatterns #-}
> > import System.CPUTime
> > import Text.Printf
> > import Data.Array.IO
> > import Data.Array.Base
> > import Data.Int
> > import Control.DeepSeq
> > import System.Mem
> >
> > main :: IO()
> > main = do
> > (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> > performGC
> > (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> > performGC
> > begin <- getCPUTime
> > printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > printf " seconds %.3f\n" (diff::Double)
> >
> > test s a = do
> > putStrLn s
> > begin <- getCPUTime
> > init' a
> > partial_sum a
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > last <- readArray a (n'-1)
> > printf "last %d seconds %.3f\n" last (diff::Double)
> >
> > n' :: Int
> > n' = 50 * 1000 * 1000
> >
> > type A = IOArray Int Int32
> > type B = IOUArray Int Int32
> >
> > init' a = do
> > (_,n) <- getBounds a
> > init a 0 n
> > where
> > init a k n
> > | k > n = return ()
> > | otherwise = do
> > let !x = fromIntegral $ k + k `div` 3
> > unsafeWrite a k x
> > init a (k+1) n
> >
> > partial_sum a = do
> > (_,n) <- getBounds a
> > k <- unsafeRead a 0
> > ps a 1 n k
> > where
> > ps a i n s
> > | i > n = return ()
> > | otherwise = do
> > k <- unsafeRead a i
> > let !l = fromIntegral $ s + k
> > unsafeWrite a i l
> > ps a (i+1) n l
> >
> > data' :: [Int32]
> > data' = [k + k `div` 3 | k <- [0..] ]
> >
> > sum' = scanl1 (+)
> >
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe@haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >