
The obvious difference between boxed and unboxed arrays is that boxed
arrays are full of pointers to heap-allocated objects. This means you
pay an indirection on every access, spend much more time in GC chasing
pointers (though card marking helps), and generally do more
allocation.
Compare the GC stats (summarized here; full +RTS -s output is below):
* Boxed vector: 88M bytes copied, 75% of time in GC, 0.472s
* Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
So there's your main answer. The increased data density of unboxed
arrays helps as well.
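To make the layout difference concrete, here's a minimal sketch using
the immutable vector types (the mutable ones benchmarked below have the
same representation); the names and sizes are purely illustrative:

    import Data.Int
    import qualified Data.Vector         as BV  -- boxed
    import qualified Data.Vector.Unboxed as UV  -- unboxed

    boxed :: BV.Vector Int32
    boxed = BV.generate 1000 fromIntegral
    -- 1000 pointers, each to a separately heap-allocated Int32

    unboxed :: UV.Vector Int32
    unboxed = UV.generate 1000 fromIntegral
    -- one contiguous ~4000-byte buffer, no per-element pointers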
Now, you can help out the GC significantly by hinting at how much
you're going to allocate in the youngest generation (see the
ghc-gc-tune app for a methodical approach to this, though it needs
updating for GHC 7 --
http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settin...
and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-coll...
).
Use the +RTS -A flag to set the initial size of the youngest
generation (the allocation area) to the size of your array, and watch
the GC cost disappear. For our boxed vector, we'd use +RTS -A50M,
resulting in:
* Boxed vector: 8k copied, 1% of time in GC, 0.157s
Not bad: a 3x speedup from a single RTS flag. -A is very useful if
you are working with boxed, mutable arrays.
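If you want to reproduce this: with GHC 7 you generally need to link
the binary with -rtsopts before it will accept +RTS flags like -A and
-s on the command line. Assuming the source file is A.hs (as the
transcripts below suggest), the build and run steps look roughly like:

    $ ghc -O2 -rtsopts A.hs
    $ ./A +RTS -A50M -s

You can also bake the setting in at link time with
-with-rtsopts="-A50M" if you don't want to type it on every run.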
For reference, there's a generic version below that specializes based
on the vector type parameter.
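test' has no type signature in the code; for the curious, the one GHC
infers comes out roughly as the sketch below (writing it down needs
FlexibleContexts, and RealWorld comes from Control.Monad.Primitive),
and the concrete instantiation in main is what gets specialized:

    test' :: G.MVector v Int32 => String -> v RealWorld Int32 -> IO ()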
---------------------------------
{-# LANGUAGE BangPatterns #-}
import System.CPUTime
import Text.Printf
import Data.Int
import Control.DeepSeq
import System.Mem
import qualified Data.Vector.Mutable as V
import qualified Data.Vector.Unboxed.Mutable as U
import qualified Data.Vector.Generic.Mutable as G
main :: IO ()
main = do
    -- Uncomment the boxed pair (and comment out the unboxed pair)
    -- to benchmark the boxed-vector case instead:
    -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
    -- performGC
    (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
    performGC

test' s a = do
    putStrLn s
    begin <- getCPUTime
    init'' a
    partial_sum' a
    end <- getCPUTime
    let diff = (fromIntegral (end - begin)) / (10**12)
    last <- G.read a (n'-1)
    printf "last %d seconds %.3f\n" last (diff::Double)

n' :: Int
n' = 1000 * 1000

init'' !a = init 0 (n'-1)
  where
    init :: Int -> Int -> IO ()
    init !k !n
        | k > n     = return ()
        | otherwise = do
            let !x = fromIntegral $ k + k `div` 3
            G.write a k x
            init (k+1) n

partial_sum' !a = do
    k <- G.read a 0
    ps 1 (n'-1) k
  where
    ps :: Int -> Int -> Int32 -> IO ()
    ps i n s
        | i > n     = return ()
        | otherwise = do
            k <- G.read a i
            let !l = fromIntegral $ s + k
            G.write a i l
            ps (i+1) n l
---------------------------------
$ time ./A +RTS -s
boxed vector
last 945735787 seconds 0.420
40,121,448 bytes allocated in the heap
88,355,272 bytes copied during GC
24,036,456 bytes maximum residency (6 sample(s))
380,632 bytes maximum slop
54 MB total memory in use (0 MB lost due to fragmentation)
%GC time 75.2% (75.9% elapsed)
Alloc rate 359,655,602 bytes per MUT second
./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total
$ time ./A +RTS -s
unboxed vector
last 945735787 seconds 0.080
4,113,568 bytes allocated in the heap
11,288 bytes copied during GC
4,003,256 bytes maximum residency (3 sample(s))
182,856 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
%GC time 1.3% (1.3% elapsed)
Alloc rate 51,416,660 bytes per MUT second
./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total
$ time ./A +RTS -A50M -s
boxed vector
last 945735787 seconds 0.127
40,121,504 bytes allocated in the heap
8,032 bytes copied during GC
44,704 bytes maximum residency (2 sample(s))
20,832 bytes maximum slop
59 MB total memory in use (0 MB lost due to fragmentation)
%GC time 1.0% (1.0% elapsed)
Productivity 97.4% of total user, 99.6% of total elapsed
./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total
---------------------------------
On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic wrote:
I have made a benchmark test inspired by http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
What surprised me is that the unboxed array is much faster than the boxed array. Actually, the boxed array's performance is on par with a standard Haskell list, which is very slow. All in all, even the unboxed array is about 10 times slower than the Java version. I don't understand why even the unboxed array is so slow. But! The unboxed array consumes the least amount of RAM. (Warning: the program consumes more than 3 GB of RAM.)
bmaxa@maxa:~/examples$ time ./Cumul
boxed array
last 262486571 seconds 4.972
unboxed array
last 262486571 seconds 0.776
list
last 262486571 seconds 6.812

real 0m13.086s
user 0m11.996s
sys 0m1.080s
-------------------------------------------------------------------------
{-# LANGUAGE CPP, BangPatterns #-}
import System.CPUTime
import Text.Printf
import Data.Array.IO
import Data.Array.Base
import Data.Int
import Control.DeepSeq
import System.Mem

main :: IO ()
main = do
    (newArray_ (0,n'-1) :: IO A) >>= test "boxed array"
    performGC
    (newArray_ (0,n'-1) :: IO B) >>= test "unboxed array"
    performGC
    begin <- getCPUTime
    printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
    end <- getCPUTime
    let diff = (fromIntegral (end - begin)) / (10^12)
    printf " seconds %.3f\n" (diff::Double)

test s a = do
    putStrLn s
    begin <- getCPUTime
    init' a
    partial_sum a
    end <- getCPUTime
    let diff = (fromIntegral (end - begin)) / (10^12)
    last <- readArray a (n'-1)
    printf "last %d seconds %.3f\n" last (diff::Double)

n' :: Int
n' = 50 * 1000 * 1000

type A = IOArray Int Int32
type B = IOUArray Int Int32

init' a = do
    (_,n) <- getBounds a
    init a 0 n
  where
    init a k n
        | k > n     = return ()
        | otherwise = do
            let !x = fromIntegral $ k + k `div` 3
            unsafeWrite a k x
            init a (k+1) n

partial_sum a = do
    (_,n) <- getBounds a
    k <- unsafeRead a 0
    ps a 1 n k
  where
    ps a i n s
        | i > n     = return ()
        | otherwise = do
            k <- unsafeRead a i
            let !l = fromIntegral $ s + k
            unsafeWrite a i l
            ps a (i+1) n l

data' :: [Int32]
data' = [k + k `div` 3 | k <- [0..] ]

sum' = scanl1 (+)