
Problem solved. The Haskell loop now wins.
After reading don's blog, I tried to make a test with both methods. The code is very simple, as following
module Main where import System import Numeric.LinearAlgebra
vsum1 :: Vector Double -> Double vsum1 v = constant 1 (dim v) <.> v
vsum2 :: Vector Double -> Double vsum2 v = go (d - 1) 0 where d = dim v go :: Int -> Double -> Double go 0 s = s + (v @> 0) go !j !s = go (j - 1) (s + (v @> j))
mean :: Vector Double -> Double mean v = vsum v / fromIntegral (dim v) where vsum = vsum1
main :: IO () main = do fn:nrow:ncol:_ <- getArgs print . mean . flatten =<< fromFile fn (read nrow, read ncol)
Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are
with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable, exactly two copies of vector with size of 40MB.
with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total
This is strange. A lot of extra usage of heap? Probably because '@>' is not efficient? So it looks like the inner-product approach wins with a fairly margin.
Looking at the core we get from vsum2, we see:
vsum2 compiles quite well
$wgo_s1Nn :: Int# -> Double# -> Double#
$wgo_s1Nn = \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) -> case ww3_s1Mc of ds_X17R { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R) of { D# y_a1KQ -> $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ) }; 0 -> +## ww4_s1Mg ww2_s1M7 }; } in $wgo_s1Nn (-# ww_s1Ms 1) 0.0
But note the return value from $wat is boxed, only to be immediately unboxed. That looks to be the source of the heap allocations.
Let's see if we can help that.
Ah, of course. The problem is the unsafePerformIO used to wrap up the 'peek'. This blocks the optimisations somewhat, after we unpack the Vector type.. So if we rewrite 'safeRead' to not use unsafePerformIO, but instead: safeRead v = inlinePerformIO . withForeignPtr (fptr v) inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r Along with this change to Vector: data Vector t = V { dim :: {-# UNPACK #-} !Int , fptr :: {-# UNPACK #-} !(ForeignPtr t) } and some inlining for the bounds checks. Following bytestrings, we see the performance go from, with a 180M input file: Goal: vsum1: $ ghc-core A.hs -optc-O2 -fvia-C -fbang-patterns $ time ./A data 4000 2500 0.9999727441161678 ./A data 4000 2500 5.37s user 0.20s system 99% cpu 5.609 total 160,081,136 bytes allocated in the heap 155 Mb total memory in use Before: vsum2-old: $wgo_s1Ns = \ (ww3_s1Me :: Int#) (ww4_s1Mi :: Double#) -> case ww3_s1Me of ds_X17Y { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Mu ww1_s1Mw (I# ds_X17Y) of wild1_a1Hg { D# y_a1Hi -> $wgo_s1Ns (-# ds_X17Y 1) (+## ww4_s1Mi y_a1Hi) }; 0 -> +## ww4_s1Mi ww2_s1M9 }; } in $wgo_s1Ns (-# ww_s1Mu 1) 0.0 ./A data 4000 2500 +RTS -sstderr 0.999972744115876 ./A data 4000 2500 +RTS -sstderr 6.04s user 0.15s system 99% cpu 6.203 total 1,121,416,400 bytes allocated in the heap 78 Mb total memory in use After: True -> case readDoubleOffAddr# @ RealWorld ww1_s1Nr ds_X180 realWorld# of wild2_a1HS { (# s2_a1HU, x_a1HV #) -> case touch# @ ForeignPtrContents ww2_s1Ns s2_a1HU of s_a1HG { __DEFAULT -> $wgo_s1Ok (-# ds_X180 1) (+## ww4_s1Ng x_a1HV) } No needless boxing. $ time ./A data 4000 2500 +RTS -sstderr ./A data 4000 2500 +RTS -sstderr 0.999972744115876 ./A data 4000 2500 +RTS -sstderr 5.41s user 0.10s system 99% cpu 5.548 total 80,071,072 bytes allocated in the heap 78 Mb total memory in use And the total runtime and allocation now is better than the fully 'C' version. Yay. Patch attached to the darcs version of hmatrix. Similar changes can likely be done to the other data types in hmatrix, for pretty much ideal code. -- Don