
Below are some notes on this for Simon PJ and Alberto. In general, GHC is doing very well here, with only one small wibble preventing the recursive version from running as fast as the C version.

Alberto Ruiz writes:

My experience is that Haskell allocation time is very fast and usually negligible in most non-trivial matrix computations.
A good thing about

    sum v = constant 1 (dim v) <.> v

is that a constant vector is efficiently created internally (not from an intermediate Haskell list), and the inner product will be computed by the possibly optimized BLAS version available on your machine.
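For instance, in a hypothetical GHCi session (the exact Show output may vary between hmatrix versions):

    > let v = fromList [1,2,3,4,5] :: Vector Double
    > constant 1 (dim v) <.> v
    15.0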
You can also write simple definitions like the following, which computes the average of the rows of a matrix as a simple vector-matrix product:

    mean m = w <> m
      where
        w = constant k r
        r = rows m
        k = 1 / fromIntegral r
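For example, with a small matrix built using hmatrix's (><) constructor (a made-up illustration, not from the thread):

    > let m = (2><3) [1,2,3,4,5,6] :: Matrix Double
    > mean m    -- averages the two rows: [2.5, 3.5, 4.5]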
I prefer high-level "index free" matrix operations to C-like code. Definitions are clearer, have fewer bugs, and are also more efficient if you use optimized numerical libs.
In any case, many algorithms can be solved by the recursive approach described by Tomas.
xj2106 writes:

After reading Don's blog post, I tried to make a test with both methods. The code is very simple, as follows:
    {-# LANGUAGE BangPatterns #-}
    module Main where

    import System.Environment (getArgs)
    import Numeric.LinearAlgebra

    -- sum via a BLAS inner product with an all-ones vector
    vsum1 :: Vector Double -> Double
    vsum1 v = constant 1 (dim v) <.> v

    -- sum via an explicit strict loop over the indices
    vsum2 :: Vector Double -> Double
    vsum2 v = go (d - 1) 0
      where
        d = dim v
        go :: Int -> Double -> Double
        go 0  s  = s + (v @> 0)
        go !j !s = go (j - 1) (s + (v @> j))

    mean :: Vector Double -> Double
    mean v = vsum v / fromIntegral (dim v)
      where vsum = vsum1   -- switch to vsum2 for the recursive version

    main :: IO ()
    main = do
        fn:nrow:ncol:_ <- getArgs
        print . mean . flatten =<< fromFile fn (read nrow, read ncol)
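To reproduce this, a data file can be generated with something like the sketch below. It assumes fromFile reads plain whitespace-separated ASCII numbers (how I understand hmatrix's loader to have behaved at the time); the file name "huge" matches the runs shown later.

    -- makehuge.hs (hypothetical helper): write 5,000,000 doubles, one per line
    main :: IO ()
    main = writeFile "huge" (unlines (map show [1 .. 5000000 :: Double]))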
Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are:
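Concretely, the compile and run steps look something like this (the module file name is my guess; "+RTS -sstderr" is what produces the GC statistics quoted below):

    $ ghc -O2 -optc-O2 --make Mean.hs -o vsum1
    $ ./vsum1 huge 5000000 1 +RTS -sstderr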
with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable: 5,000,000 doubles at 8 bytes each is 40 MB, so the 80 MB allocated is exactly two copies of the vector.
with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total
This is strange: a lot of extra heap usage. Probably because '@>' is not efficient? So it looks like the inner-product approach wins by a fair margin.
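One quick way to probe whether '@>' is the problem is to bypass it entirely. A minimal sketch, assuming Data.Packed.Internal.Vector exposes the dim and fptr accessors (as hmatrix's internal module did at the time):

    {-# LANGUAGE BangPatterns #-}
    import Data.Packed.Internal.Vector (Vector, dim, fptr)
    import Foreign.ForeignPtr (withForeignPtr)
    import Foreign.Ptr (Ptr)
    import Foreign.Storable (peekElemOff)
    import System.IO.Unsafe (unsafePerformIO)

    -- sum by raw pointer reads: no boxing through '@>', no bounds checks
    vsum3 :: Vector Double -> Double
    vsum3 v = unsafePerformIO $
        withForeignPtr (fptr v) $ \p -> go p (dim v - 1) 0
      where
        go :: Ptr Double -> Int -> Double -> IO Double
        go p !j !s
          | j < 0     = return s
          | otherwise = do x <- peekElemOff p j
                           go p (j - 1) (s + x)

If vsum3's allocation figures match vsum1's, the overhead lives in '@>' rather than in the recursion itself.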
Yes, I'd suspect that @> isn't fully unlifted, so you get some heap allocation on each index, each time around the loop? We'd have to look at how @> was defined to spot why.

I began with:

    $ cabal install hmatrix

which failed, due to missing linking against gfortran and gslcblas. Added those to the cabal file, and hmatrix was installed.

Looking at the core we get from vsum2, we see that it compiles quite well:

    $wgo_s1Nn :: Int# -> Double# -> Double#
    $wgo_s1Nn =
      \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) ->
        case ww3_s1Mc of ds_X17R {
          __DEFAULT ->
            case Data.Packed.Internal.Vector.$wat
                   @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R)
            of { D# y_a1KQ ->
              $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ)
            };
          0 -> +## ww4_s1Mg ww2_s1M7
        };
    } in $wgo_s1Nn (-# ww_s1Ms 1) 0.0

But note the return value from $wat is boxed, only to be immediately unboxed. That looks to be the source of the heap allocations. Let's see if we can help that. Vector is defined as:

    data Vector t = V { dim  :: Int          -- ^ number of elements
                      , fptr :: ForeignPtr t -- ^ foreign pointer to the memory block
                      }

While a much better definition would be:

    data Vector t = V { dim  :: {-# UNPACK #-} !Int            -- ^ number of elements
                      , fptr :: {-# UNPACK #-} !(ForeignPtr t) -- ^ foreign pointer to the memory block
                      }

And we can add some inlining to at' and at. That might be enough for GHC to see through to the D# boxing. Now they're fully unfolded and specialised into our source program:

    True ->
      case unsafeDupablePerformIO
             @ Double
             ((\ (eta1_a1KA :: State# RealWorld) ->
                 case noDuplicate# eta1_a1KA of s'_a1KB { __DEFAULT ->
                 case readDoubleOffAddr# @ RealWorld ww1_s1NB ds_X184 s'_a1KB
                 of wild2_a1Lc { (# s2_a1Le, x_a1Lf #) ->
                 case touch# @ ForeignPtrContents ww2_s1NC s2_a1Le
                 of s_a1KS { __DEFAULT ->
    BAD ----->       (# s_a1KS, D# x_a1Lf #)
                 } } })
              `cast` (sym ((:CoIO) Double)
                      :: State# RealWorld -> (# State# RealWorld, Double #)
                           ~
                         IO Double))
      of wild11_a1LC { D# y_a1LE ->
      $wgo_s1Oz (-# ds_X184 1) (+## ww4_s1Nq y_a1LE)

But still the readDoubleOffAddr# result is boxed into D#, and then unboxed in the loop. A GHC optimiser flaw? Possibly the same as http://hackage.haskell.org/trac/ghc/ticket/2289. Simon PJ, does that seem right?

vsum1, meanwhile, is pretty much unoptimised calls into C:

    vsum1 :: Data.Packed.Internal.Vector.Vector Double -> Double
    vsum1 =
      \ (v_anC :: Data.Packed.Internal.Vector.Vector Double) ->
        Numeric.LinearAlgebra.Linear.dot
          @ Double
          Data.Packed.Internal.Matrix.$f2
          (Data.Packed.Internal.Matrix.$sconstantAux
             Data.Packed.Internal.Matrix.cconstantR
             lit
             (case v_anC of tpl_B2 {
                Data.Packed.Internal.Vector.V ipv_B3 ipv1_B4 -> ipv_B3 }))
          v_anC

Alberto, I would modify the other types in hmatrix similarly:

    data Matrix t = MC { rows :: {-# UNPACK #-} !Int, ... cols :: !Int, cdat :: !(Vector t) }
                  | MF { rows :: {-# UNPACK #-} !Int, ... cols :: !Int, fdat :: !(Vector t) }

ensuring we can keep these things in registers. But it would be wise to have some benchmarking too. Alberto, what do you think? Do you have some test data we could play with, to see if unboxing the Haskell skin over the underlying C calls helps with Haskell-side loops and iterative calls?

In general, though, hmatrix looks *very* well written. I'd have good confidence in this library for performance work.

-- Don
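For the benchmarking Don asks about, a minimal harness along the following lines would do. It uses the criterion package, which is my suggestion rather than anything named in the thread, and reuses the two summation functions from above:

    {-# LANGUAGE BangPatterns #-}
    module Main where

    import Criterion.Main
    import Numeric.LinearAlgebra

    vsum1, vsum2 :: Vector Double -> Double
    vsum1 v = constant 1 (dim v) <.> v     -- BLAS inner product
    vsum2 v = go (dim v - 1) 0             -- explicit Haskell loop
      where
        go 0  s  = s + (v @> 0)
        go !j !s = go (j - 1) (s + (v @> j))

    main :: IO ()
    main = do
        let v = constant 1 5000000 :: Vector Double
        defaultMain
          [ bench "vsum1 (BLAS dot)"     (whnf vsum1 v)
          , bench "vsum2 (Haskell loop)" (whnf vsum2 v)
          ]

Any improvement from the UNPACK and INLINE changes should then show up directly in the vsum2 timing.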