
Hi, probably I am just being dumb, but what is the most efficient way to do a sum of every elements in a Vector of either hmatrix or blas? I know there is sum of absolute values from BLAS. So what about I want a plain "sum"? I can only think of the following two ways. 1. Using Data.List.foldl', so it is
sum' = foldl' (+) 0 . toList
in hmatrix or
sum' = foldl' (+) 0 . elems
in blas. 2. Using <.>, so it is
sum' v = constant 1 (dim v) <.> v
in hmatrix or
sum' v = constant (dim v) 1 <.> v
in blas. Which one is better? I guess it probably depends on the internal implementation of BLAS, but I am actually thinking in some thing similar to
t += *v++;
Any delightful idea to convert my mind from a C shaped one to a Haskell shaped one? Best, Xiao-Yong -- c/* __o/* <\ * (__ */\ <

Den Sunday 08 June 2008 00.45.01 skrev Xiao-Yong Jin:
Any delightful idea to convert my mind from a C shaped one to a Haskell shaped one?
You can never go wrong with a good old fashioned hand written tail recursion when you're in doubt, they are pretty much the closest thing to for-loops there is in haskell and should be easy to grok for Imperative programmers and usually produce really fast code. sum vect = let d = dim vect in sum' (d - 1) 0 where sum' 0 s = s + (vect @> 0) sum' index s = sum' (index - 1) (s + (vect @> index)) I don't know the hmatrix api very well, but if there is a function for computing the inner product between two vectors you could always do something like the following meta code: sum v = innerProduct v <1,1,1,1,1,1>
Best, Xiao-Yong

Tomas Andersson
You can never go wrong with a good old fashioned hand written tail recursion when you're in doubt, they are pretty much the closest thing to for-loops there is in haskell and should be easy to grok for Imperative programmers and usually produce really fast code.
sum vect = let d = dim vect in sum' (d - 1) 0 where sum' 0 s = s + (vect @> 0) sum' index s = sum' (index - 1) (s + (vect @> index))
Do I need the strict version of sum'? Put something like sum' a b | a `seq` b `seq` False = undefined Or ghc will optimize it automatically? I always don't know when such optimization is useful.
I don't know the hmatrix api very well, but if there is a function for computing the inner product between two vectors you could always do something like the following meta code:
sum v = innerProduct v <1,1,1,1,1,1>
I just doubt the efficiency here. If v is a very large vector, I guess the allocation time of that intermediate vector [1,1..] is not small. But it is just my guess. Xiao-Yong -- c/* __o/* <\ * (__ */\ <

xj2106:
Tomas Andersson
writes: You can never go wrong with a good old fashioned hand written tail recursion when you're in doubt, they are pretty much the closest thing to for-loops there is in haskell and should be easy to grok for Imperative programmers and usually produce really fast code.
sum vect = let d = dim vect in sum' (d - 1) 0 where sum' 0 s = s + (vect @> 0) sum' index s = sum' (index - 1) (s + (vect @> index))
Do I need the strict version of sum'? Put something like
sum' a b | a `seq` b `seq` False = undefined
Or ghc will optimize it automatically? I always don't know when such optimization is useful.
If you give a type annotation, so GHC can tell its an atomic type, like Int or Double, 's' will be inferred as strict. But if in doubt, use bang patterns: {-# LANGUAGE BangPatterns #-} sum vect = sum' (d-1) 0 where d = dim vect go :: Int -> Double -> Double -- for example go 0 s = s + (vect @> 0) go !i !s = go (i-1) (s + (vect @> i)) See this recent post on understanding how these kind of loops are compiled, http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16 The core for your loop should look very good, assuming @> is efficient. -- Don

Xiao-Yong Jin wrote:
Tomas Andersson
writes: You can never go wrong with a good old fashioned hand written tail recursion when you're in doubt, they are pretty much the closest thing to for-loops there is in haskell and should be easy to grok for Imperative programmers and usually produce really fast code.
sum vect = let d = dim vect in sum' (d - 1) 0 where sum' 0 s = s + (vect @> 0) sum' index s = sum' (index - 1) (s + (vect @> index))
Do I need the strict version of sum'? Put something like
sum' a b | a `seq` b `seq` False = undefined
Or ghc will optimize it automatically? I always don't know when such optimization is useful.
I don't know the hmatrix api very well, but if there is a function for computing the inner product between two vectors you could always do something like the following meta code:
sum v = innerProduct v <1,1,1,1,1,1>
I just doubt the efficiency here. If v is a very large vector, I guess the allocation time of that intermediate vector [1,1..] is not small. But it is just my guess.
My experience is that Haskell allocation time is very fast and usually negligible in most non trivial matrix computations. A good thing about sum v = constant 1 (dim v) <.> v is that a constant vector is efficiently created internally (not from an intermediate Haskell list), and the inner product will be computed by the possibly optimized blas version available in your machine. You can also write simple definitions like the next one for the average of the rows of a matrix as a simple vector-matrix product: mean m = w <> m where w = constant k r r = rows m k = 1 / fromIntegral r I prefer high level "index free" matrix operations to C-like code. Definitions are clearer, have less bugs, and are also more efficient if you use optimized numerical libs. In any case, many algorithms can be solved by the recursive approach described by Tomas. Alberto
Xiao-Yong

Alberto Ruiz
My experience is that Haskell allocation time is very fast and usually negligible in most non trivial matrix computations.
A good thing about
sum v = constant 1 (dim v) <.> v
is that a constant vector is efficiently created internally (not from an intermediate Haskell list), and the inner product will be computed by the possibly optimized blas version available in your machine.
You can also write simple definitions like the next one for the average of the rows of a matrix as a simple vector-matrix product:
mean m = w <> m where w = constant k r r = rows m k = 1 / fromIntegral r
I prefer high level "index free" matrix operations to C-like code. Definitions are clearer, have less bugs, and are also more efficient if you use optimized numerical libs.
In any case, many algorithms can be solved by the recursive approach described by Tomas.
After reading don's blog, I tried to make a test with both methods. The code is very simple, as following module Main where import System import Numeric.LinearAlgebra vsum1 :: Vector Double -> Double vsum1 v = constant 1 (dim v) <.> v vsum2 :: Vector Double -> Double vsum2 v = go (d - 1) 0 where d = dim v go :: Int -> Double -> Double go 0 s = s + (v @> 0) go !j !s = go (j - 1) (s + (v @> j)) mean :: Vector Double -> Double mean v = vsum v / fromIntegral (dim v) where vsum = vsum1 main :: IO () main = do fn:nrow:ncol:_ <- getArgs print . mean . flatten =<< fromFile fn (read nrow, read ncol) Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total This is reasonable, exactly two copies of vector with size of 40MB. with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total This is strange. A lot of extra usage of heap? Probably because '@>' is not efficient? So it looks like the inner-product approach wins with a fairly margin. -- c/* __o/* <\ * (__ */\ <

Below are some notes on this for Simon PJ and Alberto. In general, GHC is doing very well here, with only one small wibble preventing the recursive version running as fast as the C version. xj2106:
Alberto Ruiz
writes: My experience is that Haskell allocation time is very fast and usually negligible in most non trivial matrix computations.
A good thing about
sum v = constant 1 (dim v) <.> v
is that a constant vector is efficiently created internally (not from an intermediate Haskell list), and the inner product will be computed by the possibly optimized blas version available in your machine.
You can also write simple definitions like the next one for the average of the rows of a matrix as a simple vector-matrix product:
mean m = w <> m where w = constant k r r = rows m k = 1 / fromIntegral r
I prefer high level "index free" matrix operations to C-like code. Definitions are clearer, have less bugs, and are also more efficient if you use optimized numerical libs.
In any case, many algorithms can be solved by the recursive approach described by Tomas.
After reading don's blog, I tried to make a test with both methods. The code is very simple, as following
module Main where import System import Numeric.LinearAlgebra
vsum1 :: Vector Double -> Double vsum1 v = constant 1 (dim v) <.> v
vsum2 :: Vector Double -> Double vsum2 v = go (d - 1) 0 where d = dim v go :: Int -> Double -> Double go 0 s = s + (v @> 0) go !j !s = go (j - 1) (s + (v @> j))
mean :: Vector Double -> Double mean v = vsum v / fromIntegral (dim v) where vsum = vsum1
main :: IO () main = do fn:nrow:ncol:_ <- getArgs print . mean . flatten =<< fromFile fn (read nrow, read ncol)
Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are
with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable, exactly two copies of vector with size of 40MB.
with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total
This is strange. A lot of extra usage of heap? Probably because '@>' is not efficient? So it looks like the inner-product approach wins with a fairly margin.
Yes, I'd suspect that @> isn't fully unlifted, so you get some heap allocation on each index, each time around the loop? We'd have to look at how @> was defined to spot why. I began with: $ cabal install hmatrix which failed, due to missing linking against gfortran and glscblas Added those to the cabal file, and hmatrix was installed. Looking at the core we get from vsum2, we see: vsum2 compiles quite well $wgo_s1Nn :: Int# -> Double# -> Double# $wgo_s1Nn = \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) -> case ww3_s1Mc of ds_X17R { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R) of { D# y_a1KQ -> $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ) }; 0 -> +## ww4_s1Mg ww2_s1M7 }; } in $wgo_s1Nn (-# ww_s1Ms 1) 0.0 But note the return value from $wat is boxed, only to be immediately unboxed. That looks to be the source of the heap allocations. Let's see if we can help that. Vector is defined as: data Vector t = V { dim :: Int -- ^ number of elements , fptr :: ForeignPtr t -- ^ foreign pointer to the memory block } While a much better definition would be: data Vector t = V { dim :: {-# UNPACK #-} !Int -- ^ number of elements , fptr :: {-# UNPACK #-} !(ForeignPtr t) -- ^ foreign pointer to the memory block } And we can add some inlining to at' and at. That might be enough for GHC to see through to the D# boxing. Now they're fully unfolded and specialised into our source program, True -> case unsafeDupablePerformIO @ Double ((\ (eta1_a1KA :: State# RealWorld) -> case noDuplicate# eta1_a1KA of s'_a1KB { __DEFAULT -> case readDoubleOffAddr# @ RealWorld ww1_s1NB ds_X184 s'_a1KB of wild2_a1Lc { (# s2_a1Le, x_a1Lf #) -> case touch# @ ForeignPtrContents ww2_s1NC s2_a1Le of s_a1KS { __DEFAULT -> BAD -----> (# s_a1KS, D# x_a1Lf #) } } }) `cast` (sym ((:CoIO) Double) :: State# RealWorld -> (# State# RealWorld, Double #) ~ IO Double)) of wild11_a1LC { D# y_a1LE -> $wgo_s1Oz (-# ds_X184 1) (+## ww4_s1Nq y_a1LE) But still the readDoubleOffAddr# result is boxed into D#, and then unboxed in the loop. A GHC optimiser flaw? Possibly the same as: http://hackage.haskell.org/trac/ghc/ticket/2289 Simon PJ, does that seem right? While vsum1 is pretty much unoptimised calls into C; vsum1 :: Data.Packed.Internal.Vector.Vector Double -> Double vsum1 = \ (v_anC :: Data.Packed.Internal.Vector.Vector Double) -> Numeric.LinearAlgebra.Linear.dot @ Double Data.Packed.Internal.Matrix.$f2 (Data.Packed.Internal.Matrix.$sconstantAux Data.Packed.Internal.Matrix.cconstantR lit (case v_anC of tpl_B2 { Data.Packed.Internal.Vector.V ipv_B3 ipv1_B4 -> ipv_B3 })) v_anC Alberto, I would modify the other types in hmatrix similary, data Matrix t = MC { rows :: {-# UNPACK #-} !Int, ... cols :: !Int, cdat :: !(Vector t) } | MF { rows :: {-# UNPACK #-} !Int, ...cols :: !Int, fdat :: !(Vector t) } ensuring we can keep these things in registers. But it would be wise to have some benchmarking too. Alberto, what do you think? Do you have some test data we could play with to see if unboxing the haskell skin over the underlying C calls helps with Haskell-side loops and iterative calls? In general, though, hmatrix looks *very* well written. I'd have good confidence in this library for performance work. -- Don

Problem solved. The Haskell loop now wins.
After reading don's blog, I tried to make a test with both methods. The code is very simple, as following
module Main where import System import Numeric.LinearAlgebra
vsum1 :: Vector Double -> Double vsum1 v = constant 1 (dim v) <.> v
vsum2 :: Vector Double -> Double vsum2 v = go (d - 1) 0 where d = dim v go :: Int -> Double -> Double go 0 s = s + (v @> 0) go !j !s = go (j - 1) (s + (v @> j))
mean :: Vector Double -> Double mean v = vsum v / fromIntegral (dim v) where vsum = vsum1
main :: IO () main = do fn:nrow:ncol:_ <- getArgs print . mean . flatten =<< fromFile fn (read nrow, read ncol)
Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are
with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable, exactly two copies of vector with size of 40MB.
with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total
This is strange. A lot of extra usage of heap? Probably because '@>' is not efficient? So it looks like the inner-product approach wins with a fairly margin.
Looking at the core we get from vsum2, we see:
vsum2 compiles quite well
$wgo_s1Nn :: Int# -> Double# -> Double#
$wgo_s1Nn = \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) -> case ww3_s1Mc of ds_X17R { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R) of { D# y_a1KQ -> $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ) }; 0 -> +## ww4_s1Mg ww2_s1M7 }; } in $wgo_s1Nn (-# ww_s1Ms 1) 0.0
But note the return value from $wat is boxed, only to be immediately unboxed. That looks to be the source of the heap allocations.
Let's see if we can help that.
Ah, of course. The problem is the unsafePerformIO used to wrap up the 'peek'. This blocks the optimisations somewhat, after we unpack the Vector type.. So if we rewrite 'safeRead' to not use unsafePerformIO, but instead: safeRead v = inlinePerformIO . withForeignPtr (fptr v) inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r Along with this change to Vector: data Vector t = V { dim :: {-# UNPACK #-} !Int , fptr :: {-# UNPACK #-} !(ForeignPtr t) } and some inlining for the bounds checks. Following bytestrings, we see the performance go from, with a 180M input file: Goal: vsum1: $ ghc-core A.hs -optc-O2 -fvia-C -fbang-patterns $ time ./A data 4000 2500 0.9999727441161678 ./A data 4000 2500 5.37s user 0.20s system 99% cpu 5.609 total 160,081,136 bytes allocated in the heap 155 Mb total memory in use Before: vsum2-old: $wgo_s1Ns = \ (ww3_s1Me :: Int#) (ww4_s1Mi :: Double#) -> case ww3_s1Me of ds_X17Y { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Mu ww1_s1Mw (I# ds_X17Y) of wild1_a1Hg { D# y_a1Hi -> $wgo_s1Ns (-# ds_X17Y 1) (+## ww4_s1Mi y_a1Hi) }; 0 -> +## ww4_s1Mi ww2_s1M9 }; } in $wgo_s1Ns (-# ww_s1Mu 1) 0.0 ./A data 4000 2500 +RTS -sstderr 0.999972744115876 ./A data 4000 2500 +RTS -sstderr 6.04s user 0.15s system 99% cpu 6.203 total 1,121,416,400 bytes allocated in the heap 78 Mb total memory in use After: True -> case readDoubleOffAddr# @ RealWorld ww1_s1Nr ds_X180 realWorld# of wild2_a1HS { (# s2_a1HU, x_a1HV #) -> case touch# @ ForeignPtrContents ww2_s1Ns s2_a1HU of s_a1HG { __DEFAULT -> $wgo_s1Ok (-# ds_X180 1) (+## ww4_s1Ng x_a1HV) } No needless boxing. $ time ./A data 4000 2500 +RTS -sstderr ./A data 4000 2500 +RTS -sstderr 0.999972744115876 ./A data 4000 2500 +RTS -sstderr 5.41s user 0.10s system 99% cpu 5.548 total 80,071,072 bytes allocated in the heap 78 Mb total memory in use And the total runtime and allocation now is better than the fully 'C' version. Yay. Patch attached to the darcs version of hmatrix. Similar changes can likely be done to the other data types in hmatrix, for pretty much ideal code. -- Don

Patch applied! I will also make the recommended changes in the Matrix type and prepare some benchmarks for this kind of Haskell loops. In fact, such excellent performance will also be very useful in image processing algorithms which must read and write a large number of C array elements. I now think that some C auxiliary functions in the easyVision library for operations not available in the IPP can be replaced by much nicer Haskell versions. I'm currently working on the implementation of interest point descriptors based on histograms of local gradient directions. This may be a very good benchmark. I will prepare a first version and post here the results for discussion. Don, many thanks for your all your help! Alberto Don Stewart wrote:
Problem solved. The Haskell loop now wins.
After reading don's blog, I tried to make a test with both methods. The code is very simple, as following
module Main where import System import Numeric.LinearAlgebra
vsum1 :: Vector Double -> Double vsum1 v = constant 1 (dim v) <.> v
vsum2 :: Vector Double -> Double vsum2 v = go (d - 1) 0 where d = dim v go :: Int -> Double -> Double go 0 s = s + (v @> 0) go !j !s = go (j - 1) (s + (v @> j))
mean :: Vector Double -> Double mean v = vsum v / fromIntegral (dim v) where vsum = vsum1
main :: IO () main = do fn:nrow:ncol:_ <- getArgs print . mean . flatten =<< fromFile fn (read nrow, read ncol)
Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are
with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable, exactly two copies of vector with size of 40MB.
with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total
This is strange. A lot of extra usage of heap? Probably because '@>' is not efficient? So it looks like the inner-product approach wins with a fairly margin.
Looking at the core we get from vsum2, we see:
vsum2 compiles quite well
$wgo_s1Nn :: Int# -> Double# -> Double#
$wgo_s1Nn = \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) -> case ww3_s1Mc of ds_X17R { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R) of { D# y_a1KQ -> $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ) }; 0 -> +## ww4_s1Mg ww2_s1M7 }; } in $wgo_s1Nn (-# ww_s1Ms 1) 0.0
But note the return value from $wat is boxed, only to be immediately unboxed. That looks to be the source of the heap allocations.
Let's see if we can help that.
Ah, of course. The problem is the unsafePerformIO used to wrap up the 'peek'. This blocks the optimisations somewhat, after we unpack the Vector type..
So if we rewrite 'safeRead' to not use unsafePerformIO, but instead:
safeRead v = inlinePerformIO . withForeignPtr (fptr v)
inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r
Along with this change to Vector:
data Vector t = V { dim :: {-# UNPACK #-} !Int , fptr :: {-# UNPACK #-} !(ForeignPtr t) }
and some inlining for the bounds checks. Following bytestrings, we see the performance go from, with a 180M input file:
Goal: vsum1: $ ghc-core A.hs -optc-O2 -fvia-C -fbang-patterns $ time ./A data 4000 2500 0.9999727441161678
./A data 4000 2500 5.37s user 0.20s system 99% cpu 5.609 total 160,081,136 bytes allocated in the heap 155 Mb total memory in use
Before: vsum2-old:
$wgo_s1Ns = \ (ww3_s1Me :: Int#) (ww4_s1Mi :: Double#) -> case ww3_s1Me of ds_X17Y { __DEFAULT -> case Data.Packed.Internal.Vector.$wat @ Double Foreign.Storable.$f9 ww_s1Mu ww1_s1Mw (I# ds_X17Y) of wild1_a1Hg { D# y_a1Hi -> $wgo_s1Ns (-# ds_X17Y 1) (+## ww4_s1Mi y_a1Hi) }; 0 -> +## ww4_s1Mi ww2_s1M9 }; } in $wgo_s1Ns (-# ww_s1Mu 1) 0.0
./A data 4000 2500 +RTS -sstderr 0.999972744115876
./A data 4000 2500 +RTS -sstderr 6.04s user 0.15s system 99% cpu 6.203 total 1,121,416,400 bytes allocated in the heap 78 Mb total memory in use
After: True -> case readDoubleOffAddr# @ RealWorld ww1_s1Nr ds_X180 realWorld# of wild2_a1HS { (# s2_a1HU, x_a1HV #) -> case touch# @ ForeignPtrContents ww2_s1Ns s2_a1HU of s_a1HG { __DEFAULT -> $wgo_s1Ok (-# ds_X180 1) (+## ww4_s1Ng x_a1HV) }
No needless boxing.
$ time ./A data 4000 2500 +RTS -sstderr ./A data 4000 2500 +RTS -sstderr 0.999972744115876
./A data 4000 2500 +RTS -sstderr 5.41s user 0.10s system 99% cpu 5.548 total 80,071,072 bytes allocated in the heap
78 Mb total memory in use
And the total runtime and allocation now is better than the fully 'C' version. Yay.
Patch attached to the darcs version of hmatrix. Similar changes can likely be done to the other data types in hmatrix, for pretty much ideal code.
-- Don
participants (4)
-
Alberto Ruiz
-
Don Stewart
-
Tomas Andersson
-
Xiao-Yong Jin