
1) readArray m (i,j) yes, indeed. since we are dealing in bulk operations, we might as well take advantage of that, so dropping the repeated bounds-checks inside the loops makes a lot of sense.
no, i'm talking only about memory leaks here. of course, unsafeRead omits bounds checking, but more important in this case is that readArray created temporary memory cells: the index calculation for the matrix turns out not to be strict. it was the biggest surprise for me. i wasted a lot of time adding 'seq' here and there before i even tried to replace this call with unsafeRead
i'm not so sure about that conclusion;) i admit that i more often optimise for time than for space, so unless there's a real space leak to get rid of, i tend to measure space performance indirectly, by its impact on time, which is perhaps not a good habit. but i did a step-by-step rerun of the modifications, to see their effect on (time) performance:

    time ./CG array 100000: 33s
    time ./CG_Bulat array 100000: 8s

    33s: baseline, my original code
    30 - strict formal pars in l, in dotA/matA
    22 - inline l, for +*=/-*=
    14 - replace readArray m (i,j) by unsafeRead m (index .. (i,j)),
         replace index by unsafeIndex, eliminating bounds-check
    12 - same for readArray/writeArray v
    12 - eliminating the tuple in readMatrix makes no difference
     8 - seq-ing all parameters in l,*+=,dotA,matA

to handle the 2d indexing, i replaced readArray m (i,j) by readMatrix m (i,j):

    {-# INLINE readMatrix #-}
    readMatrix m ij = unsafeRead m (unsafeIndex matrixBounds ij)

    matrixBounds :: ((Int,Int),(Int,Int))
    matrixBounds = ((1,1),(n,n))

so we're still dealing with pairs, just got rid of the bounds-checks in readArray/index, and that alone brings us from 22s to 14s (12s if we do the same for vectors), a substantial improvement.

eliminating the tuples, passing i and j directly into the computation, doesn't seem to make any further difference (shifting the indices to avoid the decrements might, but not by much, certainly not enough to justify extending the arrays;-), so just getting rid of the bounds-check had sufficiently exposed the index computation already.

    -- readMatrix m i j = unsafeRead m $! ((i-1)*n+(j-1))

ensuring strictness of all formal parameters in the basic vector/matrix operations, through bang-patterns or seq, brings us from 33s to 30s, and from 12s to 8s, so that helps a lot. the inline pragma on l brings us from 30s to 22s, so that helps a lot, too.
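for anyone who wants to try the tuple-free variant, here is a minimal self-contained sketch of it. the names Matrix, newMatrix, checkedRead and demo, the choice of IOUArray with Double elements, and n = 3 are assumptions made for illustration, not taken from the CG program; only the flat index expression comes from the commented-out line above.

```haskell
import Data.Array.Base (unsafeRead)
import Data.Array.IO (IOUArray, newListArray, readArray)

-- assumption: a small fixed dimension, just for the demo
n :: Int
n = 3

-- assumption: unboxed mutable matrix indexed from (1,1) to (n,n)
type Matrix = IOUArray (Int, Int) Double

-- unchecked read: the row-major flat offset is computed by hand
-- and forced with ($!) before it is passed to unsafeRead
{-# INLINE readMatrix #-}
readMatrix :: Matrix -> Int -> Int -> IO Double
readMatrix m i j = unsafeRead m $! ((i - 1) * n + (j - 1))

-- bounds-checked read through the ordinary MArray interface
checkedRead :: Matrix -> Int -> Int -> IO Double
checkedRead m i j = readArray m (i, j)

-- fills the matrix with 1.0, 2.0, .. in index (row-major) order
newMatrix :: IO Matrix
newMatrix = newListArray ((1, 1), (n, n)) [1 ..]

-- reads the same cell both ways so the two results can be compared
demo :: IO (Double, Double)
demo = do
  m <- newMatrix
  a <- readMatrix m 2 3
  b <- checkedRead m 2 3
  return (a, b)
```

both reads should return the same element as long as (i,j) is within bounds; outside the bounds readMatrix is undefined behaviour, which is the price of dropping the check.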
afaik, ghc can't inline recursive functions. it would be great if ghc could automatically make a specialized version of a function it can't inline. so i wonder whether INLINE really helps at all?
perhaps it can't unroll the (conceptually infinite) body of the loop, but it can bring copies of the definition to the places where the op parameters are known.
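a rough sketch of the kind of definition this applies to: the recursion lives in a local worker, so the outer function is not itself recursive and its body (including the worker) can be copied to call sites where the operator is known. updateWith, axpyInto, Vector and demo are hypothetical names chosen for this sketch and are not the l, +*= or -*= of the original program; whether ghc actually specialises the copy depends on the compiler version and optimisation flags.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.IO (IOUArray, newListArray, getElems)

-- assumption: 0-based unboxed vectors, so the array index is also
-- the flat offset expected by unsafeRead/unsafeWrite
type Vector = IOUArray Int Double

-- generic in-place update v[i] := v[i] `op` w[i] for i in [0, len).
-- the loop itself is the local worker 'go'; updateWith is only a
-- non-recursive wrapper, so the INLINE pragma can apply to it
{-# INLINE updateWith #-}
updateWith :: (Double -> Double -> Double) -> Vector -> Vector -> Int -> IO ()
updateWith op v w len = go 0
  where
    go !i
      | i >= len = return ()
      | otherwise = do
          x <- unsafeRead v i
          y <- unsafeRead w i
          unsafeWrite v i (x `op` y)
          go (i + 1)

-- a call site with a known operator: v += a * w.
-- all four arguments are supplied, matching the left-hand side of
-- updateWith, which is what ghc requires before it inlines
axpyInto :: Double -> Vector -> Vector -> Int -> IO ()
axpyInto a v w len = updateWith (\x y -> x + a * y) v w len

-- small usage example on two three-element vectors
demo :: IO [Double]
demo = do
  v <- newListArray (0, 2) [1, 2, 3]
  w <- newListArray (0, 2) [10, 20, 30]
  axpyInto 2 v w 3
  getElems v
```

comparing the output of -ddump-simpl with and without the pragma is one way to see whether the operator was really substituted into the loop.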
(let f x = .. in f $! par) vs (let f !x = .. in f par)
so the difference is between passing evaluated parameters into functions that don't expect them and passing parameters to functions that expect them evaluated. thanks, that makes sense to me: apart from the boxing/unboxing of evaluated parameters, the function body itself might look different. thanks, claus
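the two styles side by side, as a small sketch. sumForcedByCaller and sumStrictCallee are hypothetical names, and a plain accumulating sum stands in for the vector/matrix operations; no claim is made about which one ghc compiles better, since that depends on the optimisation level and on what the strictness analyser finds on its own.

```haskell
{-# LANGUAGE BangPatterns #-}

-- the caller forces the accumulator with ($!) before each call;
-- the function itself does not declare that it needs it evaluated
sumForcedByCaller :: Int -> Int -> Int
sumForcedByCaller acc k
  | k == 0 = acc
  | otherwise = (sumForcedByCaller $! (acc + k)) (k - 1)

-- the function declares its parameters strict with bang patterns,
-- so callers can pass unevaluated arguments and the body can rely
-- on both being evaluated
sumStrictCallee :: Int -> Int -> Int
sumStrictCallee !acc !k
  | k == 0 = acc
  | otherwise = sumStrictCallee (acc + k) (k - 1)
```

both compute acc + (1 + 2 + .. + k) for k >= 0; only where the evaluation is requested differs.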