
Hi Bulat,
obviously, the arrays version should create no temporary cells.
that's why the memory traffic surprised me. i knew there had to be something wrong.
the problem was mainly due to 2 factors: 1) readArray m (i,j)
yes, indeed. since we are dealing in bulk operations anyway, we might as well take advantage of that, so dropping the repeated bounds-checks inside the loops makes a lot of sense. moral/note to self: bulk array operations are your friend (i knew that!-), but you need to remember that when defining them (unsafeRead et al are only for library writers, but library writers ought to use them; and here i was writing a small inlined library).
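just to make the point concrete, here is a minimal sketch of the kind of change involved - the names and array types are mine, not the code from the thread: inspect the bounds once, outside the loop, then use unsafeRead for the per-element accesses.

{-# LANGUAGE BangPatterns #-}
import Data.Array.IO   (IOUArray, getBounds)
import Data.Array.Base (unsafeRead)

-- hypothetical dot product over unboxed arrays, assumed 0-based so
-- the Ix index coincides with the flat offset unsafeRead expects;
-- the bounds are inspected once, and the tight loop does no checks
dotU :: IOUArray Int Double -> IOUArray Int Double -> IO Double
dotU a b = do
  (0,u) <- getBounds a
  let go !i !acc
        | i > u     = return acc
        | otherwise = do x <- unsafeRead a i
                         y <- unsafeRead b i
                         go (i+1) (acc + x*y)
  go 0 0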
2) 'op' in 'l', which was passed as a real closure and was not inlined, due to a weakness of the ghc optimizer
it irked me to write the same loop twice, so i factored it out, but i didn't consider the consequences. an INLINE pragma on l almost seems sufficient to regain the loss, so i prefer that; but writing out the loop twice is still a tiny tick faster.. (a sketch of such an 'l' follows below)
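for reference, here is what such an inlined loop combinator might look like - this is my reconstruction for illustration, not the exact 'l' from the code under discussion:

{-# LANGUAGE BangPatterns #-}

{-# INLINE l #-}
-- generic counting loop; the INLINE pragma lets ghc specialise the
-- loop at each call site, so 'op' gets compiled in directly
-- instead of being passed around as a closure
l :: Monad m => Int -> Int -> (acc -> Int -> m acc) -> acc -> m acc
l from to op = go from
  where
    go !i !acc
      | i > to    = return acc
      | otherwise = op acc i >>= go (i+1)

with the pragma in place, a use like 'l 0 n (\s i -> ...) 0' should compile to roughly the same code as the hand-written loop.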
we should help the strictness analyzer by marking all the variables used in tight loops as strict.
ah, there were a few surprises in that one. i thought i had considered the possible strictness problems, but obviously, i missed a few relevant possibilities. annotating everything is the safe option, and clearly documents the intentions, but i cannot help thinking about which annotations could be omitted:

- modArray: a and i are both used anyway
- the i index in the loop is definitely checked (but j isn't, and some others weren't, either; so better safe than sorry)
- some of the parameters need not be annotated in this example, but should be if one wanted to reuse the code elsewhere
- the one i really don't get yet is the one on the accumulators (!s in l, in dotA/matA); i thought i had covered that by using $! in applying the loop, but annotating the formal loop parameter, apart from being nicer, also seems faster.. (see the sketch below)

moral/note to self: don't try to be clever, try to be clear..; strictness in formal parameters is better than strictness in actual parameters; bang-patterns are good;-)
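to spell out that last point (an illustration in my own words, not the thread's code), the difference between forcing the actual argument at one call site and making the formal parameter itself strict looks roughly like this:

{-# LANGUAGE BangPatterns #-}

-- bang on the formal parameter: the accumulator is forced on every
-- iteration, no matter how the loop is entered
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

-- ($!) only at the outer application: the initial accumulator is
-- forced, but thunks can still pile up inside the recursion
sumTo' :: Int -> Int
sumTo' n = (go $! 0) 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

(in fairness, ghc's strictness analyser will often rescue the second version anyway; the point is just that the bang states the intention where it matters, on the formal parameter.)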
after that is done, we get 1000 times less temporary data allocated and 5x faster execution. now it's a bit faster than strict lists.
is this now anywhere near what the number-crunchers use, when they use Haskell?-)

Bulat, thanks for looking into it and for isolating the issues so quickly!-)

claus