
On May 18, 2009, at 15:28 , Claus Reinke wrote:
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=>a' (check -ddump-tc, or -ddump-simpl output).
I understand from your later post that is was in fact specialized, but how do I make sure it _is_ specialized? Can I just add a type signature in the dist_fast definition for euclidean, or should I define euclidean outside of dist_fast, with an explicit type signature? If the latter, won't that hurt performance? Or should marking it INLINE take care of that?
After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet:
http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining
And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation.
This seems rather serious, and might be one of the bigger reasons why I'm getting nowhere close to C in terms of performance... The loop body is ridiculously small, so it would make sense to unroll it somewhat to help avoid the loop overhead. However, it seems like GHC isn't able to do that now. Is there any way to unroll the loop myself, to speed things up? Seems hard, because I'm using uvector... K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.hoste@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be