
Hello,

For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space. So far, I've been disappointed by the performance of my best effort in Haskell compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me resolve this, and maybe also shed some light on some strange (imho) things I've run into.

My current best attempt uses the uvector package, takes two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions, which use streaming to compute the result:

  dist_fast :: UArr Double -> UArr Double -> Double
  dist_fast p1 p2 = sumDs `seq` sqrt sumDs
    where
      sumDs         = sumU ds
      ds            = zipWithU euclidean p1 p2
      euclidean x y = d*d
        where
          d = x-y

I've been benchmarking this function against various alternatives using the MicroBench [1] package, which allows one to get accurate timings of single function calls. I've also mimicked the MicroBench approach in pure C, to get comparable timings for a C-only implementation. The C-only function is quite straightforward:

  double dist(int dim, double* p1, double* p2){
      int i;
      double d = 0.0;
      for(i = 0; i < dim; i++){
          d += (p1[i] - p2[i]) * (p1[i] - p2[i]);
      }
      return sqrt(d);
  }

(Note that the optimizer takes care of the (p1[i] - p2[i]) subexpression appearing twice in the code.)

All code is attached if you'd like to play around with it. All numbers reported below are using GHC 6.10.2 and gcc 4.3.3 on Linux/x86. The compilation details can be found in the attached Makefile, but in short, I'm using -O2 -fexcess-precision or -O2 -fexcess-precision -fvia-C -optc-O3 with GHC, and -O3 with gcc.

Now the bad news: for small dimensions, e.g. 2D/3D space, the dist_fast function is 70-240x slower than a pure C implementation, depending on the architecture. For example, in 2D space on an Intel Pentium 4 (3.0GHz, 1MB L2 cache), a single call to dist_fast takes about 1.75 microseconds (or 0.00000175s), while a call to dist_c (the C implementation of Euclidean distance) takes about 0.025 microseconds (a 70x slowdown). On a Core 2 at 2.5GHz with 6MB L2 this maps to 1.9 and 0.008 microseconds respectively (i.e. 240x slower), while on a Core i7 at 2.66GHz with 8MB L2 the numbers are 1.53 and 0.011 microseconds (i.e. 140x slower). For larger dimensions the gap narrows, but it is still worrying: 10D: 40-110x; 100D: 10-17x; >1kD: 2.5-6x. I'm mostly interested in the range 10D to 100D, so seeing that Haskell is over 10x and up to 100x slower than C is kind of making me cry.

I've tried some things to improve on this, without much luck; on the contrary:

*) Marking dist_fast for inlining makes things worse: in general the inlined version is 2x slower for low dimensionality, and even 5x slower for larger dimensionality. This was somewhat surprising to me...

*) In a moment of weakness, I used the Foreign Function Interface to call the dist_c C-only implementation from Haskell. Unfortunately, there seems to be a lot of overhead in calling dist_c from Haskell. Most of the performance gain from using C melts away, and sometimes the performance of the FFI'd dist_c is 15-30% worse than the native dist_fast version (especially at low dimensionality). Only for the largest dimensionalities (10kD-100kD) does the FFI'd version reach the performance of the native C approach. But since I'm mostly interested in the 10-100D range, this is of little use to me.
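For reference, the FFI binding looks roughly like the sketch below. This is a simplified version that marshals through plain lists with withArray rather than the exact UArr-based marshalling in the attached code, and the wrapper name dist_ffi is just for illustration:

  {-# LANGUAGE ForeignFunctionInterface #-}
  import Foreign.C.Types      (CInt, CDouble)
  import Foreign.Ptr          (Ptr)
  import Foreign.Marshal.Array (withArray)

  -- binding to the C function 'dist' shown above
  foreign import ccall unsafe "dist"
    c_dist :: CInt -> Ptr CDouble -> Ptr CDouble -> IO CDouble

  -- hypothetical wrapper, simplified: marshals via plain lists
  dist_ffi :: [Double] -> [Double] -> IO Double
  dist_ffi p1 p2 =
    withArray (map realToFrac p1) $ \pp1 ->
      withArray (map realToFrac p2) $ \pp2 -> do
        d <- c_dist (fromIntegral (length p1)) pp1 pp2
        return (realToFrac d)

I suspect part of the overhead I'm seeing comes from getting the data into a form the C side can consume, on top of the per-call FFI cost itself, but I haven't pinned that down.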
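On the Haskell-only side, one variant still on my to-try list is writing the loop by hand over the unboxed arrays, to take the stream fusion machinery out of the picture entirely. A rough sketch, assuming the usual indexU/lengthU from Data.Array.Vector, and not yet benchmarked:

  {-# LANGUAGE BangPatterns #-}
  import Data.Array.Vector (UArr, indexU, lengthU)

  -- hand-rolled strict loop; a sketch only, not benchmarked
  dist_loop :: UArr Double -> UArr Double -> Double
  dist_loop p1 p2 = sqrt (go 0 0)
    where
      n = lengthU p1
      go :: Int -> Double -> Double
      go !i !acc
        | i >= n    = acc
        | otherwise = let d = indexU p1 i - indexU p2 i
                      in  go (i+1) (acc + d*d)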
One thing I noticed is that compiling through C using -fvia-C -optc-O3 might be a bad idea, depending on your system. On an Intel Pentium 4 system, -fvia-C -optc-O3 gave me a speedup of up to 70% (at large dimensionality), while on Core 2 and Core i7 it resulted in a slowdown of 15-20%! I was using roughly equal versions of GCC for this, i.e. a self-bootstrapped GCC 4.3.x.

So, my question to the list is simple: how can I get better performance out of a Haskell-only approach? Any comments/suggestions are highly appreciated.

I'd prefer a Haskell-only approach, but my main concern is speed. The Euclidean distance function will be used quite heavily in various tools. I currently have a C version of some of the tools, but the amount of code needed for those tools is becoming ridiculously big. I believe using Haskell will allow me to come up with an easier-to-maintain code base. However, I don't want to pay a huge price for this in terms of performance.

greetings,

Kenneth

[1] MicroBench: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/microbench

--
Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.hoste@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be