
Hi Kenneth, I wrote a benchmark similar to yours using the haskell blas library I wrote (latest version is on github at http://github.com/patperry/blas/tree/master , an older version is on hackage). The pure Haskell implementation is not very good, and the problem seems to be repeated boxing/unboxing of doubles. It may be possible to avoid this by writing a custom "fold" function for Vectors, but I haven't tried it. Using a vectorized subtraction function along with the BLAS1 "dnrm2" function, I was able to get performance about 2.8-5.4x worse than C. This isn't great, but it's not terrible, either. I didn't have the performance problems you ran in to using the FFI. My benchmark includes: distVectors1: a native implementation using foldl' and zipWith distVectors2: a vectorize implementation, defined as @distVectors2 x y = norm2 (x-y)@ distVectors3: a custom loop in haskell distVectors4: an FFI call to C All 4 functions use the "Vector" type, provided by the "blas" package. I compiled with "-O -fexcess-preceision" and ran timings for vectors of size 10, 100, and 1000. Here are the results: Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 10 n: 10 * distVectors1: ..................... 1.366ns per iteration / 732029.06 per second. * distVectors2: ..................... 1.286ns per iteration / 777734.02 per second. * distVectors3: ...................... 0.950ns per iteration / 1052847.93 per second. * distVectors4: ....................... 0.452ns per iteration / 2213752.56 per second. Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 100 n: 100 * distVectors1: .................. 12.496ns per iteration / 80025.79 per second. * distVectors2: .................... 2.149ns per iteration / 465311.61 per second. * distVectors3: .................. 8.728ns per iteration / 114572.53 per second. * distVectors4: ...................... 0.597ns per iteration / 1675765.65 per second. Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 1000 n: 1000 * distVectors1: .............. 125.920ns per iteration / 7941.52 per second. * distVectors2: .................. 10.677ns per iteration / 93661.99 per second. * distVectors3: ............... 85.533ns per iteration / 11691.42 per second. * distVectors4: .................... 1.962ns per iteration / 509668.17 per second. The "distVectors2" function is computing the norm in a more numerically-stable way. It is also allocating a temporary array to store x-y. I haven't looked over your benchmarking code too thoroughly, but parts of it seem suspect to me. Namely, you are using "mapM" instead of "mapM_". Also, it seems like you are including GC times in your benchmarks. Attached is my code. You will need the latest "blas" from github to compile it. You also might want to look into hmatrix. I have very little time for haskell code right now (I'm trying to finish my Ph.D. in the next month). This summer I plan to work more on this stuff, and I'll probably make a release in July. Patrick