[Haskell-cafe] Re: blas bindings, why are they so much slower than C?

Last month Anatoly Yakovenko published some disturbing numbers showing that the Haskell BLAS bindings I wrote were significantly slower than plain C. I wanted to let everyone know that I've closed the performance gap: for ten million dot products, the overhead of using Haskell instead of C is now about 0.6 seconds on my machine, regardless of the size of the vectors.

The next version will incorporate the changes. If you can't wait for a formal release, the darcs repository is at http://www-stat.stanford.edu/~patperry/code/blas/

Anyone interested in more details can check out my blog: http://quantile95.com/2008/07/24/addressing-haskell-blas-performance-issues/

Thanks everyone for the input on this (especially Anatoly). If anyone else finds any performance discrepancies, please let me know and I will do whatever I can to fix them.

Patrick

Great work, Patrick!

So if I read correctly, the main change was to flatten the representation (and thus in loops the vector's structure will be unpacked and kept in registers, which isn't possible for sum types).

-- Don
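
To make the point about flattening concrete, here is a minimal sketch of the two shapes being contrasted. The names `VecSum`, `VecFlat`, `fromListFlat` and `dotFlat` are hypothetical and are not the types or functions of the blas package; the sketch only assumes that the old representation had a separate constructor for conjugated vectors and that the new one uses a single constructor with strict fields and a flag.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrArray, withForeignPtr)
import Foreign.Marshal.Array (pokeArray)
import Foreign.Storable (peekElemOff)

-- "Before" (hypothetical): a sum type. A conjugated vector wraps another
-- vector, so code must inspect the constructor before it can reach the
-- storage, offset and length.
data VecSum
  = Dense (ForeignPtr Double) Int Int   -- storage, offset, length
  | Conj VecSum

-- "After" (hypothetical): one constructor, strict fields, and a flag in
-- place of the extra constructor. This is the shape the thread describes
-- as one whose fields can be unpacked inside a loop.
data VecFlat = VecFlat
  { storage :: !(ForeignPtr Double)
  , offset  :: {-# UNPACK #-} !Int
  , len     :: {-# UNPACK #-} !Int
  , isConj  :: !Bool
  }

-- Copy a list into freshly allocated storage.
fromListFlat :: [Double] -> IO VecFlat
fromListFlat xs = do
  let n = length xs
  fp <- mallocForeignPtrArray n
  withForeignPtr fp $ \p -> pokeArray p xs
  return (VecFlat fp 0 n False)

-- Dot product written directly in Haskell (no BLAS call), assuming both
-- vectors have the same length. The conjugation flag is ignored because
-- the elements are real.
dotFlat :: VecFlat -> VecFlat -> IO Double
dotFlat (VecFlat fx ox n _) (VecFlat fy oy _ _) =
  withForeignPtr fx $ \px ->
  withForeignPtr fy $ \py ->
    let go !i !acc
          | i >= n    = return acc
          | otherwise = do
              a <- peekElemOff px (ox + i)
              b <- peekElemOff py (oy + i)
              go (i + 1) (acc + a * b)
    in go 0 0

main :: IO ()
main = do
  x <- fromListFlat [1, 2, 3]
  y <- fromListFlat [4, 5, 6]
  dotFlat x y >>= print   -- prints 32.0
```

Whether GHC actually keeps the fields in registers depends on optimisation level and inlining, so the sketch shows the data layout only and makes no claim about timings.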

Yeah, I think that's where most of the performance gains came from. I also added a re-write rule for unsafeGet dot (since it doesn't matter if the arguments are conjugated or not if the vectors are real) that shaved off about a tenth of a second.

Patrick

On Jul 24, 2008, at 4:26 PM, Don Stewart wrote:
Great work, Patrick!
So if I read correctly, the main change was to flatten the representation (and thus in loops the vector's structure will be unpacked and kept in registers, which isn't possible for sum types).
-- Don
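
The rewrite rule mentioned in Patrick's reply relies on conjugation being a no-op for real vectors. A rough illustration of that idea using GHC's `RULES` pragma follows. `Vec`, `conjV` and `dotV` are hypothetical stand-ins, and this is not the rule from the blas package, whose exact form is not given in the thread.

```haskell
module Main where

-- Hypothetical stand-in for a real-valued vector.
newtype Vec = Vec [Double]

-- Conjugation is the identity on real vectors. NOINLINE keeps the call
-- visible to the simplifier so the rules below have something to match.
conjV :: Vec -> Vec
conjV v = v
{-# NOINLINE conjV #-}

-- Dot product of two real vectors.
dotV :: Vec -> Vec -> Double
dotV (Vec xs) (Vec ys) = sum (zipWith (*) xs ys)
{-# NOINLINE dotV #-}

-- For real vectors a conjugated argument gives the same dot product, so
-- the conjugation can be dropped at compile time. Rules only fire when
-- optimisation is enabled (for example with -O).
{-# RULES
"dotV/conj-left"  forall x y. dotV (conjV x) y = dotV x y
"dotV/conj-right" forall x y. dotV x (conjV y) = dotV x y
  #-}

main :: IO ()
main = print (dotV (conjV (Vec [1, 2, 3])) (Vec [4, 5, 6]))   -- prints 32.0
```

The result is the same whether or not the rule fires; the rule only removes the redundant call. Compiling with `-ddump-rule-firings` shows whether GHC applied it.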
participants (2): Don Stewart, Patrick Perry