readArray is faster than unsafeRead

Hello,

I'm writing some matrix multiplication and inversion functions for small matrices (3x3 and 4x4 mostly, for 3D graphics, modeling, simulation, etc.). I noticed that the matrix multiplication was a bottleneck, so I set out to optimize it and found that using unsafeRead instead of (!) (or readArray in stateful code) helped a lot. I then went to optimize my Gaussian elimination function and found just the opposite: unsafeRead is slower than readArray. This struck me as very odd, considering that readArray calls unsafeRead.

If there is a "good" reason why the compiler optimizes readArray better than unsafeRead, I'd like to know what it is, so that I can make all my array code safe as well as fast. (By "good" reason I mean something deterministic and repeatable, not just luck.)

On the other hand, if this is a fluke, I'm inclined to think that it's not the safe code which is freakishly fast, but the unsafe code which is needlessly slow. That is, something about my program is hindering optimization of the unsafe code. What is it?

Attached are the profiling results and a test program with a handful of matrix multiplication and Gaussian elimination functions that illustrate what I've seen. This happens on both AMD64 and Intel Core architectures.

Thanks for any insight,
Scott
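Scott's attached test program is not reproduced in the archive. As a rough sketch of the kind of comparison being described, assuming a 3x3 matrix stored row-major in an IOUArray (the names matMulSafe/matMulUnsafe and the Mat alias are illustrative, not from the attachment), the two inner loops would differ only in the read/write calls:

  import Control.Monad (forM_, foldM)
  import Data.Array.IO (IOUArray, readArray, writeArray)
  import Data.Array.Base (unsafeRead, unsafeWrite)

  -- 3x3 matrix, row-major, indices 0..8 (e.g. built with newListArray (0,8))
  type Mat = IOUArray Int Double

  -- c := a * b, using the bounds-checked readArray/writeArray
  matMulSafe :: Mat -> Mat -> Mat -> IO ()
  matMulSafe a b c =
    forM_ [0..2] $ \i ->
      forM_ [0..2] $ \j -> do
        s <- foldM (\acc k -> do x <- readArray a (3*i + k)
                                 y <- readArray b (3*k + j)
                                 return (acc + x*y))
                   0 [0..2]
        writeArray c (3*i + j) s

  -- the same loop, with the unchecked unsafeRead/unsafeWrite
  matMulUnsafe :: Mat -> Mat -> Mat -> IO ()
  matMulUnsafe a b c =
    forM_ [0..2] $ \i ->
      forM_ [0..2] $ \j -> do
        s <- foldM (\acc k -> do x <- unsafeRead a (3*i + k)
                                 y <- unsafeRead b (3*k + j)
                                 return (acc + x*y))
                   0 [0..2]
        unsafeWrite c (3*i + j) s

Since the unsafe version simply omits the bounds check, one would expect it to be at least as fast, which is what makes the Gaussian elimination result surprising.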

That is indeed odd. As I recall, unsafeRead doesn't do array-bounds checking, and that ought to be faster, always. I don't have time to look into this myself (soon, anyway), but:

a) if someone does (look at the Core!), I'd be interested to know the outcome;

b) unless it turns out to be a red herring, would you like to add a Trac bug report (we really ought to have a category for "performance bug") giving the smallest program you can that demonstrates the unexpected behaviour?

Simon
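"Look at the Core" refers to GHC's intermediate language after optimisation. One common way to dump it (the module name MatMul.hs is just a placeholder) is:

  ghc -O2 -ddump-simpl MatMul.hs > MatMul.core

Comparing the simplified Core for the readArray and unsafeRead versions should show whether the bounds check is really the only difference, or whether some unboxing or inlining is being lost in the unsafe version.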

The functions gaussElimSafe and gaussElimUnsafe differ in their outputs (*). I guess you made a mistake during conversion from one version to the other.

BTW, the attached file contains an outline of a pure, functional, index-free implementation. It calculates the inverse of a square matrix in time O(n^3), so maybe you can get rid of all these exclamation marks and IO stuff...

BR,

(*) Try:

  newStdGen >>= \ g ->
    flip mapM_ [gaussElimSafe, gaussElimUnsafe] $ \ f ->
      makeMatrix g >>= \ m ->
        printMatrix m >> f m >> printMatrix m

-- Mirko Rahn -- Tel +49-721 608 7504 --- http://liinwww.ira.uka.de/~rahn/ ---
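Mirko's attachment does not survive in the archive. As a rough sketch of the pure, list-based style he describes (the names inverseLL and identity, and the 1e-12 pivot threshold, are illustrative and not taken from his file), a Gauss-Jordan inversion over lists of rows might look like this:

  import Control.Monad (foldM)

  type Mat = [[Double]]

  identity :: Int -> Mat
  identity n = [ [ if i == j then 1 else 0 | j <- [1..n] ] | i <- [1..n] ]

  -- Run Gauss-Jordan on the augmented matrix [A | I]; the right half of
  -- the result is A's inverse.  O(n^3).  Nothing if a pivot is (near) zero.
  inverseLL :: Mat -> Maybe Mat
  inverseLL a = fmap (map (drop n)) (foldM eliminate aug [0 .. n-1])
    where
      n   = length a
      aug = zipWith (++) a (identity n)

      -- Clear column c: normalise the pivot row, subtract multiples of it
      -- from every other row, and put it back at position c.
      eliminate rows c =
        case pick c rows of
          Nothing            -> Nothing
          Just (pivot, rest) ->
            let pivot'  = map (/ (pivot !! c)) pivot
                clear r = zipWith (\p x -> x - (r !! c) * p) pivot' r
                (above, below) = splitAt c (map clear rest)
            in  Just (above ++ [pivot'] ++ below)

      -- Partial pivoting: among the not-yet-fixed rows, pick the one with
      -- the largest entry in column c; return it and the remaining rows.
      pick c rows
        | abs (best !! c) < 1e-12 = Nothing
        | otherwise               = Just (best, fixed ++ pre ++ post)
        where
          (fixed, candidates) = splitAt c rows
          i = snd (maximum (zip (map (\r -> abs (r !! c)) candidates) [0..]))
          (pre, best : post)  = splitAt i candidates

This avoids mutable arrays and explicit index arithmetic entirely, at the cost of (!!) lookups within each row; for 3x3 and 4x4 matrices those rows are tiny, so the overhead is small.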

You're right. I broke them when experimenting to see where the cycles
were going. I've fixed them, but the other pair of functions, named
gaussElim2, were correct and the issue remains that unsafeRead appears
slower than readArray.
I've included your implementation for comparison, as well as a simple
pure matrix multiplication function, but I'm having trouble tracking
them down in the profiling. I think all of their cycles are being
counted under main, but I'm not sure how to stop this. They certainly
"feel" faster when I sit there and watch the program run, which is
great.
Why would lists-of-lists be faster than unboxed arrays? No indexing
arithmetic? Deforestation? I'm very curious. The first advice I got
from #haskell when trying to speed up my original code was "get rid of
all those lists."
Scott

Hello Scott, Wednesday, May 30, 2007, 1:42:22 AM, you wrote:
I've included your implementation for comparison, as well as a simple pure matrix multiplication function, but I'm having trouble tracking them down in the profiling. I think all of their cycles are being counted under main, but I'm not sure how to stop this.
This may be due to inlining. If you stop inlining, things will start to be counted properly, but they will also perform much slower :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
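One way to make a small pure function show up under its own cost centre, rather than being folded into main, is to block its inlining and give it an explicit SCC annotation. This is a generic sketch, not taken from Scott's test program:

  import Data.List (transpose)

  {-# NOINLINE matMulPure #-}   -- keep it out of its callers, so the
                                -- profiler can attribute time to it
  matMulPure :: [[Double]] -> [[Double]] -> [[Double]]
  matMulPure a b =
    {-# SCC "matMulPure" #-}
    [ [ sum (zipWith (*) row col) | col <- transpose b ] | row <- a ]

As Bulat says, the pragma itself can change the numbers being measured, since it blocks the very inlining that makes the code fast in the unprofiled build.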

Why would lists-of-lists be faster than unboxed arrays? No indexing arithmetic? Deforestation? I'm very curious. The first advice I got from #haskell when trying to speed up my original code was "get rid of all those lists."
First, I think lists-of-lists are only faster (if at all) in your special case of having only very small matrices. Moreover, the pure implementation runs without the 'near zero' tests, and the array implementation is not as smart as it could be. For example, instead of

  f j1
    where
      f j | j > n = return ()
      f j         = work_on j >> f (j+1)

you could simply write

  mapM_ work_on [j1..n]

and save some comparisons.

Second, you clearly have to "get rid of all those lists" as long as you are trying to implement the algorithm in the usual algebra-book style, whose formulations are suited to mutable-array implementations. The pure implementation instead tries to exploit the recursive structure and some invariants of Gauss's algorithm in a direct way.

BR,

-- Mirko Rahn -- Tel +49-721 608 7504 --- http://liinwww.ira.uka.de/~rahn/ ---

On Tue, May 29, 2007 at 02:42:22PM -0700, Scott Dillard wrote:
You're right. I broke them when experimenting to see where the cycles were going. I've fixed them, but the other pair of functions, named gaussElim2, were correct and the issue remains that unsafeRead appears slower than readArray.
Separating the functions into different executables and compiling with GHC 6.6 without profiling, the unsafe one is definitely faster:

  ./safe    45.73s user  0.06s system  96% cpu  47.238 total
  ./unsafe   6.65s user  0.02s system  91% cpu   7.295 total

Thanks
Ian
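The exact build setup isn't shown in the thread; a plausible way to reproduce this kind of comparison (the file names are placeholders) is to give each variant its own executable and leave profiling off, since cost-centre instrumentation can itself inhibit the inlining and unboxing being measured:

  ghc -O2 --make GaussSafe.hs   -o safe
  ghc -O2 --make GaussUnsafe.hs -o unsafe
  time ./safe
  time ./unsafe

This matches the earlier suspicion that the original anomaly was an artifact of the profiled build rather than a genuine property of readArray versus unsafeRead.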
participants (5):
- Bulat Ziganshin
- Ian Lynagh
- Mirko Rahn
- Scott Dillard
- Simon Peyton-Jones