
jsnow:
David Roundy wrote:
On Fri, Apr 18, 2008 at 02:09:28PM -0700, Jim Snow wrote:
On a particular scene with one instance of the single-threaded renderer running, it takes about 19 seconds to render an image. With two instances running, they each take about 23 seconds. This is on an Athlon-64 3800+ dual core, with 512kB of L2 cache per core. So, it seems my memory really is slowing things down noticeably.
This doesn't mean there's no hope, it just means that you'll need to be extra-clever if you're to get a speedup that is close to optimal. The key to overcoming memory bandwidth issues is to think about cache use and figure out how to improve it. For instance, O(N^3) matrix multiplication can be done in close to O(N^2) time provided it's memory-limited, by blocking memory accesses so that you access less memory at once.
In the case of ray-tracing I've little idea where or how you could improve memory access patterns, but this is what you should be thinking about (if you want to optimize this code). Of course, improving overall scaling is best (e.g. avoiding using lists if you need random access). Next I'd ask if there are more efficent or compact data structures that you could be using. If your program uses less memory, a greater fraction of that memory will fit into cache. Switching to stricter data structures and turning on -funbox-strict-fields (or whatever it's called) may help localize your memory access. Even better if you could manage to use unboxed arrays, so that your memory accesses really would be localized (assuming that you actually do have localize memory-access patterns).
Of course, also ask yourself how much memory your program is using in total. If it's not much more than 512kB, for instance, we may have misdiagnosed your problem.
Interestingly, switching between "Float" and "Double" doesn't make any noticeable difference in speed (though I see more rendering artifacts with Float). Transformation matrices are memory hogs, at 24 floats each (a 4x4 matrix and its inverse with the bottom rows omitted (they're always 0 0 0 1)). This may be one reason why many real-time ray tracers just stick with triangles; a triangle can be transformed by transforming its verticies, and then you can throw the matrix away.
The only differences I'd expect to see here would be with -fvia-C -fexcess-precision -O2 -optc-O2 which might trigger some SSE stuff from the C compiler