announce: Glome.hs raytracer

I have recently posted a Haskell port of my OCaml raytracer, Glome: http://syn.cs.pdx.edu/~jsnow/glome/

It supports spheres and triangles as base primitives, and is able to parse files in the NFF format used by the Standard Procedural Database (http://tog.acm.org/resources/SPD/). It uses a bounding interval hierarchy acceleration structure, so it can render fairly complicated scenes in a reasonable amount of time. Shadows and reflections are supported, but not specular highlights or refraction.

It's still slower than the OCaml version, but at least they're in the same ballpark (and a good part of that difference may be inefficiencies in my BIH traversal). I would welcome any advice on making it go faster or use less memory.

I compile the program with "ghc Glome.hs --make -fasm -O2 -threaded -fglasgow-exts -funbox-strict-fields -fbang-patterns -fexcess-precision -optc-ffast-math -optc-O2 -optc-mfpmath=sse -optc-msse2". (I assume the -optc options don't do anything unless you compile via C.)

Here are some of my current concerns:

-Multi-core parallelism is working, but not as well as I'd expect: I get about a 25% reduction in runtime on two cores rather than 50%. I split the default screen size of 512x512 into 16 blocks, and run "parMap" on those blocks with a function that turns the screen coordinates of that block into a list of (x,y,r,g,b) tuples that get drawn as pixels to the screen through OpenGL by the original thread. (See the sketch just after this message.)

-Memory consumption is atrocious: 146 megs to render a scene that's a 33k ASCII file. Where does it all go? A heap profile reports the max heap size at a rather more reasonable 500k or so. (My architecture is 64-bit Ubuntu on a dual-core AMD.)

-Collecting rendering stats is not easy without global variables. It occurs to me that it would be neat if there were some sort of write-only global variables that can be incremented by pure code but can only be read from within monadic code; that would be sufficient to ensure that the pure code wasn't affected by the values. The sorts of things I'm looking for are the number of calls to "trace" per image, and the number of BIH branches traversed and ray/triangle and ray/sphere intersections per pixel. (Disclaimer: I don't really fully understand monads, so I may be oblivious to an obvious solution.)

-Is there a fast way to cast between Float and Double? I'm using Float currently, and the only reason is that that's what the OpenGL API expects. I'd like to be able to use either representation, but the only way to cast that I've found so far is "float_conv x = fromRational (toRational x)", which is too slow.

thanks, -jim
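[A minimal sketch of the block-splitting scheme described in the first concern. The names (renderPixel, renderBand) and the band layout are hypothetical, not Glome's actual code, and parMap rdeepseq is the modern parallel-package spelling of the thread's older parMap rnf.]

import Control.Parallel.Strategies (parMap, rdeepseq)

-- Hypothetical stand-in for the real per-pixel tracer.
renderPixel :: Int -> Int -> (Int, Int, Float, Float, Float)
renderPixel x y = (x, y, 0.5, 0.5, 0.5)

-- Split a 512x512 image into 16 bands of 32 rows and render the bands
-- in parallel; the original thread then draws the resulting (x,y,r,g,b)
-- tuples via OpenGL.
renderBlocks :: [[(Int, Int, Float, Float, Float)]]
renderBlocks = parMap rdeepseq renderBand [0 .. 15]
  where
    renderBand b = [ renderPixel x y | y <- [b * 32 .. b * 32 + 31]
                                     , x <- [0 .. 511] ]

main :: IO ()
main = print (length (concat renderBlocks))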

jsnow:
[Jim's announcement, compile flags, and the parallelism and memory questions quoted in full; trimmed here, see the original post above.]
-Collecting rendering stats is not easy without global variables. It occurs to me that it would be neat if there were some sort of write-only global variables that can be incremented by pure code but can only be read from within monadic code; that would be sufficient to ensure that the pure code wasn't affected by the values. The sorts of things I'm looking for are the number of calls to "trace" per image, the number of BIH branches traversed and ray/triangle and ray/sphere intersections per pixel. (Disclaimer: I don't really fully understand monads, so I may be oblivious to an obvious solution.)
Use a Writer monad to log statistics? Maybe a State monad.
-Is there a fast way to cast between Float and Double? I'm using Float currently, and the only reason is because that's what the OpenGL api expects. I'd like to be able to use either representation, but the only way to cast that I've found so far is "float_conv x = fromRational(toRational x)", which is too slow.
I'd try realToFrac, which should be pretty much optimised away. With doubles, ensure you use -fexcess-precision. -- Don
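[A minimal sketch of Don's Writer suggestion, with toy types; traceRay and the (Sum Int) log are illustrative, not Glome's code. The log is only observable where runWriter is called, which is roughly the write-only behaviour Jim asks for.]

import Control.Monad.Writer (Writer, runWriter, tell)
import Data.Monoid (Sum(..))

-- Toy stand-in for the real tracer: returns a "colour" and logs one
-- intersection test per call.
traceRay :: Double -> Writer (Sum Int) Double
traceRay d = do
  tell (Sum 1)      -- count one call
  return (d * 0.5)  -- stand-in for the real shading result

main :: IO ()
main = do
  let (colours, Sum calls) = runWriter (mapM traceRay [1 .. 10])
  print colours
  putStrLn (show calls ++ " trace calls")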

On Wed, 2008-03-26 at 14:45 -0700, Don Stewart wrote:
[Don's full reply quoted; trimmed here, see above. The relevant part:]
With doubles, ensure you use -fexcess-precision
Unless something has changed, you also want to be compiling with -fvia-C if you are going to be doing floating point intensive computations.
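[For reference, a sketch of the realToFrac route Don suggests. float2Double and double2Float are GHC-specific conversions in GHC.Float, which GHC's realToFrac rewrite rules target under -O; the function names below are illustrative.]

import GHC.Float (float2Double, double2Float)

-- Portable spelling: with optimisation, GHC's rewrite rules should turn
-- this into float2Double rather than going through Rational.
floatToDouble :: Float -> Double
floatToDouble = realToFrac

-- GHC-specific direct conversions, as a fallback if the rules don't fire.
floatToDouble' :: Float -> Double
floatToDouble' = float2Double

doubleToFloat :: Double -> Float
doubleToFloat = double2Float

main :: IO ()
main = print (floatToDouble 1.5, floatToDouble' 1.5, doubleToFloat 2.5)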

On Wed, Mar 26, 2008 at 2:33 PM, Jim Snow wrote:
-Memory consumption is atrocious: 146 megs to render a scene that's a 33k ascii file. Where does it all go? A heap profile reports the max heap size at a rather more reasonable 500k or so. (My architecture is 64 bit ubuntu on a dual-core amd.)
Try retainer profiling to see who's holding on to memory. To track down a suspect, add SCC annotations and filter the retainer profile to just those annotations (-hC option). Heap profiling is documented at http://haskell.org/ghc/docs/latest/html/users_guide/prof-heap.html Justin
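[A sketch of the SCC workflow Justin describes, with a hypothetical cost centre name; -auto-all is the GHC 6.x flag (newer GHCs spell it -fprof-auto).]

-- Build with profiling, then restrict the retainer profile (-hr) to the
-- "traceRay" cost centre with -hC, and render the graph with hp2ps:
--
--   ghc -prof -auto-all --make Glome.hs
--   ./Glome +RTS -hr -hCtraceRay -RTS
--   hp2ps -c Glome.hp

traceRay :: Double -> Double
traceRay x = {-# SCC "traceRay" #-} x * x + 1  -- annotate the suspect expression

main :: IO ()
main = print (sum (map traceRay [1 .. 100000]))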

Hello Jim, Thursday, March 27, 2008, 12:33:20 AM, you wrote:
-Multi-core parallelism is working, but not as well as I'd expect: I get about a 25% reduction in runtime on two cores rather than 50%. I split
this may be an effect of limited memory bandwidth
-Memory consumption is atrocious: 146 megs to render a scene that's a
standard answer: ByteString
-Collecting rendering stats is not easy without global variables. It occurs to me that it would be neat if there were some sort of write-only global variables that can be incremented by pure code but can only be read from within monadic code; that would be sufficient to ensure that the pure code wasn't affected by the values.
the code is called *pure* exactly because it has no side effects, and the compiler may choose either to call a function twice or to reuse the already-computed result. Actually, you can get side effects with unsafePerformIO, but there are no guarantees of how many times such code will be executed. Try this:

plus a b = unsafePerformIO (modifyIORef counter (+1)) `seq` a+b

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
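[Bulat's snippet leaves counter undefined. A self-contained version of the idiom uses a top-level IORef created with unsafePerformIO; the NOINLINE pragma matters, since otherwise GHC may duplicate the definition and create several counters. As Bulat says, the count carries no guarantees under optimisation.]

import Data.IORef (IORef, newIORef, modifyIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- Top-level mutable counter; NOINLINE keeps it a single shared cell.
counter :: IORef Int
counter = unsafePerformIO (newIORef 0)
{-# NOINLINE counter #-}

plus :: Int -> Int -> Int
plus a b = unsafePerformIO (modifyIORef counter (+1)) `seq` a + b

main :: IO ()
main = do
  print (plus 1 2 + plus 3 4)
  n <- readIORef counter
  putStrLn ("plus was evaluated " ++ show n ++ " times (no guarantees under -O)")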

On Thu, Mar 27, 2008 at 01:09:47AM +0300, Bulat Ziganshin wrote:
[Jim's stats question and Bulat's explanation quoted; trimmed, see above.]
plus a b = unsafePerformIO (modifyIORef counter (+1)) `seq` a+b
This is exactly what he wants to do. The point of putting traces into the code is precisely to figure out how many times it is called. The only trouble is that unsafePerformIO (I believe) can inhibit optimizations, since there are certain transformations that ghc won't do to unsafePerformIO code. -- David Roundy Department of Physics Oregon State University

droundy:
[earlier messages quoted; trimmed, see above]
This is exactly what he wants to do. The point of putting traces into the code is precisely to figure out how many times it is called. The only trouble is that unsafePerformIO (I believe) can inhibit optimizations, since there are certain transformations that ghc won't do to unsafePerformIO code.
Could we just use -fhpc or profiling here? HPC at least will tell you how many times top-level things are called, and print pretty graphs about it.

On Wed, Mar 26, 2008 at 05:07:10PM -0700, Don Stewart wrote:
[earlier messages quoted; trimmed, see above]
Could we just use -fhpc or profiling here? HPC at least will tell you how many times top-level things are called, and print pretty graphs about it.
It depends what the point is. I've found traces to be very helpful at times when debugging (for instance, to get values as well as counts). Also, I imagine that manual tracing is likely to be far less invasive (if you do it somewhat discretely) than profiling or using hpc. -- David Roundy Department of Physics Oregon State University

David Roundy wrote:
[earlier messages quoted; trimmed, see above]
It depends what the point is. I've found traces to be very helpful at times when debugging (for instance, to get values as well as counts). Also, I imagine that manual tracing is likely to be far less invasive (if you do it somewhat discretely) than profiling or using hpc.
The unsafePerformIO looks like what I want. Profiling isn't really that helpful in this situation, since sometimes what you want is the number of times something gets called per ray, so you can add a bit to the color value of the corresponding pixel. Something like http://syn.cs.pdx.edu/~jsnow/glome/dragon-bih.png tells you a lot more about where your code is spending its time (the bright green places) than some numbers from a profiler. I could return the relevant stats as part of the standard results from ray-intersection tests, but I think that would clutter the code unnecessarily.

Thanks everyone for the advice; it'll keep me busy for a while. I converted over to doubles, and it seems to be about 10% faster with -fvia-C than regular floats with -fasm. (I'm using ghc 6.8.2, by the way, which seems to generate faster code than the 6.6.1 version I was using earlier, so maybe the difference between -fasm and -fvia-C isn't as significant as it used to be.)

I'm looking into using ByteString, but it doesn't seem compatible with "lex" and "reads". I should probably do more heap profiling before I get too carried away, though. -jim
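[On the ByteString point: ByteStrings have no direct "lex"/"reads" equivalents. One workaround (an assumption, not Glome's parser) is to tokenise with Data.ByteString.Char8 and convert each token; readInt exists for Ints, while floats need unpack plus read, or a real parsing library.]

import qualified Data.ByteString.Char8 as B

-- Tokenise a line of numbers and read each token; slower than a native
-- ByteString parser, but it avoids String I/O for the bulk of the file.
parseDoubles :: B.ByteString -> [Double]
parseDoubles = map (read . B.unpack) . B.words

main :: IO ()
main = print (parseDoubles (B.pack "0.5 1.0 -2.25"))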

Hello Andrew, Thursday, March 27, 2008, 12:27:47 PM, you wrote:
plus a b = unsafePerformIO (modifyIORef counter (+1)) `seq` a+b
Erm... might it be better to use an MVar? (To avoid lost updates if there are multiple render threads.)
You are right; an IORef is appropriate only for a single-threaded program. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
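[A sketch of the MVar variant Andrew suggests; modifyMVar_ holds the lock around the update, so concurrent increments are not lost. The names here are illustrative.]

import Control.Concurrent.MVar (MVar, newMVar, modifyMVar_, readMVar)

-- A thread-safe counter: each increment takes the MVar, so updates
-- from multiple render threads cannot be lost.
incr :: MVar Int -> IO ()
incr c = modifyMVar_ c (\n -> return (n + 1))

main :: IO ()
main = do
  c <- newMVar 0
  mapM_ (const (incr c)) [1 .. 10 :: Int]
  readMVar c >>= print  -- prints 10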

On Wed, Mar 26, 2008 at 02:33:20PM -0700, Jim Snow wrote:
-Memory consumption is atrocious: 146 megs to render a scene that's a 33k ascii file. Where does it all go? A heap profile reports the max heap size at a rather more reasonable 500k or so. (My architecture is 64 bit ubuntu on a dual-core amd.)
I haven't looked properly yet, but it looks like something is leaking memory that shouldn't be. The attached Gloom.hs uses constant memory, but if you replace the "map" with the commented out "(parMap rnf)" then the memory use seems to keep increasing, even once it has run display once and is running it a second or third time. Thanks Ian

On 27/03/2008, at 3:49, Ian Lynagh wrote:
[Jim's memory question quoted; trimmed, see above.]
I haven't looked properly yet, but it looks like something is leaking memory that shouldn't be. The attached Gloom.hs uses constant memory, but if you replace the "map" with the commented out "(parMap rnf)" then the memory use seems to keep increasing, even once it has run display once and is running it a second or third time.
On my system the leak only appears with +RTS -N1 (which is the default). If I use -N2 or higher, then your version runs in constant memory with (parMap rnf). Cheers pepe

pepe wrote:
[earlier messages quoted; trimmed, see above]
On my system the leak only appears with +RTS -N1 (which is the default). If I use -N2 or higher, then your version runs in constant memory with (parMap rnf).
Using Ian Lynagh's Gloom.hs (I'm not sure if that's a typo, but it's a convenient way to distinguish it from my original Glome.hs):
- With parMap and +RTS -N2: 59 megs total mapped memory, 18 megs resident, all three iterations.
- With parMap and +RTS -N1: 53/21, then 99/66, then 145/112 megs total/resident.
- With map and no RTS options: 37/4.8, all three iterations.
I'm using ghc 6.8.2 on 64-bit Ubuntu. -jim

On Thu, Mar 27, 2008 at 02:49:35AM +0000, Ian Lynagh wrote:
[Jim's memory question quoted; trimmed, see above.]
I haven't looked properly yet, but it looks like something is leaking memory that shouldn't be.
Bug filed here: http://hackage.haskell.org/trac/ghc/ticket/2185 Thanks Ian
participants (10):
- Andrew Coppin
- Bulat Ziganshin
- David Roundy
- Derek Elkins
- Don Stewart
- Henning Thielemann
- Ian Lynagh
- Jim Snow
- Justin Bailey
- pepe