Memory efficiency questions for real-time graphics

As my first Haskell exposure, I've been working through Real World Haskell. I am considering converting some of my C++ graphics libraries to Haskell. I've done a fair amount of googling on the subject, however, I haven't quite been able to find clear answers to some of the following issues.

(1) Using an OpenGL vertex array (a contiguous chunk of memory which is handed to the graphics card) is important. I see the source code of Frag does this, so it looks like we're good. Check.

(2) In-place modification of the vertex array is important. Every vertex changes on each frame update. And we always want more vertices.

(3) Growing and shrinking the vertex array efficiently is important. Here the behavior of C++ std::vector happens to be ideal. When growing, extend the array in-place if possible (using reserved space if any), otherwise allocate a new chunk ("amortized constant time"). When shrinking, do nothing but record the smaller size; the unused memory chunk is reserved for possible future growth.

(4) Doing little to no allocation on each frame update is important. In the current C++ version, zero allocations occur during a normal frame update. When the vertex array changes, that is the only allocation which happens.

To give a context for all of this, I am applying a non-linear transformation to an object on every frame. (Note: non-linear, so a matrix transform will not suffice.) The object is described by a certain function and is generated from a 3D domain grid. The user can change the resolution of the grid on the fly, while the object is moving. (Hence the need for grow/shrink efficiency.)

Given that (1) is out of the way, what's the best I can expect from Haskell concerning (2)-(4)?

Thanks, T. Willingham

t.r.willingham:
As my first Haskell exposure, I've been working through Real World Haskell.
I am considering converting some of my C++ graphics libraries to Haskell. I've done a fair amount of googling on the subject, however, I haven't quite been able to find clear answers to some of the following issues.
(1) Using an OpenGL vertex array (a contiguous chunk of memory which is handed to the graphics card) is important. I see the source code of Frag does this, so it looks like we're good. Check.
(2) In-place modification of the vertex array is important. Every vertex changes on each frame update. And we always want more vertices.
I imagine these are mutable arrays, so that's fine.
(3) Growing and shrinking the vertex array efficiently is important. Here the behavior of C++ std::vector happens to be ideal. When growing, extend the array in-place if possible (using reserved space if any), otherwise allocate a new chunk ("amortized constant time"). When shrinking, do nothing but record the smaller size; the unused memory chunk is reserved for possible future growth.
Easy enough. You're working with raw arrays of memory, I assume (e.g. UArrays or Ptr a)?
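To sketch what that std::vector-style bookkeeping from (3) might look like over a raw Ptr (all names below, VertexBuffer, newVertexBuffer and resize, are invented for illustration, not taken from any library):

    import Foreign.Marshal.Alloc (mallocBytes, reallocBytes)
    import Foreign.Ptr (Ptr)
    import Data.IORef

    -- Hypothetical growable buffer mirroring std::vector: raw contiguous
    -- storage plus separate size and capacity counters.
    data VertexBuffer = VertexBuffer
      { vbData :: IORef (Ptr Float)   -- contiguous storage handed to OpenGL
      , vbSize :: IORef Int           -- vertices in use
      , vbCap  :: IORef Int           -- vertices allocated
      }

    bytesPerVertex :: Int
    bytesPerVertex = 3 * 4            -- x,y,z as 4-byte Floats

    newVertexBuffer :: Int -> IO VertexBuffer
    newVertexBuffer cap = do
      p  <- mallocBytes (cap * bytesPerVertex)
      dr <- newIORef p
      sr <- newIORef 0
      cr <- newIORef cap
      return (VertexBuffer dr sr cr)

    -- Resize to n vertices. Shrinking just records the smaller size;
    -- growing reallocates (doubling) only when the reserved capacity is
    -- exhausted, giving the same amortised behaviour as std::vector.
    resize :: VertexBuffer -> Int -> IO ()
    resize vb n = do
      cap <- readIORef (vbCap vb)
      if n <= cap
        then writeIORef (vbSize vb) n
        else do
          let cap' = max n (2 * cap)
          p  <- readIORef (vbData vb)
          p' <- reallocBytes p (cap' * bytesPerVertex)
          writeIORef (vbData vb) p'
          writeIORef (vbCap vb) cap'
          writeIORef (vbSize vb) n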
(4) Doing little to no allocations each frame update is important. In the current C++ version, zero allocations occur during a normal frame update. When the vertex array changes, that is the only allocation which happens.
This is certainly possible. To confirm it, look at the Core code generated by GHC (e.g. with the ghc-core tool), and see the Real World Haskell chapter on optimisation.
To give a context for all of this, I am applying a non-linear transformation to an object on every frame. (Note: non-linear, so a matrix transform will not suffice.)
The object is described by a certain function, and is generated from a 3D domain grid. The user can change the resolution of the grid on the fly, while the object is moving. (Hence the need for grow/shrink efficiency.)
Given that (1) is out of the way, what's the best I can expect from Haskell concerning (2)-(4)?
Seems fine. You'll be working at a low level, with strict, mutable, unboxed data structures, but that's fine: the machine loves them. -- Don
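For (2), the sort of in-place sweep Don means might look like this with a mutable unboxed array (updateVertices is an invented name, not a library function):

    import Data.Array.IO

    -- Overwrite every element of a mutable unboxed array in place;
    -- no intermediate list or copy of the array is built, so a
    -- steady-state frame allocates next to nothing under -O2.
    updateVertices :: (Float -> Float) -> IOUArray Int Float -> IO ()
    updateVertices f arr = do
      (lo, hi) <- getBounds arr
      let go i | i > hi    = return ()
               | otherwise = do
                   x <- readArray arr i
                   writeArray arr i (f x)
                   go (i + 1)
      go lo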

On Mon, Oct 27, 2008 at 10:07 PM, Don Stewart wrote:
Seems fine. You'll be working at a low level, with strict, mutable, unboxed data structures, but that's fine: the machine loves them.
Thanks for the quick reply. One last question -- is it at all possible to segfault with strict, mutable, unboxed structures? I don't quite understand how it knows not to overwrite or underwrite. Cheers, T. Willingham

t.r.willingham:
On Mon, Oct 27, 2008 at 10:07 PM, Don Stewart wrote:
Seems fine. You'll be working at a low level, with strict, mutable, unboxed data structures, but that's fine: the machine loves them.
Thanks for the quick reply. One last question -- is it at all possible to segfault with strict, mutable, unboxed structures? I don't quite understand how it knows not to overwrite or underwrite.
It depends on the operations (safe indexing or unsafe indexing). Being strict or unboxed doesn't determine the safety. So be careful if you have a 'Ptr a' or start using unsafeWrite. -- Don
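For example, with a mutable unboxed array, both flavours sit side by side (poke5 is an invented demo name; unsafeRead/unsafeWrite come from Data.Array.Base):

    import Data.Array.IO
    import Data.Array.Base (unsafeRead, unsafeWrite)

    -- The safe operations check bounds and throw a Haskell exception
    -- on a bad index; the unsafe ones skip the check, so an
    -- out-of-range index can corrupt memory or segfault, as in C.
    poke5 :: IOUArray Int Float -> IO ()
    poke5 arr = do
      writeArray arr 5 1.0        -- bounds-checked write
      x <- readArray arr 5        -- bounds-checked read
      unsafeWrite arr 5 (x + 1)   -- unchecked write
      y <- unsafeRead arr 5       -- unchecked read
      print y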

On Mon, Oct 27, 2008 at 11:04 PM, Don Stewart wrote:
It depends on the operations (safe indexing or unsafe indexing). Being strict or unboxed doesn't determine the safety.
OK, that makes sense. This is a huge load off my conscience. I can now dig into Real World Haskell without hesitation, knowing that what I want in the end is actually possible. Thanks again, TW

By the way, T, feel free to lean on me if you run into any problems. I did something along the lines of what you were describing some time ago; my particular non-linear transform was converting a vertex array to/from polar coordinates and updating it in real time.
-- Jeff
On Tue, Oct 28, 2008 at 12:00 AM, T Willingham wrote:
On Mon, Oct 27, 2008 at 11:04 PM, Don Stewart wrote:
It depends on the operations (safe indexing or unsafe indexing). Being strict or unboxed doesn't determine the safety.
OK, that makes sense.
This is a huge load off my conscience. I can now dig into Real World Haskell without hesitation, knowing that what I want in the end is actually possible.
Thanks again, TW
-- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards

2008/10/28 T Willingham wrote:
As my first Haskell exposure, I've been working through Real World Haskell.
I am considering converting some of my C++ graphics libraries to Haskell. I've done a fair amount of googling on the subject, however, I haven't quite been able to find clear answers to some of the following issues.
(1) Using an OpenGL vertex array (a contiguous chunk of memory which is handed to the graphics card) is important. I see the source code of Frag does this, so it looks like we're good. Check.
(2) In-place modification of the vertex array is important. Every vertex changes on each frame update. And we always want more vertices.
(3) Growing and shrinking the vertex array efficiently is important. Here the behavior of C++ std::vector happens to be ideal. When growing, extend the array in-place if possible (using reserved space if any), otherwise allocate a new chunk ("amortized constant time"). When shrinking, do nothing but record the smaller size; the unused memory chunk is reserved for possible future growth.
(4) Doing little to no allocations each frame update is important. In the current C++ version, zero allocations occur during a normal frame update. When the vertex array changes, that is the only allocation which happens.
To give a context for all of this, I am applying a non-linear transformation to an object on every frame. (Note: non-linear, so a matrix transform will not suffice.)
Any reason why you cannot do this in the vertex shader? You really should avoid touching the vertices with the CPU if at all possible. -- Sebastian Sylvan

On Tue, Oct 28, 2008 at 3:24 PM, Sebastian Sylvan wrote:
2008/10/28 T Willingham wrote:
To give a context for all of this, I am applying a non-linear transformation to an object on every frame. (Note: non-linear, so a matrix transform will not suffice.)
Any reason why you can not do this in the vertex shader? You really should avoid trying to touch the vertices with the CPU if at all possible.
The per-vertex computation is a quite complex time-dependent function applied to the given domain on each update. Yet even if it were simple, I would still first implement the formulas in Haskell and leave the optimization for later, if at all. The current C++ implementation, which uses userland-memory vertex arrays, already performs very well.
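To sketch the shape of such a per-frame pass (warpVertices and its ripple formula are invented for illustration; a real version would re-evaluate the object's own formula over the domain grid rather than compound the warp in place):

    import Foreign.Ptr (Ptr)
    import Foreign.Storable (peekElemOff, pokeElemOff)
    import Control.Monad (forM_)

    -- Hypothetical time-dependent, non-linear warp written straight
    -- into an interleaved x,y,z userland vertex array of n vertices.
    warpVertices :: Float -> Ptr Float -> Int -> IO ()
    warpVertices t buf n =
      forM_ [0 .. n - 1] $ \i -> do
        let o = i * 3
        x <- peekElemOff buf o
        y <- peekElemOff buf (o + 1)
        z <- peekElemOff buf (o + 2)
        let r = sqrt (x*x + y*y + z*z)
            s = 1 + 0.2 * sin (4 * r - t)   -- made-up radial ripple
        pokeElemOff buf o       (x * s)
        pokeElemOff buf (o + 1) (y * s)
        pokeElemOff buf (o + 2) (z * s)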
On Sat, Nov 1, 2008 at 3:15 AM, Neal Alexander wrote:
Even when generating one or more copies of "world" per frame the performance stays fine and allocations are minimal.
Who says? That may be your particular experience from your particular tests. In my case, any copy of the "world" on each frame would have a catastrophic effect on the framerate, for any such definition of "world".
From what I've seen, the OpenGL calls are what's going to bottleneck.
Yes, that may as well be a tautology. The problem is sporadic lag and jittering from occasional large allocations and/or garbage collection from frequent small allocations. It's a unique situation where even the best profiling cannot pinpoint what is blatantly obvious to the human eye. Though the profiler may register it as 0.01%, the actual effect is glitchy behavior which comes off as unprofessional.
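(For what it's worth, one way to see those pauses directly is GHC's runtime statistics: time-percentage profiles hide pause spikes, but the RTS summary reports mutator and GC time separately. Assuming a binary called engine:)

    ./engine +RTS -sstderr

On exit, -sstderr prints total bytes allocated and the split between time spent in the program and time spent in the garbage collector.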

T Willingham wrote:
On Sat, Nov 1, 2008 at 3:15 AM, Neal Alexander wrote:
Even when generating one or more copies of "world" per frame the performance stays fine and allocations are minimal.
Who says? That may be your particular experience from your particular tests. In my case, any copy of the "world" on each frame would have a catastrophic effect on the framerate, for any such definition of "world".
Yea, this is just from my recent experience messing with a game engine in Haskell - I'm only a few months into it though. So far, the GC hasn't been anywhere close to having a problem keeping up with the monitor's refresh rate, even with several world states being folded into a frame. The memory usage is pretty flat too, at least with GLFW (the GLUT bindings had some issues there, IIRC).

The test is pulling a pretty constant 1500 fps on this machine with the background + 1 character running around and resources being streamed in lazily. Not that that means much, but at the very least a GC run isn't noticeable on the current data set. Initially I expected this setup to perform badly, but I tried it anyway out of curiosity. We'll see how it goes with full sets of data later, I guess, heh.

wqeqweuqy:
T Willingham wrote:
On Sat, Nov 1, 2008 at 3:15 AM, Neal Alexander wrote:
Even when generating one or more copies of "world" per frame the performance stays fine and allocations are minimal.
Who says? That may be your particular experience from your particular tests. In my case, any copy of the "world" on each frame would have a catastrophic effect on the framerate, for any such definition of "world".
Yea, this is just from my recent experience messing with a game engine in Haskell - I'm only a few months into it though.
So far, the GC hasn't been anywhere close to having a problem keeping up with the monitor's refresh rate, even with several world states being folded into a frame.
The memory usage is pretty flat too, at least with GLFW (the GLUT bindings had some issues there, IIRC).
The test is pulling a pretty constant 1500 fps on this machine with the background + 1 character running around and resources being streamed in lazily. Not that that means much, but at the very least a GC run isn't noticeable on the current data set.
Initially I expected this setup to perform badly, but I tried it anyway out of curiosity. We'll see how it goes with full sets of data later, I guess, heh.
And, just to double check, you're compiling with a modern GHC, using, say, -O2 -fvia-C -optc-O2 -funbox-strict-fields? -- Don
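(As a concrete invocation of those flags, with Main.hs standing in for the real module:)

    ghc --make -O2 -fvia-C -optc-O2 -funbox-strict-fields Main.hs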

Don Stewart wrote:
wqeqweuqy:
T Willingham wrote:
On Sat, Nov 1, 2008 at 3:15 AM, Neal Alexander wrote:
Even when generating one or more copies of "world" per frame the performance stays fine and allocations are minimal.
Who says? That may be your particular experience from your particular tests. In my case, any copy of the "world" on each frame would have a catastrophic effect on the framerate, for any such definition of "world".
Yea, this is just from my recent experience messing with a game engine in Haskell - I'm only a few months into it though.
So far, the GC hasn't been anywhere close to having a problem keeping up with the monitor's refresh rate, even with several world states being folded into a frame.
The memory usage is pretty flat too, at least with GLFW (the GLUT bindings had some issues there, IIRC).
The test is pulling a pretty constant 1500 fps on this machine with the background + 1 character running around and resources being streamed in lazily. Not that that means much, but at the very least a GC run isn't noticeable on the current data set.
Initially I expected this setup to perform badly, but I tried it anyway out of curiosity. We'll see how it goes with full sets of data later, I guess, heh.
And, just to double check, you're compiling with a modern GHC, using, say, -O2 -fvia-C -optc-O2 -funbox-strict-fields?
-- Don

Yea, the optimization flags don't have any real effect on the FPS atm. The engine is bottlenecked by the OGL-specific stuff.
The way it looks now I don't think there's going to be any performance problems - not that this is trying to be a bleeding-edge 3D engine, but whatever. Tree data structures seem to adapt pretty well to being recreated each frame, though I'm not sure how smart the GC is with regard to that. Using matrices the same way performed really poorly; replacing the matrix with a Region-QuadTrie caused a huge performance boost. IIRC the matrix stuff was using something like 30% CPU according to the profile data, and the tree-based approach used basically 0%, heh.
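That matches how persistent trees behave: "recreating" one per frame really only allocates the path down to the changed leaves, while the untouched subtrees are shared with the previous frame's tree, so the GC mostly sees small, short-lived garbage. A toy sketch (this QuadTree is a made-up stand-in, not the Region-QuadTrie mentioned above):

    -- Minimal persistent quadtree: updating along one path allocates a
    -- handful of new nodes; the sibling quadrants at every level are
    -- shared intact with the previous frame's tree.
    data QuadTree a
      = Leaf a
      | Node (QuadTree a) (QuadTree a) (QuadTree a) (QuadTree a)

    data Quadrant = NW | NE | SW | SE

    update :: [Quadrant] -> (a -> a) -> QuadTree a -> QuadTree a
    update []       f (Leaf x)           = Leaf (f x)
    update (q : qs) f (Node nw ne sw se) = case q of
      NW -> Node (update qs f nw) ne sw se
      NE -> Node nw (update qs f ne) sw se
      SW -> Node nw ne (update qs f sw) se
      SE -> Node nw ne sw (update qs f se)
    update _ _ t = t   -- path/shape mismatch: leave the tree unchanged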

Even when generating one or more copies of "world" per frame the performance stays fine and allocations are minimal. From what I've seen, the OpenGL calls are what's going to bottleneck.

    loop (time, space)
      where
        loop   = loop <=< runKleisli action
        action = (ChronoSync.sync *** syncExternal channel)
             >>> Space.update
             >>> display
             >>> ChronoSync.yieldCPU
participants (5)
- Don Stewart
- Jefferson Heard
- Neal Alexander
- Sebastian Sylvan
- T Willingham