OpenGL performance issue on OSX

I'm using OpenGLRaw and I'm getting 3 frames per second trying to draw 8000 triangles. When I profile the application, I see that almost all of the time is taken by a call to CGLFlushDrawable, which is apparently OSX's function for swapping the back buffer to the front buffer. Has anyone else run into a problem like this? I seem to recall a thread a while ago about a Haskell-specific OpenGL performance issue, but I can't find it now.

On May 20, 2014, at 12:20 AM, Michael Baker wrote:
I'm using OpenGLRaw and I'm getting 3 frames per second trying to draw 8000 triangles. [...]

Can you put the code somewhere so we can take a look? OpenGL offers 8000 ways to draw 8000 triangles.

Anthony
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mon, May 19, 2014 at 11:43 PM, Anthony Cowley wrote:
Can you put the code somewhere so we can take a look? OpenGL offers 8000 ways to draw 8000 triangles.
Here's the code: https://gist.github.com/MichaelBaker/4429c93f2aca04bc79bb. I think I have everything important in there. Line 53 is the one that causes the slowdown. Some things to note: "Triangle" is Storable, and the vector I'm creating on line 14 and writing to on line 47 is a mutable storable vector. Also, thanks Alp, I'll look into this when I get home, although I don't think my computer has more than one graphics card (it's a MacBook Air).

On Tue, May 20, 2014 at 8:36 AM, Michael Baker wrote:
Here's the code https://gist.github.com/MichaelBaker/4429c93f2aca04bc79bb. I think I have everything important in there. Line 53 is the one that causes the slowdown. [...]
I think I know what is causing the slowdown you're seeing, but first I want to point out some ways that you can get better help in the future. Perhaps you already know these things and were in a hurry or something like that. In that case, I still want to point out these tips anyway for the benefit of others :)

First: Thank you for posting the code, but please understand that I can't find withBasicWindow on Hoogle or Google, and there isn't a single import in that code. How does it work? What is triangles? It's hard for me to help people if I don't have the complete code or know which libraries they're using. I'm sure others are in a similar situation.

Second: OpenGL is a library/API for specifying the lighting, transformations, geometry, colors, raster effects, shaders, and that sort of thing. OpenGL doesn't say anything about making a window, putting pixels in the window, or interfacing with the OS. On the other hand, performance problems don't discriminate and can happen at any level. In other words, you'll get much better help if you provide runnable examples. Yes, that means more work for you, but I promise that being in the habit of making minimal, reproducible test cases will hugely improve your (or anyone's) skills as a software engineer. Furthermore, because I'm not testing with your code, anything I suggest should be considered speculation.

Third: In terms of graphics performance, most people are accustomed to talking about FPS, but FPS is hard to work with. Think about this: if you can render one effect at 60 FPS and you add another effect that renders at 30 FPS, what should the resulting FPS be? Instead, set a budget for rendering time by taking 1/(desired FPS) (60 FPS = 16.67 ms) and focus on how long each operation takes to render. At least then you can simply add the costs and figure out how much time you have left.
Additionally, when you measure the FPS of rendering something there is usually a frame rate limit, say 60 FPS, so you might think your effect takes 16.7 ms when it really takes 2 ms.

With that out of the way: I see that your `tris` list is 80 elements. If you're rendering 8000 triangles, are you saying you call that 100 times per frame? glDrawArrays is slow per call and is designed to work with a lot of data per call. I haven't tried criterion with OpenGL before, but that's what I would do next: use it to figure out the time per call of your glDrawArrays. Then you can get a sense of how many glDrawArrays calls fit in your rendering budget.

I'd also look up the tricks that people use to reduce the number of individual calls to glDrawArrays. Look up 'texture atlas'. I've never actually used it, so don't trust my explanation :) Roughly, I think you are restricted to one texture per glDrawArrays call, so you 'cheat' by copying all your textures into one big texture, noting the boundary points of each texture, and providing those as the texture coordinates for your triangles as appropriate. You could probably make this nicer to work with by using Data.Map to map between your original texture id and the coordinates in the atlas.

I hope that helps, Jason
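Jason's budgeting arithmetic is easy to put in code; a minimal sketch (the function name frameBudgetMs is made up for illustration):

```haskell
-- Per-frame time budget in milliseconds for a target frame rate.
frameBudgetMs :: Double -> Double
frameBudgetMs fps = 1000 / fps

main :: IO ()
main = do
  print (frameBudgetMs 60)  -- about 16.67 ms to spend per frame
  print (frameBudgetMs 30)  -- about 33.33 ms
```

Adding up per-operation costs against this budget is what makes milliseconds easier to reason about than FPS.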

On May 20, 2014, at 11:36 AM, Michael Baker wrote:
Here's the code https://gist.github.com/MichaelBaker/4429c93f2aca04bc79bb. [...] Line 53 is the one that causes the slowdown. [...]
I would avoid setting the buffer data every frame like this. You should create your VBO once; then, if all your geometry really does change every frame, map it, update the data, and unmap it each frame. Don't use a list for your geometry if you can avoid it. Consider using VAOs to keep track of enabled attributes, etc. Unsurprisingly, I'd recommend taking a look at the vinyl-gl tutorial for most of those considerations, though I don't think I say anything about mapping buffers there. Anthony
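A rough sketch of the map/update/unmap pattern Anthony describes, using OpenGLRaw's raw bindings (untested, needs a live GL context; updateVBO, src, and byteCount are hypothetical names, and the buffer is assumed to have been allocated once beforehand with glBufferData):

```haskell
import Foreign.Marshal.Utils (copyBytes)
import Foreign.Ptr (Ptr, castPtr)
import Graphics.Rendering.OpenGL.Raw.Core31

-- Overwrite an existing VBO's contents in place each frame, instead of
-- re-allocating a fresh data store with glBufferData every frame.
updateVBO :: GLuint -> Ptr a -> Int -> IO ()
updateVBO vbo src byteCount = do
  glBindBuffer gl_ARRAY_BUFFER vbo
  dst <- glMapBuffer gl_ARRAY_BUFFER gl_WRITE_ONLY
  copyBytes (castPtr dst) (castPtr src) byteCount
  _ <- glUnmapBuffer gl_ARRAY_BUFFER  -- GL_FALSE here means the store was lost
  return ()
```

Whether mapping beats glBufferSubData is driver-dependent; some drivers do just as well with glBufferData "orphaning", so measure rather than assume.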

I've added a VAO and reduced the inner loop to only a call to glDrawArrays which renders 8000 triangles (24000 vertices). The gameFrame function now benchmarks at ~86ms on my machine. I would expect it to be ~16ms because of vsync. I've pared down the entire project into this buildable cabal project of three files, most of which is OpenGL boilerplate: https://github.com/MichaelBaker/haskell-opengl. Running it benchmarks the problem area. For what it's worth, I also tried using glBufferSubData to update the vertex data, but that had no effect. As a side note, vinyl-gl looks like a nice library. I'll definitely look into using it if I can ever get this basic example working.

On Wed, May 21, 2014 at 6:53 PM, Michael Baker wrote:
I've added a VAO and reduced the inner loop to only a call to glDrawArrays which renders 8000 triangles (24000 vertices). [...]
Awesome. Thanks! I'm taking a look now. I'll let you know if I figure out anything. Jason

On Wed, May 21, 2014 at 9:59 PM, Jason Dagit wrote:
Awesome. Thanks! I'm taking a look now. I'll let you know if I figure out anything.
I only have access to Windows at the moment and I'm getting a weird linker error in bindings-GLFW when trying to build GLFW-b:

Loading package bindings-GLFW-3.0.3.2 ... linking ... ghc.exe: unable to load package `bindings-GLFW-3.0.3.2'
ghc.exe: warning: _vsnprintf from msvcrt is linked instead of __imp__vsnprintf
ghc.exe: C:\Users\dagit\Documents\Repos\haskell-opengl\.cabal-sandbox\x86_64-windows-ghc-7.8.2\bindings-GLFW-3.0.3.2\HSbindings-GLFW-3.0.3.2.o: unknown symbol `strdup'
Failed to install GLFW-b-1.4.6

I'll try on OSX tomorrow.

On May 21, 2014, at 9:53 PM, Michael Baker wrote:
I've added a VAO and reduced the inner loop to only a call to glDrawArrays which renders 8000 triangles (24000 vertices). [...] For what it's worth, I also tried using glBufferSubData to update the vertex data, but that had no effect. As a side note, vinyl-gl looks like a nice library. I'll definitely look into using it if I can ever get this basic example working.
That's some pretty serious antialiasing you're asking for, there (16x). Is that intentional? Anthony

That's some pretty serious antialiasing you're asking for, there (16x). Is that intentional?
Anthony
Ah, that might be part of the problem. Lowering (or removing) that definitely helps, but I'm still getting ~79ms per frame on a Macbook Pro Retina with an Intel HD Graphics 4000 video card.

On 21-05-2014 22:53, Michael Baker wrote:
I've added a VAO and reduced the inner loop to only a call to glDrawArrays which renders 8000 triangles (24000 vertices). [...]
Your program is segfaulting for me on Linux:

$ cabal clean
$ cabal configure --disable-executable-stripping --ghc-option=-debug
[...]
$ cabal build
[...]
$ LC_ALL=C valgrind ./dist/build/opengl/opengl
==21335== Memcheck, a memory error detector
==21335== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==21335== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==21335== Command: ./dist/build/opengl/opengl
==21335==
warming up
estimating clock resolution...
mean is 38.15961 us (20001 iterations)
found 583 outliers among 19999 samples (2.9%)
  328 (1.6%) high mild
  254 (1.3%) high severe
estimating cost of a clock call...
mean is 1.177577 us (66 iterations)
found 3 outliers among 66 samples (4.5%)
  2 (3.0%) high mild
  1 (1.5%) high severe
benchmarking gameFrame
==21335== Invalid read of size 8
==21335==    at 0x4C2CB30: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21335==    by 0x5C1E1A4: ??? (in /usr/lib/nvidia/libGL.so.337.19)
==21335==    by 0x5DC2: ???
==21335==    by 0xFF2FFFFFFFF: ???
==21335==    by 0xFEF: ???
==21335==    by 0xFEF: ???
==21335==    by 0xAC4285F: ???
==21335==    by 0x5EF4BB5: _XSend (in /usr/lib/libX11.so.6.3.0)
==21335==    by 0xAE456DF: ???
==21335==    by 0xAE460A7: ???
==21335==    by 0x5DC2F: ???
==21335==    by 0xAE455EF: ???
==21335==  Address 0xfe8 is not stack'd, malloc'd or (recently) free'd
==21335==
==21335== Process terminating with default action of signal 11 (SIGSEGV)
==21335==  Access not within mapped region at address 0xFE8
==21335==    at 0x4C2CB30: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
[... same backtrace as above ...]
==21335== If you believe this happened as a result of a stack
==21335== overflow in your program's main thread (unlikely but
==21335== possible), you can try to increase the size of the
==21335== main thread stack using the --main-stacksize= flag.
==21335== The main thread stack size used in this run was 8388608.
==21335==
==21335== HEAP SUMMARY:
==21335==     in use at exit: 1,222,944 bytes in 665 blocks
==21335==   total heap usage: 5,445 allocs, 4,780 frees, 78,290,922 bytes allocated
==21335==
==21335== LEAK SUMMARY:
==21335==    definitely lost: 10,752 bytes in 12 blocks
==21335==    indirectly lost: 0 bytes in 0 blocks
==21335==      possibly lost: 519,934 bytes in 2 blocks
==21335==    still reachable: 692,258 bytes in 651 blocks
==21335==         suppressed: 0 bytes in 0 blocks
==21335== Rerun with --leak-check=full to see details of leaked memory
==21335==
==21335== For counts of detected and suppressed errors, rerun with: -v
==21335== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 3 from 2)
Segmentation fault (core dumped)

$ gdb ./dist/build/opengl/opengl
GNU gdb (GDB) 7.7.1
[...]
Reading symbols from ./dist/build/opengl/opengl...(no debugging symbols found)...done.
(gdb) run
Starting program: /tmp/haskell-opengl/dist/build/opengl/opengl
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
warming up
estimating clock resolution...
mean is 1.346329 us (640001 iterations)
found 4071 outliers among 639999 samples (0.6%)
  3860 (0.6%) high severe
estimating cost of a clock call...
mean is 81.51352 ns (10 iterations)
found 1 outliers among 10 samples (10.0%)
  1 (10.0%) high mild
benchmarking gameFrame

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5c5ece0 in _wordcopy_fwd_aligned () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff5c5ece0 in _wordcopy_fwd_aligned () from /usr/lib/libc.so.6
#1  0x00007ffff5c595f5 in __memmove_sse2 () from /usr/lib/libc.so.6
#2  0x00007ffff6dff1a5 in ?? () from /usr/lib/libGL.so.1
#3  0x00007ffff6e01ada in ?? () from /usr/lib/libGL.so.1
#4  0x0000000000407b91 in ceGc_info ()
#5  0x0000000000000000 in ?? ()

Cheers, -- Felipe.

Michael Baker wrote:
I've added a VAO and reduced the inner loop to only a call to glDrawArrays which renders 8000 triangles (24000 vertices). The gameFrame function now benchmarks at ~86ms on my machine. I would expect it to be ~16ms because of vsync.
You're expecting too much from your graphics hardware. The total area of the triangles that you are drawing covers approximately 80 million pixels (they overlap a lot). If one uses non-overlapping triangles (with a total area close to 60k pixels)

    floats = concat $ [ [ 0, 0, 0
                        , cos ((x+1) / triangles), sin ((x+1) / triangles), 0
                        , cos (x / triangles), sin (x / triangles), 0 ]
                      | x <- [0..triangles] ] :: [GLfloat]

the program becomes a lot faster. Code:

    let area [] = 0
        area (x1:y1:_:x2:y2:_:x3:y3:_:xs) =
          ((x3-x1)*(y2-y1) - (x2-x1)*(y3-y1))/2*400*300 + area xs
    in area floats

Cheers, Bertram
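Bertram's two snippets can be assembled into a self-contained program that checks the estimate (assuming triangles = 8000 and keeping his 400*300 scale factor, which corresponds to an 800x600 viewport; plain Float stands in for GLfloat so no GL packages are needed):

```haskell
-- Estimate the total pixel area covered by the triangle fan. Each group of
-- nine floats is one triangle (three x,y,z vertices); the shoelace formula
-- gives its signed area in clip space, and *400*300 scales that to pixels.
triangles :: Float
triangles = 8000

-- Bertram's non-overlapping replacement geometry: a fan of thin triangles
-- around the unit circle, all sharing the origin.
floats :: [Float]
floats = concat [ [ 0, 0, 0
                  , cos ((x+1) / triangles), sin ((x+1) / triangles), 0
                  , cos (x / triangles), sin (x / triangles), 0 ]
                | x <- [0..triangles] ]

area :: [Float] -> Float
area [] = 0
area (x1:y1:_:x2:y2:_:x3:y3:_:xs) =
  ((x3-x1)*(y2-y1) - (x2-x1)*(y3-y1)) / 2 * 400 * 300 + area xs
area _ = error "vertex list length is not a multiple of 9"

main :: IO ()
main = print (area floats)  -- close to the 60k pixels Bertram mentions
```

Running the same area function over the original overlapping geometry is what yields the roughly 80-million-pixel figure: the slowdown was fill rate, not triangle count.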

You're expecting too much from your graphics hardware. The total area of the triangles that you are drawing covers approximately 80 million pixels (they overlap a lot). If one uses non-overlapping triangles (with a total area close to 60k pixels)
Ah, ok. So the problem is that I'm trying to render too many fragments, rather than too many triangles. So if I had a lot of small triangles, or a few big ones, I would be fine. However, I've got a lot of big ones, which is slow. I'll try reducing the total area of the triangles I'm rendering. Is there some benchmark or tool I could have used to figure that out? Something that would show me the time spent filling fragments vs the time spent processing triangles vs time spent uploading data to the graphics card?

On Thu, May 22, 2014 at 1:29 PM, Michael Baker wrote:
Is there some benchmark or tool I could have used to figure that out? Something that would show me the time spent filling fragments vs the time spent processing triangles vs the time spent uploading data to the graphics card?
This looks like a decent list of options: http://www.opengl.org/wiki/Debugging_Tools This SO question looks promising: http://stackoverflow.com/questions/12640841/did-someone-succeed-in-using-ope... In particular, I think you can attach the OSX opengl profiler to a process: https://developer.apple.com/library/mac/documentation/GraphicsImaging/Concep...

For the sake of posterity I want to let everyone know that decreasing the anti-aliasing samples and decreasing the total visible area being rendered solved the problem. Thank you everyone for your help!

2014-05-22 22:29 GMT+02:00 Michael Baker
[...] Is there some benchmark or tool I could have used to figure that out? Something that would show me the time spent filling fragments vs the time spent processing triangles vs time spent uploading data to the graphics card?
I don't think there are general cross-platform tools for this, but if, for example, you have NVIDIA hardware, your platform is supported, and you go through the initial pain of installing/learning the tool, NVIDIA Nsight or PerfKit can quickly answer such questions. No idea if something similar exists for AMD or Intel GPUs, but it's likely.

Apart from that, there are a few rules of thumb and techniques to determine the bottleneck in your rendering pipeline. Vary the size of the window you're drawing to and see if performance changes; if yes, you are probably limited by the fill rate of your GPU. Another test: keep the window size, but vary the complexity of the geometry. If performance changes, it could be the calculation of the geometry on the CPU or the transformation of the geometry on the GPU (depending on how you do things). You could even calculate and vary the geometry but not actually send it for rendering, to isolate the cost on your CPU. And you can play some OpenGL tricks to measure/visualize the amount of overdraw, etc.
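One of those overdraw-visualization tricks, sketched with OpenGLRaw-style raw calls (untested, needs a live GL context; overdrawPass and its draw argument are hypothetical names):

```haskell
import Graphics.Rendering.OpenGL.Raw.Core31

-- Render the scene with additive blending while the caller's shader outputs a
-- dim constant colour: each pixel's final brightness is then proportional to
-- how many fragments were written to it, so bright regions reveal overdraw.
overdrawPass :: IO () -> IO ()
overdrawPass draw = do
  glClearColor 0 0 0 1
  glClear gl_COLOR_BUFFER_BIT
  glEnable gl_BLEND
  glBlendFunc gl_ONE gl_ONE  -- additive blending
  draw
  glDisable gl_BLEND
```

In this thread's case, the overlapping fan from the gist would show up as one bright hot spot, consistent with Bertram's 80-million-pixel estimate.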

Maybe you need to disable the hot switching between the default graphics
card and the more powerful one? I think that was the source of a few
similar issues [1].
[1]: http://gloss.ouroborus.net, question "Q: On my MacBook Pro under OSX, gloss programs freeze after displaying the first few frames"
On Tue, May 20, 2014 at 6:20 AM, Michael Baker wrote:
I'm using OpenGLRaw and I'm getting 3 frames per second trying to draw 8000 triangles. [...]
-- Alp Mestanogullari
participants (8)
- Alp Mestanogullari
- Anthony Cowley
- Anthony Cowley
- Bertram Felgenhauer
- Felipe Lessa
- Jason Dagit
- Michael Baker
- Sven Panne