Consider the typical "naive" vertex shader:
#version 420 core
in vec3 aPos;
uniform mat4 uMatCam;        // camera/view transform: constant across draw calls
uniform mat4 uMatModelView;  // per-object transform: changes per draw call
uniform mat4 uMatProj;       // perspective projection: constant across draw calls
void main () {
    // all three matrices are multiplied together for every vertex
    gl_Position = uMatProj * uMatCam * uMatModelView * vec4(aPos, 1.0);
}
Of course, conventional wisdom would suggest "there are three mat4s multiplied for each vertex, two of which stay constant even across multiple subsequent glDrawX() calls within the current shader program; at least those two should be pre-multiplied CPU-side, possibly even all three."
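To be concrete, by "pre-multiplied CPU-side" I mean something along these lines. This is only a sketch, assuming a GLM-style math library, a GL loader such as glad, and that the program is currently bound; the uniform name uMatProjCam is purely illustrative:
#include <glad/glad.h>              // or whichever GL loader the project already uses
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

void uploadProjCam(GLuint program, const glm::mat4& proj, const glm::mat4& cam)
{
    // One mat4 * mat4 product on the CPU, once per UseProgram()/frame,
    // instead of once per vertex in the shader.
    glm::mat4 projCam = proj * cam;
    // Assumes glUseProgram(program) has already been called.
    glUniformMatrix4fv(glGetUniformLocation(program, "uMatProjCam"),
                       1, GL_FALSE, glm::value_ptr(projCam));
}
The per-vertex transform in the shader would then shrink to gl_Position = uMatProjCam * uMatModelView * vec4(aPos, 1.0).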
I'm wondering whether modern-day GPUs have optimized this use-case to a degree where CPU-side premultiplication is no longer a performance benefit. Of course, the purist might say "it depends on the end-user's OpenGL implementation" but for this use-case we can safely assume it'll be a current-generation OpenGL 4.2-capable nVidia or ATI driver providing that implementation.
From your experience, considering we might be "Drawing" a million or so vertices per UseProgram() pass -- would pre-multiplying at least the first two (perspective-projection and camera-transform matrices) per UseProgram() boost performance to any significant degree? What about all three per Draw() call?
Of course, it's all about benchmarking... but I was hoping someone has fundamental, current-gen hardware-implementation-based insights I'm missing out on that might suggest either "not even worth a try, don't waste your time" or "do it by all means, as your current shader without pre-multiplication would be sheer insanity"... Thoughts?
> I'm wondering whether modern-day GPUs have optimized this use-case to a degree where CPU-side premultiplication is no longer a performance benefit.
GPUs work best at parallel operations. The only way the GPU side can optimize three sequential matrix multiplies like this is if the shader compiler detects that the operands are uniforms and has the driver do the uniform-only multiplications itself when you issue a draw call, passing the shader the result.
Either way, the three matrix multiplies collapse to one in the shader. You can either do those multiplications yourself on the CPU or not, and the driver can either implement this optimization or not. Here's a diagram of the possibilities:
             | GPU optimizes  | GPU doesn't optimize
-------------|----------------|---------------------
You send 3   | Case A         | Case B
matrices     |                |
-------------|----------------|---------------------
You multiply | Case C         | Case D
on the CPU   |                |
-------------|----------------|---------------------
In Case A, you get better performance than your shader code would suggest; in Case B, you don't.
Cases C and D are both guaranteed to give you the same performance as Case A.
The question isn't whether drivers will implement this optimization. The question is, "what is that performance worth to you?" If you want that performance, then it behooves you to do it yourself; that's the only way to reliably achieve that performance. And if you don't care about the performance... what does it matter?
In short, if you care about this optimization, do it yourself.
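For illustration, "doing it yourself" for the all-three-per-draw case might look roughly like this. It's only a sketch that assumes GLM, a GL loader, and an already-bound shader program; names like locMVP and uMatMVP are made up for the example:
#include <glad/glad.h>              // or whichever GL loader is already in use
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

void drawWithMVP(GLint locMVP,
                 const glm::mat4& proj, const glm::mat4& cam, const glm::mat4& model,
                 GLuint vao, GLsizei vertexCount)
{
    // Two mat4 * mat4 products per draw call on the CPU...
    glm::mat4 mvp = proj * cam * model;
    // ...so the shader is left with a single mat4 * vec4 per vertex:
    //     gl_Position = uMatMVP * vec4(aPos, 1.0);
    glUniformMatrix4fv(locMVP, 1, GL_FALSE, glm::value_ptr(mvp));

    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
A couple of CPU-side matrix multiplies per draw call is noise next to doing them a million times per frame on the GPU, if the vertex stage is where your bottleneck actually is.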
> From your experience, considering we might be "Drawing" a million or so vertices per UseProgram() pass -- would pre-multiplying at least the first two (perspective-projection and camera-transform matrices) per UseProgram() boost performance to any significant degree? What about all three per Draw() call?
It might; it might not. It all depends on how bottlenecked by vertex transforms your rendering system is. There is no way to know without testing in the actual rendering environment.
Also, combining the projection and camera matrices isn't the best idea, since that would mean doing lighting in world space rather than camera space. It also makes deferred rendering that much harder, since you don't have a pure projection matrix to pull values out of.
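If you still want to pre-multiply per draw call without giving up camera-space lighting, one option is to upload both the combined MVP and the separate model-view matrix. Again this is just a sketch, with the same GLM assumption, made-up uniform locations, and treating your uMatModelView as the per-object matrix:
#include <glad/glad.h>
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

void uploadPerDraw(GLint locMVP, GLint locModelView,
                   const glm::mat4& proj, const glm::mat4& cam, const glm::mat4& model)
{
    glm::mat4 modelView = cam * model;       // camera space, for lighting
    glm::mat4 mvp       = proj * modelView;  // clip space, for gl_Position
    glUniformMatrix4fv(locModelView, 1, GL_FALSE, glm::value_ptr(modelView));
    glUniformMatrix4fv(locMVP,       1, GL_FALSE, glm::value_ptr(mvp));
}
That way gl_Position uses the combined matrix, lighting inputs go through the model-view matrix and stay in camera space, and the projection matrix is still available on its own when you need to pull values out of it.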