I read that APIs like glDrawElementsIndirect and glDrawArraysIndirect enable indirect rendering. Indirect rendering differs from direct rendering in that the drawing parameters, such as the number of vertex attributes, the number of instances to draw, and the starting vertex in the buffer object, are read from a buffer object (which the GPU itself can write) rather than being passed by the CPU in the draw call.
I understood that. It was also explained that the advantage is faster rendering, because no CPU interaction is involved. But wait, wasn't it the CPU that actually made the render call? It still specified the rendering mode (GL_TRIANGLES etc.), and it possibly loaded the vertex attributes as well.
So is all the performance gain in indirect rendering accounted for merely by not having to pass these tiny variables ("count", "primitive count", "first vertex attribute", "instance count")? That doesn't make much sense to me. (It isn't changing any state either.)
Indirect drawing enables some scene-traversal and culling to be moved from the CPU to the GPU, which can improve performance. The command buffer can be generated by the CPU or GPU.
The multi-draw indirect extensions allow multiple indirect draw commands to be submitted in one API call. The draw calls are issued by the GPU's command processor (CP), potentially saving the significant CPU overhead incurred by submitting the equivalent draw calls from the CPU.
Instancing, or instanced rendering, is a way of executing the same drawing commands many times in a row, with each producing a slightly different result. This can be a very efficient method of rendering a large amount of geometry with very few API calls.
The performance gain is often not so much from passing some small variables like "count" or "instance count", but from having to know them in the first place. For the CPU to know these values, they must make a round trip from the GPU back to the CPU, which is only possible after the result is available, i.e. after a server sync (plus the added latency of the bus).
Say you are using transform feedback with a geometry shader. This means no matter what you feed in, you don't really know what comes out on the other end, not before the batch has finished and you've queried the counts, anyway.
Indirect rendering addresses this: you don't need to know, and in fact you don't want to know. The parameters go into a buffer object, and the GPU can access them without your intervention.
That's analogous to conditional rendering. Actually, you could skip conditional rendering entirely, couldn't you? Instead of submitting commands to the command queue that will maybe not be executed (how inefficient!), you could run your occlusion query, check whether it passes, and then decide whether to submit the objects you want to draw.
Except this means you must wait until the query (and thus the previous batch) is finished, sync, and do a PCIe transfer before making this decision. During this time, the GPU likely stalls, and then you've still not set up the right buffers/textures and submitted commands. In reality, it is therefore much more efficient to speculatively submit commands and let the driver/GPU decide whether to discard them or whether to draw them.
That's also the idea behind ARB_query_buffer_object, which lets you read a query result into a buffer object.
EDIT:
Also, indirect rendering allows much more efficient submission of render-command batches (especially in combination with persistent mappings). It can avoid much or all of the server/client and CPU/GPU synchronization normally involved, the commands can be generated on another processor core, and it saves the fixed per-draw-call overhead. See pages 62 onward in Cass Everitt's talk.
In direct rendering the CPU is busy preparing and streaming the index data out of its own memory, over a bus with limited bandwidth, to the GPU. It must track GPU state and synchronize with it. Each of those steps is time consuming.
With indirect rendering, all the CPU does is send one short command that kicks off a large batch of drawing operations. This saves bus bandwidth. And because the GPU then works for a longer time span, there are fewer interruptions forcing the CPU to stop whatever it is doing (context switches), which means complex numeric tasks, like physics simulations, run faster.