I have a vertex buffer backed by a device memory allocation (bound to a VkBuffer) that is host visible and host coherent.
To write to the vertex buffer on the host side, I map the device memory, memcpy into it, and unmap it.
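Roughly like this (a minimal sketch; the function and variable names are placeholders, not my actual code):

```c
// Sketch of the upload path, assuming the memory was allocated with
// HOST_VISIBLE | HOST_COHERENT.  "vertexMemory" etc. are placeholder names.
#include <string.h>
#include <vulkan/vulkan.h>

static void uploadVertices(VkDevice device, VkDeviceMemory vertexMemory,
                           VkDeviceSize offset, const void *src, VkDeviceSize size)
{
    void *dst = NULL;
    vkMapMemory(device, vertexMemory, offset, size, 0, &dst);
    memcpy(dst, src, (size_t)size);
    // HOST_COHERENT: no vkFlushMappedMemoryRanges needed before the GPU reads it.
    vkUnmapMemory(device, vertexMemory);
}
```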
To read from it, I bind the vertex buffer in a command buffer while recording a render pass. These command buffers are submitted in a loop that acquires, submits, and presents to draw each frame.
Currently I write once to the vertex buffer at program start up.
The vertex buffer then remains the same during the loop.
I'd like to modify the vertex buffer between each frame from the host side.
What I'm not clear on is the best/right way to synchronize these host-side writes with the device-side reads. Currently I have a fence and a pair of semaphores for each frame allowed simultaneously in flight.
For each frame:
I wait on the fence.
I reset the fence.
The acquire signals semaphore #1.
The queue submit waits on semaphore #1, signals semaphore #2, and signals the fence.
The present waits on semaphore #2
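Put together, the loop looks roughly like this (a sketch only; all handle names such as inFlightFence, imageAvailable, and renderFinished are placeholders):

```c
// Inside the render loop; error handling omitted, all handles are placeholders.
vkWaitForFences(device, 1, &inFlightFence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &inFlightFence);

uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      imageAvailable /* semaphore #1 */, VK_NULL_HANDLE, &imageIndex);

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
submit.waitSemaphoreCount   = 1;
submit.pWaitSemaphores      = &imageAvailable;   // waits on semaphore #1
submit.pWaitDstStageMask    = &waitStage;
submit.commandBufferCount   = 1;
submit.pCommandBuffers      = &commandBuffers[imageIndex];
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &renderFinished;   // signals semaphore #2
vkQueueSubmit(graphicsQueue, 1, &submit, inFlightFence);  // signals the fence

VkPresentInfoKHR present = { VK_STRUCTURE_TYPE_PRESENT_INFO_KHR };
present.waitSemaphoreCount = 1;
present.pWaitSemaphores    = &renderFinished;    // waits on semaphore #2
present.swapchainCount     = 1;
present.pSwapchains        = &swapchain;
present.pImageIndices      = &imageIndex;
vkQueuePresentKHR(presentQueue, &present);
```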
Where is the right place in this to put the host-side map/memcpy/unmap and how should I synchronize it properly with the device reads?
If you want to take advantage of asynchronous GPU execution, you want the CPU to avoid having to stall for GPU operations. So never wait on a fence for a batch that was just issued. The same thing goes for memory: you should never desire to write to memory which is being read by a GPU operation you just submitted.
You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffer bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
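A sketch of what that region selection might look like (all names here are placeholders):

```c
// Sketch: one VkBuffer sized for two copies of the vertex data; each frame
// binds the half that the GPU is not currently reading.
#include <vulkan/vulkan.h>

static VkDeviceSize regionOffsetForFrame(uint64_t frameNumber, VkDeviceSize copySize)
{
    return (VkDeviceSize)(frameNumber % 2) * copySize;   // region 0 or region 1
}

static void recordDraw(VkCommandBuffer cmd, VkBuffer vertexBuffer,
                       uint64_t frameNumber, VkDeviceSize copySize,
                       uint32_t vertexCount)
{
    VkDeviceSize offset = regionOffsetForFrame(frameNumber, copySize);
    // Bind only this frame's region of the larger buffer.
    vkCmdBindVertexBuffers(cmd, 0, 1, &vertexBuffer, &offset);
    vkCmdDraw(cmd, vertexCount, 1, 0, 0);
}
```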
Each batch you submit reads from certain memory. As such, the fence for that batch will be set when the GPU is finished reading from that memory. So if you want to write to that memory from the CPU, you cannot begin until the fence for the batch that reads it gets set.
But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame. It's the fence for the batch you submitted the frame before that. Since it's been some time since the GPU received that operation, it is far less likely that the CPU will have to actually wait. That is, the fence should hopefully already be set.
Now, you shouldn't do a literal vkWaitForFences on that fence. You should check to see if it is set, and if it isn't, go do something else useful with your time. But if you have nothing else useful you could be doing, then waiting is probably OK (rather than sitting and spinning on a test).
Once the fence is set, you know that you can freely write to the memory.
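Something along these lines (a sketch; the fence array and frame counter are placeholder names):

```c
// Sketch: before overwriting a region, make sure the fence of the batch that
// last read from it is signaled.  With two regions, that is the submit from
// two frames ago.  "inFlightFences" and "frameNumber" are placeholder names.
#include <stdint.h>
#include <vulkan/vulkan.h>

static void waitUntilRegionWritable(VkDevice device, VkFence *inFlightFences,
                                    uint64_t frameNumber)
{
    VkFence regionFence = inFlightFences[frameNumber % 2];

    if (vkGetFenceStatus(device, regionFence) != VK_SUCCESS) {
        // Not signaled yet: ideally do other useful CPU work and poll again;
        // blocking is an acceptable fallback when there is nothing else to do.
        vkWaitForFences(device, 1, &regionFence, VK_TRUE, UINT64_MAX);
    }
    // The GPU is done reading this region; it is now safe to map/memcpy into it.
}
```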
How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?
You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.
Well... almost.
If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If they get called in the wrong order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. And therefore, you could submit that batch before performing the memory writing. But in this case, the vkCmdWaitEvents call should include a source stage mask of HOST (since that's who's setting the event), and it should have a memory barrier whose source access flag also includes HOST_WRITE (since that's who's writing to the memory).
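A sketch of that event-based path (the destination stage/access here assume a vertex-buffer read; handle names are placeholders):

```c
// Sketch of the event-based variant: the command buffer waits on a host-set
// event with a HOST -> vertex-input memory dependency.
#include <stddef.h>
#include <vulkan/vulkan.h>

static void recordWaitOnHostWrite(VkCommandBuffer cmd, VkEvent hostWriteDone)
{
    VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;            // the host wrote the memory
    barrier.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT; // the GPU reads it as vertices

    vkCmdWaitEvents(cmd, 1, &hostWriteDone,
                    VK_PIPELINE_STAGE_HOST_BIT,         // source: the host sets the event
                    VK_PIPELINE_STAGE_VERTEX_INPUT_BIT, // destination: vertex fetch
                    1, &barrier, 0, NULL, 0, NULL);
}

// Later, on the CPU, once the memcpy has finished:
//     vkSetEvent(device, hostWriteDone);
```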
But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.