All the time till now I had 3D objects created during the startup. But now I need to add them dynamically. What can be simpler, I thought...
The main issue right now is how to upload the new object's data in the fastest way and find out when the data is uploaded.
Here's my setup:
VkBuffer for every object - this way I don't need to manage offsets, alignments and it would be easier to add/remove objects.And here are my thoughts/questions:
draw command when this fence is ready.I'm pretty sure I'm overcomplicating. I believe there should be a much simpler pattern. How would you do this?
I believe there should be a much simpler pattern.
It's Vulkan; it's an explicit, low-level API. "Simple" is not its goal.
Overall, your Vulkan code needs to be written to adapt to the capabilities of the hardware. That's the best way to get performance out of it.
The first decision that needs to be made is whether you need staging at all. Staging (for buffer copies) is only necessary if your device's DEVICE_LOCAL memory is not mappable. And yes, there are (integrated) GPUs that allow you to map DEVICE_LOCAL memory. If that is the case, then you can just write directly to where you need the data to go.
If staging is needed, then you need to decide if the hardware supports an independent transfer-only queue. If so, then you will likely get performance benefits by employing it. Not all hardware supports transfer-only queues, so your application needs to adapt. Also, transfer-only queues can have restrictions on the granularity of memory transfers taking place on those queues, so you need to check to see if your streaming strategy fits within the limits of that particular hardware.
Also, if there is no appropriate transfer queue, you can create the effect of a transfer queue by using a second compute or graphics queue... if the hardware supports multiple queues at all. Being able to submit transfer commands and rendering commands on different queues is a good thing, assuming you are taking advantage of threading (ie: issuing submits of the batches to the different queues on different threads).
If you are able to use a separate queue for transfers (whether a true transfer queue or just a separate compute/graphics queue), then you get to play around with semaphores. The batch that transfers data must signal a semaphore when it completes; this is part of the batch in the vkQueueSubmit call. The batch on the main queue that uses the transferred data for some process needs to wait on that semaphore. So both threads need to be using the same VkSemaphore object. And the wait on the semaphore should just have a global memory barrier, to make the memory visible.
The tricky part is this: you cannot submit the batch that waits on the semaphore until the submit call for the batch that signals it has been submitted. You don't have to wait until completion, but you do have to wait until the vkQueueSubmit call on the transfer queue has returned. So you need a way to transfer the semaphore between different threads, or you could just issue both submit commands on the same thread.
If you aren't using a second queue, then things are slightly simpler.
You still want to build the transfer command buffer itself on a different thread (to take advantage of threading CB construction). But that CB now needs to be communicated to the thread responsible for submitting the rendering stuff. And this channel of communication needs to know that this CB contains transfer commands, which some of the rendering CB processes ought to wait on.
The simplest and most flexible way to do this is to build the transfer CB so that the last command is a vkCmdSetEvent command (and the first command is a vkCmdResetEvent to reset it from previous frames of usage). The submission thread then only needs to create a small CB that only contains a vkCmdWaitEvents command which waits on the transfer event that will be set. That command should issue a full memory barrier, and that CB should execute between the transfer CB and any rendering CBs that read from the transferred data.
The flexibility of this is in the structure of the process. It is structured similarly to how the multi-queue version works. In both cases, a separate thread needs to communicate something to the render submission thread (in one case, a semaphore; in the other, a CB and an event). And the render submission thread needs to do things to wait on that "something", but without disrupting the process of building the rendering commands itself (in one case, you just change the batch to wait on the semaphore; in the other, you insert a CB that waits for the event).
If you want to get a bit smarter about execution dependencies, you can even have the transfer operation forward information about which pipeline stages need to wait on the operation. But that's mostly an optimization.
Here's the thing though: all of the staging cases are not performance-friendly. They're problematic because you can't do anything while the transfer operation is going on. And that is the case because... you're trying to read from the memory in the same frame you're writing to it. That's bad.
You should endeavor instead to delay rendering any objects for which loading is not complete. Or put another way, you want to load the data for new objects before you need them, not on the same frame you need them. This is what streaming systems do: they pre-emptively load data that will be needed soon, but not right now.
But how big?
Only you and your use cases can answer that question. If you are streaming in fixed-sized blocks (which you should do where possible), then it's fairly easy: your staging buffer should be one or maybe two streaming blocks in size. If your rendering system is more flexible, imposing few limitations on the higher-level code, then your staging buffer and your streaming system needs to be more flexible. And there's no right answer for that; it depends entirely on how it gets used.
Welcome to using explicit, low-level APIs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With