Underlying hardware mapping of Vulkan queues

Tags: gpu, vulkan

Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance.

After one driver update, I got two transfer-only queues instead of one, but I'm pretty sure there will be no benefit in using them in parallel for data streaming compared to just using one of them (I'd be happy to be proven wrong).

So why not just say "we have N separate hardware queues, and if you want to use some of them in parallel, just mutex it yourself"? As it stands, there seems to be no way to know how independent the queues in a family really are.
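For reference, this family layout is all the API reveals; a minimal C sketch, assuming a valid VkPhysicalDevice in physicalDevice:

    #include <stdio.h>
    #include <stdlib.h>
    #include <vulkan/vulkan.h>

    void print_queue_families(VkPhysicalDevice physicalDevice)
    {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, NULL);

        VkQueueFamilyProperties *props =
            malloc(count * sizeof(VkQueueFamilyProperties));
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, props);

        for (uint32_t i = 0; i < count; ++i) {
            /* A transfer-only family sets VK_QUEUE_TRANSFER_BIT but neither
               GRAPHICS nor COMPUTE. queueCount is all Vulkan tells you --
               nothing says how those queues map to actual hardware. */
            printf("family %u: flags 0x%x, queueCount %u\n",
                   i, props[i].queueFlags, props[i].queueCount);
        }
        free(props);
    }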

asked Mar 03 '23 by YaaZ


1 Answer

GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.

That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.

And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:

I got two transfer-only queues instead of one, but I'm pretty sure there will be no benefit in using them in parallel for data streaming compared to just using one of them

Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that only one thing can be DMA'd from CPU memory to GPU memory at any one time.

Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.

With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
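In Vulkan terms, that policy is expressed once, at device creation, through pQueuePriorities. A minimal C sketch, assuming a hypothetical transfer-only family index transferFamily (note the spec treats priorities only as a hint, with no preemption guarantee):

    /* Queue 0 (downloads) gets high priority, queue 1 (uploads) low. */
    float priorities[2] = { 1.0f, 0.5f };

    VkDeviceQueueCreateInfo queueInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        .queueFamilyIndex = transferFamily,   /* hypothetical index */
        .queueCount = 2,
        .pQueuePriorities = priorities,
    };

    VkDeviceCreateInfo deviceInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .queueCreateInfoCount = 1,
        .pQueueCreateInfos = &queueInfo,
    };

    VkDevice device;
    vkCreateDevice(physicalDevice, &deviceInfo, NULL, &device);

    /* A queue's index matches its position in the priorities array. */
    VkQueue downloadQueue, uploadQueue;
    vkGetDeviceQueue(device, transferFamily, 0, &downloadQueue);
    vkGetDeviceQueue(device, transferFamily, 1, &uploadQueue);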

With your way, you'd have to upload each item one at a time at regular intervals, in a process that must be able to be interrupted by a possible download. To do that, you'd basically have to have a recurring task that shows up to perform and submit a single upload to the transfer queue.

It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, it'll probably perform operations round-robin, jumping back and forth between the transfer queues' operations rather than waiting for one queue to run dry before trying another.
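Concretely, "throwing the work at the GPU" is just two independent submissions; a sketch using the queues from above, with hypothetical command buffers and fences:

    /* Uploads stream into one queue... */
    VkSubmitInfo uploadSubmit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &uploadCmd,       /* hypothetical */
    };
    vkQueueSubmit(uploadQueue, 1, &uploadSubmit, uploadFence);

    /* ...while an urgent download goes to the other. No host-side
       mutexing or hand-rolled interleaving: the driver and hardware
       decide how the two streams share the underlying DMA engine. */
    VkSubmitInfo downloadSubmit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &downloadCmd,     /* hypothetical */
    };
    vkQueueSubmit(downloadQueue, 1, &downloadSubmit, downloadFence);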

But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.

The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware, for example, offers two transfer queues that actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.

answered Mar 28 '23 by Nicol Bolas