Update: This has been solved, you can find further details here: https://stackoverflow.com/a/64405505/1889253
A similar question was asked previously, but that question was initially focused on using multiple command buffers and triggering the submissions across different threads to achieve parallel execution of shaders. Most of the answers suggest that the solution is to use multiple queues instead. The use of multiple queues also seems to be the consensus across various blog posts and Khronos forum answers. I have attempted those suggestions, running shader executions across multiple queues, but have not been able to observe parallel execution, so I wanted to ask what I may be doing wrong. As suggested, this question includes runnable code of multiple compute shaders being submitted to multiple queues, which hopefully can be useful for other people looking to do the same (once this is resolved).
The current implementation is in this pull request / branch; however, I will cover the main Vulkan-specific points, to ensure only Vulkan knowledge is required to answer this question. It's also worth mentioning that the current use-case is specifically for compute queues and compute shaders, not graphics or transfer queues (although insights/experience achieving parallelism across those would still be very useful, and would most probably also lead to the answer).
More specifically, I have the following:
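The full setup lives in the linked branch, but in raw Vulkan terms it boils down to something like the following sketch (all identifiers such as computeFamilyIndex, cmdBuffers and fences are placeholders, not the exact code from the PR):

```cpp
#include <vector>
#include <vulkan/vulkan.h>

// Create a logical device exposing `queueCount` queues from one
// compute-capable family, then fetch a handle to each queue.
VkDevice createDeviceWithQueues(VkPhysicalDevice physicalDevice,
                                uint32_t computeFamilyIndex,
                                uint32_t queueCount,
                                std::vector<VkQueue>& queues) {
    std::vector<float> priorities(queueCount, 1.0f);
    VkDeviceQueueCreateInfo queueInfo{VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO};
    queueInfo.queueFamilyIndex = computeFamilyIndex;
    queueInfo.queueCount = queueCount;  // must not exceed the family's queueCount
    queueInfo.pQueuePriorities = priorities.data();

    VkDeviceCreateInfo deviceInfo{VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

    queues.resize(queueCount);
    for (uint32_t i = 0; i < queueCount; ++i)
        vkGetDeviceQueue(device, computeFamilyIndex, i, &queues[i]);
    return device;
}

// Submit one pre-recorded command buffer (one shader dispatch) per queue,
// each with its own fence.
void submitOnePerQueue(const std::vector<VkQueue>& queues,
                       const std::vector<VkCommandBuffer>& cmdBuffers,
                       const std::vector<VkFence>& fences) {
    for (size_t i = 0; i < queues.size(); ++i) {
        VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
        submit.commandBufferCount = 1;
        submit.pCommandBuffers = &cmdBuffers[i];
        vkQueueSubmit(queues[i], 1, &submit, fences[i]);
    }
}
```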
A couple of points that are not visible in the examples above but are important:
The test used in the benchmark can be found here; however, the key things to understand are:
When running the test, we first run a set of "synchronous" shader executions on the same queue (the number is variable, but we've tested with 6-16, the latter being the max number of queues). Then we run these in an asynchronous manner, where we submit all of them and then evalAwait until they are finished. When comparing the resulting times from both approaches, they take the same amount of time even though they run across different compute queues.
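In raw Vulkan terms, the two modes of the benchmark correspond roughly to the following sketch (the VkSubmitInfo batches and fences are assumed to be set up beforehand; the final fence wait is roughly what evalAwait amounts to):

```cpp
#include <vector>
#include <vulkan/vulkan.h>

// "Synchronous" baseline: one queue, wait after every submission.
void runSequential(VkDevice device, VkQueue queue,
                   const std::vector<VkSubmitInfo>& submits, VkFence fence) {
    for (const auto& submit : submits) {
        vkQueueSubmit(queue, 1, &submit, fence);
        vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
        vkResetFences(device, 1, &fence);
    }
}

// "Asynchronous" run: one queue per job, submit everything up front,
// then wait on all fences at the end.
void runAsync(VkDevice device, const std::vector<VkQueue>& queues,
              const std::vector<VkSubmitInfo>& submits,
              const std::vector<VkFence>& fences) {
    for (size_t i = 0; i < submits.size(); ++i)
        vkQueueSubmit(queues[i], 1, &submits[i], fences[i]);
    vkWaitForFences(device, static_cast<uint32_t>(fences.size()),
                    fences.data(), VK_TRUE, UINT64_MAX);
}
```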
My questions are:
Furthermore, I have found several useful resources online, across various Reddit posts and Khronos Group forums, that provide very in-depth conceptual and theoretical overviews of the topic, but I haven't come across end-to-end code examples that show parallel execution of shaders. If there are any practical examples out there that you can share, with functioning parallel execution of shaders, that would be very helpful.
If there are further details or questions that can help provide further context, please let me know; I'm happy to answer them and/or provide more detail.
For completeness, my tests were using:
Other relevant links that have been shared in similar posts:
Compute shaders in Vulkan have first class support in the API. Compute shaders give applications the ability to perform non-graphics related tasks on the GPU. This sample assumes you have some knowledge of how compute shaders work in other related graphics APIs such as OpenGL ES.
In order to execute multiple command buffers in parallel, you need to submit them to separate queues. But the hardware must support multiple queues: it must have separate physical queues in order to be able to process them concurrently.
You are getting "asynchronous execution". You just don't expect it to behave the way it behaves.
On a CPU, if you have one thread active, then you're using one CPU core (or hyper-thread). All of that core's execution and computation capabilities are given to your thread alone (ignoring pre-emption). But at the same time, if there are other cores, your one thread cannot use any of the computational resources of those cores. Not unless you create another thread.
GPUs don't work that way. A queue is not like a CPU thread. It does not specifically relate to a particular quantity of computational resources. A queue is merely the interface through which commands get executed; the underlying hardware decides how to farm out commands to the various compute resources provided by the GPU as a whole.
What generally happens when you execute a command is that the hardware attempts to fully saturate the available shader execution units using your command. If there are more shader units available than the number of invocations your operation requires, then some resources are available immediately for the next command. But if not, then the entire GPU's compute resources will be dedicated to executing the first operation; the second one must wait for resources to become available before it can start.
It doesn't matter how many compute queues you shove work into; they're all going to try to use as many compute resources as possible. So they will largely execute in some particular order.
Queue priority systems exist, but these mainly help determine the order of execution for commands. That is, if a high-priority queue has some commands that need to be executed, then they will take priority the next time compute resources become available for a new command.
So submitting 3 dispatch batches on 3 separate queues is not going to complete faster than submitting 1 batch on one queue containing 3 dispatch operations.
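In other words, a single submission like the sketch below (identifiers illustrative; assumes the three dispatches are independent, so no barriers are recorded between them) should perform on par with three submissions spread across three queues:

```cpp
#include <vulkan/vulkan.h>

// One submission on one queue, containing three independent dispatches.
// With no barriers between them, the driver is free to overlap them anyway.
void submitThreeDispatches(VkQueue queue, VkCommandBuffer cmd,
                           VkPipeline pipeline, VkPipelineLayout layout,
                           VkDescriptorSet sets[3], uint32_t groupsX,
                           VkFence fence) {
    VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    vkBeginCommandBuffer(cmd, &begin);
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    for (int i = 0; i < 3; ++i) {
        vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                                0, 1, &sets[i], 0, nullptr);
        vkCmdDispatch(cmd, groupsX, 1, 1);
    }
    vkEndCommandBuffer(cmd);

    VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    vkQueueSubmit(queue, 1, &submit, fence);  // one batch, one queue
}
```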
The main reason multiple queues (of the same family) exist is to be able to submit work from multiple threads without having them do inter-thread synchronization (and to provide some possible prioritization of submissions).
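For example (a sketch with placeholder identifiers): each thread can own one queue and submit without any locking, because Vulkan requires external synchronization per queue, not per device.

```cpp
#include <thread>
#include <vector>
#include <vulkan/vulkan.h>

// Each thread owns a distinct VkQueue, so no mutex is needed around
// vkQueueSubmit; only access to the *same* queue must be synchronized.
void submitFromThreads(const std::vector<VkQueue>& queues,
                       const std::vector<VkCommandBuffer>& cmdBuffers,
                       const std::vector<VkFence>& fences) {
    std::vector<std::thread> workers;
    for (size_t i = 0; i < queues.size(); ++i) {
        workers.emplace_back([&, i] {
            VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
            submit.commandBufferCount = 1;
            submit.pCommandBuffers = &cmdBuffers[i];
            vkQueueSubmit(queues[i], 1, &submit, fences[i]);
        });
    }
    for (auto& t : workers) t.join();
}
```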
I have been able to solve this using the suggestion above. To provide further context, I was trying to submit commands to multiple queues within the same family; however, as pointed out in the linked suggestion, NVIDIA (and other GPU vendors) have a varying range of capabilities when it comes to parallel processing of command submissions.
In my particular case, the NVIDIA 1650 card I was testing with only supports concurrent processing when workloads are submitted to different queue families; more specifically, it is only able to support one concurrent command submission across one graphics-family queue and one compute-family queue.
I re-implemented the code to allow for allocating queues from specific families for specific commands, and I was able to achieve parallel processing (with a 2x speed improvement by submitting across two queue families).
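As a rough sketch of the queue-family selection that made the difference (illustrative code, not the exact implementation; see the link below for the real one):

```cpp
#include <vector>
#include <vulkan/vulkan.h>

// Pick two different queue families: the graphics family (which on most
// desktop GPUs also supports compute) and a dedicated compute-only family,
// then create a device with one queue from each.
VkDevice createDualFamilyDevice(VkPhysicalDevice physicalDevice,
                                VkQueue& graphicsQueue, VkQueue& computeQueue) {
    uint32_t familyCount = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, nullptr);
    std::vector<VkQueueFamilyProperties> families(familyCount);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount,
                                             families.data());

    uint32_t graphicsFamily = UINT32_MAX, computeOnlyFamily = UINT32_MAX;
    for (uint32_t i = 0; i < familyCount; ++i) {
        VkQueueFlags flags = families[i].queueFlags;
        if (flags & VK_QUEUE_GRAPHICS_BIT)
            graphicsFamily = i;              // also supports compute here
        else if (flags & VK_QUEUE_COMPUTE_BIT)
            computeOnlyFamily = i;           // compute without graphics
    }

    float priority = 1.0f;
    VkDeviceQueueCreateInfo queueInfos[2]{};
    for (int i = 0; i < 2; ++i) {
        queueInfos[i].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
        queueInfos[i].queueCount = 1;
        queueInfos[i].pQueuePriorities = &priority;
    }
    queueInfos[0].queueFamilyIndex = graphicsFamily;
    queueInfos[1].queueFamilyIndex = computeOnlyFamily;

    VkDeviceCreateInfo deviceInfo{VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
    deviceInfo.queueCreateInfoCount = 2;
    deviceInfo.pQueueCreateInfos = queueInfos;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
    vkGetDeviceQueue(device, graphicsFamily, 0, &graphicsQueue);
    vkGetDeviceQueue(device, computeOnlyFamily, 0, &computeQueue);
    // Submissions to graphicsQueue and computeQueue can now overlap on
    // hardware that schedules these families independently.
    return device;
}
```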
Here is further detail on the implementation: https://kompute.cc/overview/async-parallel.html