Using different streams for CUDA kernels makes concurrent kernel execution possible. Therefore, n kernels on n streams could theoretically run concurrently if they fit onto the hardware, right?
Now I'm facing the following problem: there are not n distinct kernels but n*m, where the m kernels need to be executed in order. For instance, n=2 and m=3 would lead to the following execution scheme with streams:
Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>
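To make the setup concrete, here is a minimal CUDA C sketch of this scheme, assuming a hypothetical dummyKernel and hard-coded n=2, m=3. It relies only on the guarantee that kernels issued to the same stream execute in issue order, while kernels in different streams may overlap:

__global__ void dummyKernel(int j) { /* stand-in for Kernel j.(s+1) */ }

int main() {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    // In-stream ordering enforces Kernel 0.s -> 1.s -> 2.s within each stream;
    // across the two streams the hardware is free to overlap execution.
    for (int j = 0; j < 3; ++j) {
        dummyKernel<<<1, 64, 0, stream[0]>>>(j);   // Kernel j.1
        dummyKernel<<<1, 64, 0, stream[1]>>>(j);   // Kernel j.2
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    return 0;
}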
My naive assumption is that the kernels x.1 and y.2 should execute concurrently (from a theoretical point of view), or at least not consecutively (from a practical point of view). But my measurements show that this is not the case and that consecutive execution is performed (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.
Now my approach would be to implement a kind of dispatching to make sure that the kernels are enqueued into the GPU's scheduler in an interleaved fashion. But when dealing with a large number of streams/kernels, this could do more harm than good.
Alright, coming straight to the point: what would be an appropriate (or at least different) approach to this problem?
Edit: Measurements are done using CUDA events. I've measured the time needed to fully solve the computation, i.e. the GPU has to compute all n*m kernels. The assumption is: with fully concurrent kernel execution, the total execution time is roughly (ideally) 1/n of the time needed to execute all kernels in order, which requires that two or more kernels can be executed concurrently. I'm ensuring this by using only two distinct streams right now.
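For reference, a sketch of that measurement, assuming a hypothetical launchAllKernels() that enqueues all n*m kernels (see the loops below) and streams created with cudaStreamCreate (i.e. blocking streams). With the legacy default-stream semantics of this CUDA generation, an event recorded on stream 0 after all launches completes only once every kernel in every stream has finished:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);   // stream 0 = legacy default stream
launchAllKernels();          // hypothetical: enqueues all n*m kernels
cudaEventRecord(stop, 0);    // implicitly waits for all blocking streams
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("total time for n*m kernels: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);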
I can measure a clear difference in execution times between using the streams as described and dispatching the kernels in an interleaved fashion, i.e.:
Loop: i = 0 to m
    EnqueueKernel(Kernel i.1, Stream 1)
    EnqueueKernel(Kernel i.2, Stream 2)
versus
Loop: i = 1 to n
    Loop: j = 0 to m
        EnqueueKernel(Kernel j.i, Stream i)
The latter leads to a longer runtime.
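In CUDA C, the two variants would look roughly like this (a sketch, assuming a generic kernel k, launch dimensions grid and block, and an array stream[] of created streams; indices are 0-based here, unlike the 1-based stream numbers above):

// Variant 1: breadth-first / interleaved enqueue (the faster one)
for (int i = 0; i < m; ++i) {
    k<<<grid, block, 0, stream[0]>>>(i);   // Kernel i.1
    k<<<grid, block, 0, stream[1]>>>(i);   // Kernel i.2
}

// Variant 2: depth-first, one stream after the other (the slower one)
for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
        k<<<grid, block, 0, stream[i]>>>(j);   // Kernel j.(i+1)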
Edit #2: Changed the stream numbers to begin at 1 (instead of 0, see comments below).
Edit #3: Hardware is an NVIDIA Tesla M2090 (i.e. Fermi, compute capability 2.0).
On Fermi (aka Compute Capability 2.0) hardware it is best to interleave kernel launches to multiple streams rather than to launch all kernels to one stream, then the next stream, etc. This is because the hardware can immediately launch kernels to different streams if there are sufficient resources, whereas if subsequent launches are to the same stream there is often delay introduced, reducing concurrency. This is the reason that your first approach performs better, and this approach is the one you should choose.
Enabling profiling can also disable concurrency on Fermi, so be careful with that. Also, be careful about using CUDA events during your launch loop, as these can interfere -- best to time the whole loop using events as you are doing, for example.