For the application that I'm currently developing, I want a long kernel (that is, a kernel that takes long to finish relative to the others) to execute concurrently with a sequence of multiple shorter kernels that also run concurrently. What makes this more complicated, however, is that the four shorter kernels need to be synchronised after they're done, in order to execute another short kernel that collects and processes the data output by the other short kernels.
The following is a schematic of what I have in mind, with the numbered green bars representing different kernels:
[schematic image]
In order to achieve this, I have written code that looks somewhat like the following:
// definitions of kernels 1-6

class Calc
{
    cudaStream_t stream[5]; // member, so compute() and the destructor can see it

public:
    Calc()
    {
        // ...
        for (int i = 0; i < 5; i++) cudaStreamCreate(&stream[i]);
        // ...
    }
    ~Calc()
    {
        // ...
        for (int i = 0; i < 5; i++) cudaStreamDestroy(stream[i]);
        // ...
    }
    void compute()
    {
        kernel1<<<32, 32, 0, stream[0]>>>(...);
        for (int i = 0; i < 20; i++) // this 20 is a constant throughout the program
        {
            kernel2<<<1, 32, 0, stream[1]>>>(...);
            kernel3<<<1, 32, 0, stream[2]>>>(...);
            kernel4<<<1, 32, 0, stream[3]>>>(...);
            kernel5<<<1, 32, 0, stream[4]>>>(...);
            // ?? synchronisation ??
            kernel6<<<1, 32, 0, stream[1]>>>(...);
        }
    }
};

int main()
{
    // preparation
    Calc C;
    // run compute-heavy function as many times as needed
    for (int i = 0; i < 100; i++)
    {
        C.compute();
    }
    // ...
    return 0;
}
Note: the numbers of blocks and threads, and the shared memory sizes, are just arbitrary numbers.
Now, how would I go about properly synchronising kernels 2–5 every iteration? For one, I don't know which of the kernels will take the longest to complete, as this may depend on user input. Furthermore, I've tried using cudaDeviceSynchronize() and cudaStreamSynchronize(), but those more than trebled the total execution time.
Are CUDA events perhaps the way to go? If so, how should I apply them? If not, what would be the proper way to do this?
Thank you very much.
There are two comments that need to be made first.
Launching small kernels (one block) is generally not the way to get good performance out of the GPU. Likewise, kernels with a small number of threads per block (32) will generally impose an occupancy limit that prevents full performance from the GPU. Launching multiple concurrent kernels doesn't mitigate this second consideration. I'll not spend any further time here since you've said the numbers are arbitrary (but see the next comment below).
Witnessing actual kernel concurrency is hard. We need kernels with a relatively long execution time but a relatively low demand on GPU resources. A kernel launched as <<<32, 32>>> could possibly fill the GPU you are running on, preventing any blocks from a concurrent kernel from running.
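For experimenting with concurrency, one common pattern is a kernel that spins on clock64() for a fixed number of cycles: long runtime, negligible resource demand. A minimal sketch (the name delay_kernel and the cycle count are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernel: runs for roughly `cycles` GPU clock cycles
// while occupying only one block of 32 threads.
__global__ void delay_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

int main()
{
    cudaStream_t a, b;
    cudaStreamCreate(&a);
    cudaStreamCreate(&b);
    // One block of 32 threads each: long runtime, tiny footprint,
    // so both kernels can be resident on the GPU simultaneously.
    delay_kernel<<<1, 32, 0, a>>>(100000000LL);
    delay_kernel<<<1, 32, 0, b>>>(100000000LL);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaStreamDestroy(a);
    cudaStreamDestroy(b);
    return 0;
}
```

Profiling this with a tool such as Nsight Systems should show the two launches overlapping on devices that support concurrent kernel execution.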
Your question seems to boil down to "how do I prevent kernel6 from starting until kernel2-5 are finished?"
It's possible to use events for this. Basically, you would record an event into each stream after the kernel2-5 launches, and you would put a cudaStreamWaitEvent call, one for each of the 4 events, prior to the launch of kernel6.
Like so:
kernel2<<<1, 32, 0, stream[1]>>>(...);
cudaEventRecord(event1, stream[1]);
kernel3<<<1, 32, 0, stream[2]>>>(...);
cudaEventRecord(event2, stream[2]);
kernel4<<<1, 32, 0, stream[3]>>>(...);
cudaEventRecord(event3, stream[3]);
kernel5<<<1, 32, 0, stream[4]>>>(...);
cudaEventRecord(event4, stream[4]);
// ?? synchronisation ??
cudaStreamWaitEvent(stream[1], event1, 0);
cudaStreamWaitEvent(stream[1], event2, 0);
cudaStreamWaitEvent(stream[1], event3, 0);
cudaStreamWaitEvent(stream[1], event4, 0);
kernel6<<<1, 32, 0, stream[1]>>>(...);
Note that all of the above calls are asynchronous. None of them should take more than a few microseconds to process, and none of them will block the CPU thread from continuing, unlike your usage of cudaDeviceSynchronize() or cudaStreamSynchronize(), which generally will block the CPU thread.
As a result, you may want some kind of synchronization after the above sequence (e.g. cudaStreamSynchronize(stream[1]);) if it is performed in a loop, or else the asynchronous nature of all this is going to get hairy to figure out (plus, based on your schematic diagram, it seems you probably don't want kernel2-5 of iteration i+1 to begin until kernel6 of iteration i is finished). Note that I've left out event creation and perhaps other boilerplate; I'm assuming you can figure that out or refer to any of the sample codes that use events, or refer to the documentation.
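Putting that together, here is a fuller sketch of one pass through the loop, including the event boilerplate omitted above. The empty placeholder kernels and the name ev are illustrative, not part of your original code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for your own kernel2..kernel6.
__global__ void kernel2() {}
__global__ void kernel3() {}
__global__ void kernel4() {}
__global__ void kernel5() {}
__global__ void kernel6() {}

int main()
{
    cudaStream_t stream[5];
    cudaEvent_t ev[4];
    for (int i = 0; i < 5; i++) cudaStreamCreate(&stream[i]);
    // cudaEventDisableTiming makes record/wait cheaper when you only
    // need ordering, not timestamps.
    for (int i = 0; i < 4; i++)
        cudaEventCreateWithFlags(&ev[i], cudaEventDisableTiming);

    for (int iter = 0; iter < 20; iter++)
    {
        kernel2<<<1, 32, 0, stream[1]>>>();
        cudaEventRecord(ev[0], stream[1]);
        kernel3<<<1, 32, 0, stream[2]>>>();
        cudaEventRecord(ev[1], stream[2]);
        kernel4<<<1, 32, 0, stream[3]>>>();
        cudaEventRecord(ev[2], stream[3]);
        kernel5<<<1, 32, 0, stream[4]>>>();
        cudaEventRecord(ev[3], stream[4]);
        // kernel6 waits (on the device, not the host) for all four events.
        for (int i = 0; i < 4; i++)
            cudaStreamWaitEvent(stream[1], ev[i], 0);
        kernel6<<<1, 32, 0, stream[1]>>>();
        // Events are simply re-recorded next iteration. kernel2 of the next
        // iteration is issued to stream[1] behind kernel6, so it is ordered
        // after kernel6; streams 2-4 are not, without extra waits.
    }
    cudaStreamSynchronize(stream[1]); // block the host once, at the end

    for (int i = 0; i < 4; i++) cudaEventDestroy(ev[i]);
    for (int i = 0; i < 5; i++) cudaStreamDestroy(stream[i]);
    return 0;
}
```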
And even if you implement all this infrastructure, your ability to witness (or not) actual kernel concurrency will be dictated by your kernels themselves, not anything I've suggested in this answer. So if you come back and say "I did that, but my kernels are not running concurrently", that is actually a different question than the one you have posed here, and I would refer you, for starters, to my comment #2 above.