I have a CUDA kernel that is called from within a for loop. Something like
for (int i = 0; i < 10; i++) {
    myKernel<<<1000, 256>>>(A, i);
}
Assume now that I have an NVIDIA card with 15 Streaming Multiprocessors (SMs). Also assume, for simplicity, that only one block can be mapped onto an SM at a time, which means that most of the time 15 blocks will be executing on the device. Since kernel launches are asynchronous, the call with i=1 is queued for execution right after the first kernel (the one with i=0) is launched.
My question is this: as the first kernel (with i=0) winds down, at some point only 14 SMs will be busy, then only 13, then 12, then 11, and so on.
Would the kernel with i=1 be sent for execution on the device as soon as an SM becomes available, or will the launch of this second kernel wait until all the SMs have finished with the first kernel (the one with i=0)?
Assume also that I'm working within one CUDA stream.
Kernel launches in the same stream are serialized: the kernel with i=1 will not begin executing until every block of the i=0 kernel has finished, even though SMs go idle toward the end of the first kernel's run. Kernel invocations from different streams may overlap, given enough resources (SMs, registers, shared memory, etc.).
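If overlap is what you're after, the loop can be rewritten to use one stream per launch. The sketch below is a minimal, hypothetical example (the kernel body, the array A, and the per-launch slicing are assumptions standing in for your myKernel); it only permits overlap, it does not guarantee it:

#include <cuda_runtime.h>

// Placeholder kernel standing in for myKernel from the question:
// each thread does a trivial amount of work on its element of A.
__global__ void myKernel(float *A, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] += (float)i;
}

int main()
{
    const int numLaunches     = 10;
    const int numBlocks       = 1000;
    const int threadsPerBlock = 256;
    const size_t perLaunch    = (size_t)numBlocks * threadsPerBlock;

    // One slice of A per launch, so concurrently running kernels
    // never touch the same elements (assumed layout, not from the question).
    float *A;
    cudaMalloc(&A, numLaunches * perLaunch * sizeof(float));
    cudaMemset(A, 0, numLaunches * perLaunch * sizeof(float));

    // One stream per launch: launches in different streams are
    // independent as far as the driver is concerned, so the
    // hardware is free (but not obliged) to run them concurrently.
    cudaStream_t streams[numLaunches];
    for (int i = 0; i < numLaunches; i++)
        cudaStreamCreate(&streams[i]);

    // Fourth launch parameter selects the stream
    // (third is dynamic shared memory, unused here).
    for (int i = 0; i < numLaunches; i++)
        myKernel<<<numBlocks, threadsPerBlock, 0, streams[i]>>>(A + i * perLaunch, i);

    cudaDeviceSynchronize();

    for (int i = 0; i < numLaunches; i++)
        cudaStreamDestroy(streams[i]);
    cudaFree(A);
    return 0;
}

With the original single-stream loop, by contrast, the second launch always waits for the first kernel to drain completely; the scheduler will not place its blocks on SMs freed partway through.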