
CUDA kernel call from within a for loop

Tags:

cuda

I have a CUDA kernel that is called from within a for loop. Something like

for (int i = 0; i < 10; i++) {
    myKernel<<<1000, 256>>>(A, i);
}

Assume now that I have an NVIDIA card with 15 streaming multiprocessors (SMs). Also assume, for simplicity, that only one block can be mapped onto an SM, so most of the time 15 blocks will be executing on the device. Since kernel launches are asynchronous, the call with i=1 is queued for execution right after the first kernel (the one with i=0) is launched.

My question is this: at some point while the first kernel (with i=0) is finishing, only 14 SMs will be busy, then only 13, then 12, then 11, and so on.

Would the kernel with i=1 be sent for execution on the device as soon as one SM becomes available, or will the launch of this second kernel wait until all the SMs have finished with the first kernel (the one with i=0)?

Assume also that I'm working within one CUDA stream.
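
For concreteness, here is a minimal self-contained sketch of this setup; the body of myKernel is just a placeholder that writes the loop index into A:

#include <cuda_runtime.h>

// Placeholder body: each thread writes the loop index into A.
__global__ void myKernel(int *A, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] = i;
}

int main()
{
    const int nBlocks = 1000, nThreads = 256;
    int *A;
    cudaMalloc(&A, nBlocks * nThreads * sizeof(int));

    // Every launch goes into the default stream: each call returns to the
    // host immediately and the kernel is queued behind the previous one.
    for (int i = 0; i < 10; i++) {
        myKernel<<<nBlocks, nThreads>>>(A, i);
    }

    cudaDeviceSynchronize();   // host blocks here until all queued kernels finish
    cudaFree(A);
    return 0;
}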

asked Oct 06 '22 by user1586099

1 Answer

Kernel launches in the same stream are serialized. Kernel invocations from different streams may be overlapped, given enough resources (SMs, registers, shared memory, etc.).
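
As an illustrative sketch of the second case, you could give each launch its own stream so that successive kernels are at least eligible to overlap (whether they actually do depends on available resources); the kernel name and launch configuration below mirror the question:

#include <cuda_runtime.h>

__global__ void myKernel(int *A, int i);   // same kernel as in the question

void launchInStreams(int *A)
{
    const int N = 10;
    cudaStream_t streams[N];

    for (int i = 0; i < N; i++)
        cudaStreamCreate(&streams[i]);

    // Launches in different streams may overlap if the device has spare
    // SMs/registers/shared memory; launches within one stream never do.
    // (Kernels running concurrently on the same buffer A would race, so in
    // practice each stream would usually work on its own slice of the data.)
    for (int i = 0; i < N; i++)
        myKernel<<<1000, 256, 0, streams[i]>>>(A, i);

    cudaDeviceSynchronize();               // wait for all streams to drain

    for (int i = 0; i < N; i++)
        cudaStreamDestroy(streams[i]);
}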

answered Oct 30 '22 by Eugene