
CUDA kernel call from within a for loop

Tags:

cuda

I have a CUDA kernel that is called from within a for loop. Something like

for (int i = 0; i < 10; i++) {
    myKernel<<<1000, 256>>>(A, i);
}

Assume now that I have an NVIDIA card with 15 streaming multiprocessors (SMs). Also assume, for simplicity, that only one block can be mapped onto an SM, so most of the time 15 blocks will be executing on the device. Since kernel launches are asynchronous, the call with i=1 is queued for execution right after the first kernel (the one with i=0) is launched.

My question is this: at some point while the first kernel (with i=0) is finishing, only 14 SMs will be busy, then only 13, then 12, then 11, and so on.

Would the kernel with i=1 be sent for execution on the device as soon as one SM becomes available, or will the launch of this second kernel wait until all the SMs have finished with the first kernel (the one with i=0)?

Assume also that I'm working within one CUDA stream.
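
For concreteness, here is a minimal self-contained sketch of this setup; the body of myKernel is just a placeholder that writes the loop index into A:

#include <cuda_runtime.h>

// Placeholder body: each thread writes the loop index into A.
__global__ void myKernel(int *A, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] = i;
}

int main()
{
    const int nBlocks = 1000, nThreads = 256;
    int *A;
    cudaMalloc(&A, nBlocks * nThreads * sizeof(int));

    // Every launch goes into the default stream: each call returns to the
    // host immediately and the kernel is queued behind the previous one.
    for (int i = 0; i < 10; i++) {
        myKernel<<<nBlocks, nThreads>>>(A, i);
    }

    cudaDeviceSynchronize();   // host blocks here until all queued kernels finish
    cudaFree(A);
    return 0;
}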

asked Oct 06 '22 by user1586099

1 Answer

Kernel launches in the same stream are serialized. Kernel invocations from different streams may be overlapped, given enough resources (SMs, registers, shared memory, etc.).
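
As an illustrative sketch of the second case, you could give each launch its own stream so that successive kernels are at least eligible to overlap (whether they actually do depends on available resources); the kernel name and launch configuration below mirror the question:

#include <cuda_runtime.h>

__global__ void myKernel(int *A, int i);   // same kernel as in the question

void launchInStreams(int *A)
{
    const int N = 10;
    cudaStream_t streams[N];

    for (int i = 0; i < N; i++)
        cudaStreamCreate(&streams[i]);

    // Launches in different streams may overlap if the device has spare
    // SMs/registers/shared memory; launches within one stream never do.
    // (Kernels running concurrently on the same buffer A would race, so in
    // practice each stream would usually work on its own slice of the data.)
    for (int i = 0; i < N; i++)
        myKernel<<<1000, 256, 0, streams[i]>>>(A, i);

    cudaDeviceSynchronize();               // wait for all streams to drain

    for (int i = 0; i < N; i++)
        cudaStreamDestroy(streams[i]);
}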

answered Oct 30 '22 by Eugene