Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:
__global__ void kernel(float* delt, float* deltb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    float a;                                        // declared inside the kernel
    a = delt[i] + deltb[i];
    a += 1;
}
and the kernel call looks something like the one below, with multiple threads and blocks:
int threads = 200;
uint3 blocks = make_uint3(200, 1, 1);
kernel<<<blocks, threads>>>(d_delt, d_deltb);
Does each thread get its own copy of a, or can the threads race on it?
Any variable (scalar or array) declared inside a kernel function without a __shared__ qualifier is local to each thread: each thread gets its own private copy of that variable, so no data race among threads can occur.
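To make that concrete, here is a minimal sketch (kernel name and out parameter are hypothetical) contrasting a per-thread automatic variable with a per-block __shared__ variable, which is the case that does need care:

__global__ void per_thread_vs_shared(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 1.0f;       // automatic variable: one private copy per thread, cannot race
    __shared__ float s;   // shared variable: one copy per block, visible to all its threads
    if (threadIdx.x == 0)
        s = 2.0f;         // writes to shared memory need synchronisation...
    __syncthreads();      // ...before other threads in the block read the value
    out[i] = a + s;
}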
The compiler chooses whether a local variable resides in registers or in local memory (which is physically located in off-chip global memory), depending on the transformations and optimisations it performs.
Further details on which variables are placed in local memory can be found in the NVIDIA CUDA C Programming Guide, section 5.3.2.2.
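One common case the guide describes is a local array that is large or indexed with values the compiler cannot resolve at compile time; a sketch (names are hypothetical):

__global__ void likely_local(const int* idx, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float buf[64];                // largish per-thread array
    for (int k = 0; k < 64; ++k)
        buf[k] = 0.5f * k;
    out[i] = buf[idx[i] & 63];    // dynamic indexing usually forces buf into
}                                 // local memory, since registers are not addressable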
None of the above. The CUDA compiler is smart enough and aggressive enough with optimisations that it can detect that a is unused, so the complete code can be optimised away. You can confirm this by compiling the kernel with -Xptxas=-v and looking at the resource counts, which should show essentially no registers and no local memory or stack frame.
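For example (file name hypothetical):

nvcc -Xptxas=-v kernel.cu -o kernel

During compilation, ptxas then prints the register, shared, constant and local memory usage for each kernel.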
In a less trivial example, a would probably be stored in a per-thread register, or in per-thread local memory, which is off-die DRAM.
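If you want a to survive optimisation so you can see where it actually ends up, give it an observable side effect. A variant of the question's kernel, with a hypothetical out parameter:

__global__ void kernel_live(float* delt, float* deltb, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = delt[i] + deltb[i];
    a += 1;
    out[i] = a;   // the store keeps a live, so storage (normally a register) must be allocated for it
}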