Hey, I've seen this example kernel on a website:
__global__ void loop1( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    for(i=i0; i<N; i+=blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}
It computes the same thing as this C loop:
for(i=0; i<N; i++) {
    y[i] = alpha*x[i] + y[i];
}
Surely the for loop inside the kernel isn't necessary? Couldn't you just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether?
I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>, so gridDim.x = 64 and blockDim.x = 256 (16384 threads in total).
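For reference, the loop-free version I have in mind would be something like this (hypothetical name loop1_noloop; the i0 < N guard is there so threads past the end of the array don't write out of bounds):

__global__ void loop1_noloop( int N, float alpha, float* x, float* y ) {
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    if (i0 < N) {                        // guard: threads beyond N do nothing
        y[i0] = alpha*x[i0] + y[i0];     // each thread handles exactly one element
    }
}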
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.
This pattern is known as a grid-stride loop: as the name suggests, the stride of the loop is the total number of threads in the grid (i.e. blockDim.x * gridDim.x).
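A full kernel using this pattern (NVIDIA's commonly cited add example) looks roughly like the following sketch; it is the same grid-stride idea as loop1 above, just without the alpha scaling:

__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;   // first element handled by this thread
    int stride = blockDim.x * gridDim.x;                 // total number of threads in the grid
    for (int i = index; i < n; i += stride)              // grid-stride loop
        y[i] = x[i] + y[i];
}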
A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads. Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
You need the for loop in the kernel if your vector has more entries than the number of threads you have started. If possible, it is of course more efficient to launch enough threads so that each thread handles exactly one element.
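For example, a host-side launch that starts at least one thread per element (so each thread's loop body runs at most once) could look something like this sketch, using the usual round-up idiom for the block count:

int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up so all N elements are covered
loop1<<<numBlocks, threadsPerBlock>>>(N, alpha, x, y);

With a fixed launch such as loop1<<<64,256>>> there are only 64*256 = 16384 threads, so for any larger N each thread has to loop, visiting elements spaced blockDim.x*gridDim.x apart.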