Hey, I've seen this example kernel on a website:
__global__ void loop1( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    for(i=i0; i<N; i+=blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}
It computes the same thing as this C loop:
for(i=0; i<N; i++) {
    y[i] = alpha*x[i] + y[i];
}
Surely the for loop inside the kernel isn't necessary? Couldn't you just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether?
I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>, so gridDim.x = 64 and blockDim.x = 256 (16384 threads in total).
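For reference, the loop-free version I have in mind would be something like this (hypothetical name loop1_noloop; the i0 < N guard is there so threads past the end of the array don't write out of bounds):

__global__ void loop1_noloop( int N, float alpha, float* x, float* y ) {
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    if (i0 < N) {                        // guard: threads beyond N do nothing
        y[i0] = alpha*x[i0] + y[i0];     // each thread handles exactly one element
    }
}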
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.
This pattern is known as a grid-stride loop: as the name suggests, the stride of the loop is the total number of threads in the grid (i.e. blockDim.x * gridDim.x).
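A full kernel using this pattern (NVIDIA's commonly cited add example) looks roughly like the following sketch; it is the same grid-stride idea as loop1 above, just without the alpha scaling:

__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;   // first element handled by this thread
    int stride = blockDim.x * gridDim.x;                 // total number of threads in the grid
    for (int i = index; i < n; i += stride)              // grid-stride loop
        y[i] = x[i] + y[i];
}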
A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads. Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
You need the for loop in the kernel if your vector has more entries than the number of threads you have started. If possible, it is of course more efficient to launch enough threads so that each thread handles exactly one element.
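For example, a host-side launch that starts at least one thread per element (so each thread's loop body runs at most once) could look something like this sketch, using the usual round-up idiom for the block count:

int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up so all N elements are covered
loop1<<<numBlocks, threadsPerBlock>>>(N, alpha, x, y);

With a fixed launch such as loop1<<<64,256>>> there are only 64*256 = 16384 threads, so for any larger N each thread has to loop, visiting elements spaced blockDim.x*gridDim.x apart.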