Understanding this CUDA kernel's launch parameters

Tags:

cuda

I am trying to analyze some code I found online and I keep thinking myself into a corner. I am looking at a histogram kernel launched with the following parameters:

histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...); 

I know that the parameters are the grid size, the block size, and the dynamic shared memory size.

So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?

Also, within the kernel itself there are calls to __syncthreads(). Are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?

asked Nov 06 '14 by KDecker


1 Answer

So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?

From the CUDA Toolkit documentation:

The execution configuration (of a __global__ function call) is specified by inserting an expression of the form <<<Dg,Db,Ns,S>>>, where:

  • Dg (dim3) specifies the dimension and size of the grid.
  • Db (dim3) specifies the dimension and size of each block.
  • Ns (size_t) specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory.
  • S (cudaStream_t) specifies the associated stream; this is an optional parameter which defaults to 0.

So, as @Fazar pointed out, the answer is yes. This memory is allocated per block.
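For concreteness, here is a minimal sketch of that launch with all four configuration parameters spelled out. The kernel body, the value of numBins, and the buffer names are assumptions, since the original code is not shown:

#include <cuda_runtime.h>

// Hypothetical placeholder: the real histogram kernel from the question is
// not shown, so only the signature and shared-memory declaration matter here.
__global__ void histogram(const unsigned int *in, unsigned int *bins, int n)
{
    extern __shared__ unsigned int smem[];  // the Ns bytes, one array per block
    // ... body elided ...
}

int main()
{
    const int numBins = 256;                     // assumed bin count
    const int n = 1 << 20;                       // assumed input size

    unsigned int *d_in, *d_bins;
    cudaMalloc(&d_in,   n * sizeof(unsigned int));
    cudaMalloc(&d_bins, numBins * sizeof(unsigned int));

    dim3 Dg(2500);                               // grid: 2500 blocks
    dim3 Db(numBins);                            // block: numBins threads each
    size_t Ns = numBins * sizeof(unsigned int);  // dynamic shared memory per block
    cudaStream_t S = 0;                          // default stream

    // Same launch as in the question, with Dg, Db, Ns, S made explicit:
    histogram<<<Dg, Db, Ns, S>>>(d_in, d_bins, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_bins);
    return 0;
}

Each of the 2500 blocks gets its own numBins * sizeof(unsigned int) slice of dynamic shared memory, visible inside the kernel through the extern __shared__ array.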

Also, within the kernel itself there are calls to __syncthreads(). Are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?

__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between threads in the same block.

So each __syncthreads() acts as a barrier per block: all of a block's threads must reach it before any of them proceed, and each of the 2500 blocks synchronizes independently. Blocks are never synchronized with each other by __syncthreads().
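To make that concrete, here is a minimal sketch of the barrier pattern a shared-memory histogram typically uses. This is an assumed implementation, not the kernel from the question, and it assumes blockDim.x == numBins and that every input value is a valid bin index:

__global__ void histogram(const unsigned int *in, unsigned int *bins, int n)
{
    extern __shared__ unsigned int smem[];  // numBins counters per block

    // Phase 1: each thread zeroes one shared bin.
    smem[threadIdx.x] = 0;
    __syncthreads();  // barrier: every thread of THIS block has zeroed its bin

    // Phase 2: grid-stride loop accumulating into this block's shared bins.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&smem[in[i]], 1u);  // assumes in[i] < numBins
    __syncthreads();  // barrier: this block's partial histogram is complete

    // Phase 3: merge the block's partial result into the global histogram.
    atomicAdd(&bins[threadIdx.x], smem[threadIdx.x]);
}

Each barrier only synchronizes the threads within one block; the 2500 blocks run and synchronize independently, which is why the per-block partial histograms must be merged with global atomics at the end.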

answered Sep 22 '22 by srodrb