I am trying to analyze some code I have found online and I keep thinking myself into a corner. I am looking at a histogram kernel launched with the following parameters
histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...);
I know that these parameters are the grid size, block size, and dynamic shared-memory size.
So does that mean that there are 2500 blocks of numBins
threads each, each block also having a numBins * sizeof(unsigned int)
chunk of shared memory available to its threads?
Also, within the kernel itself there are calls to __syncthreads(); are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?
In order to run a kernel on the CUDA threads, we need two things. First, in the main() function of the program, we call the function to be executed by each thread on the GPU. This invocation is called a kernel launch, and with it we need to provide the number of threads and their grouping.
Figure 1 shows that a CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only one time like a regular C/C++ function. Figure 1. The kernel is a function executed on the GPU.
__global__ : A qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
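As a minimal sketch of such a launch (a hypothetical fill kernel with an assumed numBins of 256, not the code from the question), the same launch shape looks like this:

```cuda
#include <cstdio>

// Hypothetical kernel: each thread writes its own global index.
// __global__ marks it for compilation to device (GPU) code.
__global__ void fill(unsigned int *out) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main() {
    const int numBlocks = 2500;  // grid size, as in the question
    const int numBins   = 256;   // block size; assumed value
    unsigned int *d_out;
    cudaMalloc(&d_out, numBlocks * numBins * sizeof(unsigned int));
    // <<<grid, block>>>: 2500 blocks of 256 threads = 640000 threads total
    fill<<<numBlocks, numBins>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```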
So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?
From the CUDA Toolkit documentation:
The execution configuration (of a global function call) is specified by inserting an expression of the form <<<Dg, Db, Ns, S>>>, where:
- Dg specifies the dimension and size of the grid (the number of blocks);
- Db specifies the dimension and size of each block (the number of threads per block);
- Ns specifies the number of bytes of shared memory dynamically allocated per block, in addition to statically allocated shared memory (defaults to 0);
- S specifies the associated stream (defaults to 0).
So, as @Fazar pointed out, the answer is yes. This memory is allocated per block.
Also, within the kernel itself there are calls to __syncthreads(), are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?
__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between threads in the same block.

So __syncthreads() is a per-block barrier: every thread executes the call, but it only synchronizes the threads of its own block. With 2500 blocks of numBins threads, each __syncthreads() statement in the kernel is reached by all of those threads, yet there is never any synchronization across blocks.
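A typical shared-memory histogram kernel (a generic sketch, not the actual code the question refers to) shows where the barriers sit. It assumes the launch from the question, histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...), so blockDim.x == numBins and each thread owns one bin:

```cuda
// Generic shared-memory histogram sketch; assumes blockDim.x == numBins.
__global__ void histogram(const unsigned int *input, unsigned int *bins,
                          unsigned int numElems) {
    // Backed by the Ns launch argument; one private copy per block.
    extern __shared__ unsigned int localBins[];

    localBins[threadIdx.x] = 0;  // each thread zeroes its own bin
    __syncthreads();             // barrier among THIS block's threads only

    // Grid-stride loop: accumulate into the block-private histogram.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElems; i += gridDim.x * blockDim.x) {
        atomicAdd(&localBins[input[i]], 1u);
    }
    __syncthreads();             // all of this block's updates are done

    // Merge the block's partial histogram into the global one.
    atomicAdd(&bins[threadIdx.x], localBins[threadIdx.x]);
}
```

Each of the two __syncthreads() barriers above is executed by every one of the 2500 × numBins threads, but a barrier only makes the threads of one block wait for each other; blocks never wait for one another.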