Understanding this CUDA kernel's launch parameters

Tags:

cuda

I am trying to analyze some code I found online and I keep thinking myself into a corner. I am looking at a histogram kernel launched with the following parameters:

histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...); 

I know that the parameters are the grid size, the block size, and the dynamic shared memory size.

So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?

Also, within the kernel itself there are calls to __syncthreads(). Are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?

asked Nov 06 '14 by KDecker


1 Answer

So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?

From the CUDA Toolkit documentation:

The execution configuration (of a __global__ function call) is specified by inserting an expression of the form <<<Dg,Db,Ns,S>>>, where:

  • Dg (dim3) specifies the dimension and size of the grid.
  • Db (dim3) specifies the dimension and size of each block.
  • Ns (size_t) specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory.
  • S (cudaStream_t) specifies the associated stream; this is an optional parameter which defaults to 0.

So, as @Fazar pointed out, the answer is yes. This memory is allocated per block.
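For concreteness, here is a minimal sketch of that launch with all four configuration parameters spelled out. The kernel body, the value of numBins, and the buffer names are assumptions, since the original code is not shown:

#include <cuda_runtime.h>

// Hypothetical placeholder: the real histogram kernel from the question is
// not shown, so only the signature and shared-memory declaration matter here.
__global__ void histogram(const unsigned int *in, unsigned int *bins, int n)
{
    extern __shared__ unsigned int smem[];  // the Ns bytes, one array per block
    // ... body elided ...
}

int main()
{
    const int numBins = 256;                     // assumed bin count
    const int n = 1 << 20;                       // assumed input size

    unsigned int *d_in, *d_bins;
    cudaMalloc(&d_in,   n * sizeof(unsigned int));
    cudaMalloc(&d_bins, numBins * sizeof(unsigned int));

    dim3 Dg(2500);                               // grid: 2500 blocks
    dim3 Db(numBins);                            // block: numBins threads each
    size_t Ns = numBins * sizeof(unsigned int);  // dynamic shared memory per block
    cudaStream_t S = 0;                          // default stream

    // Same launch as in the question, with Dg, Db, Ns, S made explicit:
    histogram<<<Dg, Db, Ns, S>>>(d_in, d_bins, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_bins);
    return 0;
}

Each of the 2500 blocks gets its own numBins * sizeof(unsigned int) slice of dynamic shared memory, visible inside the kernel through the extern __shared__ array.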

Also, within the kernel itself there are calls to __syncthreads(). Are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?

__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between threads in the same block.

So each __syncthreads() acts as a barrier per block: all of a block's threads must reach it before any of them proceed, and each of the 2500 blocks synchronizes independently. Blocks are never synchronized with each other by __syncthreads().
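To make that concrete, here is a minimal sketch of the barrier pattern a shared-memory histogram typically uses. This is an assumed implementation, not the kernel from the question, and it assumes blockDim.x == numBins and that every input value is a valid bin index:

__global__ void histogram(const unsigned int *in, unsigned int *bins, int n)
{
    extern __shared__ unsigned int smem[];  // numBins counters per block

    // Phase 1: each thread zeroes one shared bin.
    smem[threadIdx.x] = 0;
    __syncthreads();  // barrier: every thread of THIS block has zeroed its bin

    // Phase 2: grid-stride loop accumulating into this block's shared bins.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&smem[in[i]], 1u);  // assumes in[i] < numBins
    __syncthreads();  // barrier: this block's partial histogram is complete

    // Phase 3: merge the block's partial result into the global histogram.
    atomicAdd(&bins[threadIdx.x], smem[threadIdx.x]);
}

Each barrier only synchronizes the threads within one block; the 2500 blocks run and synchronize independently, which is why the per-block partial histograms must be merged with global atomics at the end.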

answered Sep 22 '22 by srodrb