CUDA thread addressing (threadIdx.x, threadIdx.y, threadIdx.z) and block addressing (blockIdx.x, blockIdx.y)

Tags:

cuda

I need to clarify something very basic. Most of the computational examples use something like:

int ID = blockIdx.x * blockDim.x + threadIdx.x;

// ... then do computation on array[ID]

My question is: if I want to use the maximum number of threads in a block (1024), do I really need to construct my thread ID from all of threadIdx.x, threadIdx.y, and threadIdx.z?

If so, what is a recommended way to hash it into a single value?

If not, why do some people use them that way in image-processing operations, such as in this post:

https://stackoverflow.com/questions/11503406/cuda-addressing-a-matrix

What about blockIdx.x and blockIdx.y? Are they in the same situation as threadIdx in this regard?

asked May 06 '13 17:05

Stochastika

2 Answers

Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it using a 2D or 3D threadblock may make the code more readable. But there's no specific reason why it cannot be done with a 1D threadblock with appropriate indexing.

Creating a 2D or 3D grid (of blocks) is usually done for the reason described above and/or to get around the limit, on devices of compute capability (CC) below 3.0, on the number of blocks in any one dimension of a grid (65535 max blocks in any dimension).

For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.
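If you do launch a 2D or 3D threadblock and still want one linear ID per thread, a common convention is to flatten the index with x varying fastest. The following is a minimal sketch under that assumption, for a 1D grid of such blocks; the kernel name `write_flat_ids` and the host helper `flat_ids_are_unique` are hypothetical names chosen for illustration, not part of any library.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes its own flattened ID into out[], so the host can check
// that every ID in [0, total) is produced.
__global__ void write_flat_ids(int *out, int total)
{
    // Linear index of this thread inside its block (x varies fastest).
    int tid_in_block = threadIdx.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.z * blockDim.x * blockDim.y;
    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;

    // Global ID, assuming a 1D grid of blocks.
    int id = blockIdx.x * threads_per_block + tid_in_block;
    if (id < total) out[id] = id;
}

// Hypothetical host helper: launches the kernel and returns true if every
// slot of the output holds its own index.
bool flat_ids_are_unique(dim3 grid, dim3 block)
{
    int total = grid.x * block.x * block.y * block.z;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, total * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_out, 0xFF, total * sizeof(int));  // every int becomes -1

    write_flat_ids<<<grid, block>>>(d_out, total);

    std::vector<int> h_out(total);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, total * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < total; ++i)
        if (h_out[i] != i) return false;
    return true;
}
```

Calling `flat_ids_are_unique(dim3(4), dim3(16, 8, 8))` and `flat_ids_are_unique(dim3(4), dim3(1024))` exercises a 3D and a 1D block of 1024 threads each; both shapes cover the same range of IDs, which is why the 1D form is enough when the data has no natural 2D or 3D layout.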

If you have a pre-CC 3.0 device, and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like this:

 int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);

The above construct should handle 1D threadblocks with any 2D grid.
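As a rough illustration of that construct in context, here is a sketch of a kernel launched with 1D threadblocks on a 2D grid. The kernel `add_one`, the helper `add_one_with_2d_grid`, and its `grid_x` parameter are hypothetical names for this example. Because a 2D grid generally rounds up to more threads than elements, the kernel guards with a bounds check.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void add_one(int *data, int n)
{
    // Flatten the 2D block index row-major, then add the 1D thread offset.
    int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x) * blockDim.x);
    if (idx < n) data[idx] += 1;  // the grid may overshoot n
}

// Hypothetical host helper: spreads the required blocks over a grid that is
// grid_x blocks wide, then checks that every element was touched exactly once.
bool add_one_with_2d_grid(int n, int threads, int grid_x)
{
    int blocks = (n + threads - 1) / threads;      // blocks needed in total
    int grid_y = (blocks + grid_x - 1) / grid_x;   // rows of blocks, rounded up
    dim3 grid(grid_x, grid_y);

    std::vector<int> h(n, 41);
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<grid, threads>>>(d, n);

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    // 41 would mean an element was skipped, 43 that it was visited twice.
    for (int v : h)
        if (v != 42) return false;
    return true;
}
```

For example, `add_one_with_2d_grid(100000, 256, 7)` needs 391 blocks and launches a 7 x 56 grid. Note that `idx` is an `int` here, as in the original expression; for arrays approaching 2^31 elements a wider type such as `size_t` would be the safer choice.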

There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.
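One widely used form of such a loop is the grid-stride loop, where each thread starts at its global ID and advances by the total number of threads in the grid. The sketch below assumes a 1D grid of 1D blocks; `add_one_strided` and `add_one_with_fixed_grid` are hypothetical names for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void add_one_strided(int *data, int n)
{
    int stride = blockDim.x * gridDim.x;  // total threads in the 1D grid
    // Each thread handles elements i, i + stride, i + 2 * stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1;
}

// Hypothetical host helper: uses a fixed, deliberately small grid so that
// each thread has to process several elements.
bool add_one_with_fixed_grid(int n, int blocks, int threads)
{
    std::vector<int> h(n, 41);
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    add_one_strided<<<blocks, threads>>>(d, n);

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int v : h)
        if (v != 42) return false;
    return true;
}
```

With `add_one_with_fixed_grid(100000, 8, 256)`, 2048 threads cover 100000 elements, so the block count stays far below any per-dimension grid limit regardless of the problem size.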

answered Sep 29 '22 08:09

Robert Crovella

This adds to Robert Crovella's answer:

Creating a 2D/3D grid is not just for readability; it also lets you exploit 2D/3D locality in the on-chip shared memory, which provides much faster accesses. If your problem works on a 2D grid, you cannot exploit such locality efficiently with a 1D grid.
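To make the locality point concrete, here is a sketch of a tiled matrix transpose, a standard case where a 2D block stages a 2D tile in shared memory. The choice of transpose, the 32 x 32 tile size, and the names `transpose_tiled` and `transpose_matches_reference` are assumptions made for this illustration and do not come from the answer itself.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32  // 32 x 32 = 1024 threads per block

// Transposes a height x width row-major matrix into a width x height one.
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  // extra column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    x = blockIdx.y * TILE + threadIdx.x;  // output column (output width is height)
    y = blockIdx.x * TILE + threadIdx.y;  // output row
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: runs the kernel and compares against a CPU transpose.
bool transpose_matches_reference(int width, int height)
{
    int n = width * height;
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_out[x * height + y] != h_in[y * width + x]) return false;
    return true;
}
```

Both the global read and the global write walk along `threadIdx.x`, so neighbouring threads touch neighbouring addresses; the transposition itself happens inside the shared-memory tile. The same tile could be addressed from a 1D block by deriving a row and column from a flat thread index, but the 2D block and grid express the mapping directly.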

answered Sep 29 '22 07:09

user2030440
