I just need to clarify something very basic - with most of the computational examples using something like:
ID = blockIdx.x*blockDim.x+threadIdx.x;
// ... then do computation on array[ID]
My question is: if I want to use the maximum number of threads in a block (1024), do I really need to construct my thread ID from all of (threadIdx.x, threadIdx.y, threadIdx.z)?
If so, what is a recommended way to hash it into a single value?
If not, why do people use it in a similar fashion in image-processing operations, such as in this post:
https://stackoverflow.com/questions/11503406/cuda-addressing-a-matrix
How about blockIdx.x and blockIdx.y? Are they in the same boat as threadIdx in this regard?
The blockIdx.x variable is similar to the thread index, except that it refers to the number associated with the block. Say you want 2 blocks in a 1D grid with 5 threads in each block. Your threadIdx.x would be 0, 1, …, 4 in each block, and your blockIdx.x would be 0 or 1 depending on the specific block.
numba.cuda.blockDim: The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is "full").
numba.cuda.blockIdx: The block indices in the grid of threads launched by a kernel.
• dim3 gridDim; – Dimensions of the grid in blocks (gridDim.z is unused only on compute capability 1.x devices; later devices support 3D grids)
• dim3 blockDim; – Dimensions of the block in threads.
Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it using a 2D or 3D threadblock may make the code more readable. But there's no specific reason why it cannot be done with a 1D threadblock with appropriate indexing.
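To illustrate the "appropriate indexing" mentioned above, here is a minimal sketch of a 1D threadblock working on 2D data. It assumes a row-major image of `width` x `height` floats; the kernel name `scale_image_1d` and the host helper `scale_on_gpu` are hypothetical names chosen for this example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element of a row-major width x height image,
// using only the x dimension of the block and of the grid.
__global__ void scale_image_1d(float *img, int width, int height, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // linear element index
    if (idx >= width * height) return;                // guard the last, partial block

    // Recover the 2D coordinates from the linear index when the
    // computation needs them.
    int row = idx / width;
    int col = idx % width;
    img[row * width + col] *= factor;
}

// Hypothetical host helper: copy to the device, launch, copy back.
std::vector<float> scale_on_gpu(std::vector<float> host, int width, int height,
                                float factor)
{
    float *dev = nullptr;
    size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 1024;  // maximum block size, all in the x dimension
    int blocks = (width * height + threads - 1) / threads;  // round up
    scale_image_1d<<<blocks, threads>>>(dev, width, height, factor);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

The division and modulo are only needed if the kernel actually uses `row` and `col`; a purely element-wise operation could index with `idx` directly.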
Creating a 2D or 3D grid (of blocks) is usually done for the reason described above and/or to get around the limit, on pre-CC 3.0 devices, of 65535 blocks in any one dimension of a grid. On CC 3.0 and later devices, the x dimension of a grid can hold up to 2^31-1 blocks.
For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.
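If you do launch a 2D or 3D threadblock and still want a single per-block thread ID, one common way is to linearize the three components with x varying fastest. The sketch below records that ID for every thread of one block; `record_linear_id` and `linear_ids_for_block` are hypothetical names for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Flatten (threadIdx.x, threadIdx.y, threadIdx.z) into one ID per block,
// with x varying fastest, then y, then z.
__global__ void record_linear_id(int *out)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = tid;  // every thread writes to its own slot
}

// Hypothetical host helper: launch a single block of the given shape and
// return what each slot received (-1 means no thread wrote to it).
std::vector<int> linear_ids_for_block(dim3 block)
{
    int n = static_cast<int>(block.x * block.y * block.z);
    std::vector<int> host(n, -1);
    size_t bytes = n * sizeof(int);

    int *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    record_linear_id<<<1, block>>>(dev);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

For example, `linear_ids_for_block(dim3(8, 8, 16))` launches 1024 threads in one block, and each of the 1024 slots is written by exactly one thread.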
If you have a pre CC 3.0 device, and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like:
int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);
The above construct should handle 1D threadblocks with any 2D grid.
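As a sketch of how that index might be used in a complete kernel, the example below splits the required number of 1D blocks across a 2D grid. The `max_grid_x` parameter is a hypothetical knob for this example (on a pre-CC 3.0 device it would be 65535), as are the names `add_one_2d_grid` and `add_one_on_gpu`. It assumes a non-empty array and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

// 1D threadblocks in a 2D grid: flatten the block coordinates first,
// then add the thread offset within the block.
__global__ void add_one_2d_grid(float *data, int n)
{
    int block_id = gridDim.x * blockIdx.y + blockIdx.x;
    int idx = threadIdx.x + block_id * blockDim.x;
    if (idx < n)  // the 2D grid may contain more threads than elements
        data[idx] += 1.0f;
}

// Hypothetical host helper: spread the blocks over a 2D grid whose
// x dimension never exceeds max_grid_x.
std::vector<float> add_one_on_gpu(std::vector<float> host, int threads,
                                  int max_grid_x)
{
    int n = static_cast<int>(host.size());
    int blocks = (n + threads - 1) / threads;                // blocks needed
    int grid_x = blocks < max_grid_x ? blocks : max_grid_x;
    int grid_y = (blocks + grid_x - 1) / grid_x;             // round up
    dim3 grid(grid_x, grid_y);

    float *dev = nullptr;
    size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    add_one_2d_grid<<<grid, threads>>>(dev, n);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Because `grid_x * grid_y` can be larger than the number of blocks actually needed, the `idx < n` check is what keeps the extra threads from writing out of bounds.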
There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.
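One such looping approach is often called a grid-stride loop: each thread starts at its global index and then advances by the total number of threads in the grid until the data is exhausted. The sketch below is one possible form of it, with hypothetical names (`add_one_grid_stride`, `add_one_with_fixed_grid`) and error checking omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread processes elements i, i + stride, i + 2 * stride, ...
// so a small 1D grid can cover an array of any length.
__global__ void add_one_grid_stride(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1.0f;
}

// Hypothetical host helper: the grid size is chosen independently of
// the problem size.
std::vector<float> add_one_with_fixed_grid(std::vector<float> host, int blocks,
                                           int threads)
{
    int n = static_cast<int>(host.size());
    float *dev = nullptr;
    size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    add_one_grid_stride<<<blocks, threads>>>(dev, n);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With this pattern the launch configuration can be tuned for the device rather than for the data, and the 1D grid limit never comes into play.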
This is on top of Robert Crovella's answer:
Creating a 2D/3D grid is not just for readability; it also lets you exploit 2D/3D locality in the on-chip shared memory, which provides much faster access. You cannot exploit such locality efficiently with a 1D grid if your problem works on a 2D grid.
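As one illustration of this kind of 2D locality, here is a sketch of a tiled matrix transpose in which each 2D threadblock stages a 32 x 32 tile in shared memory. This is a commonly used pattern rather than anything specific to this thread; the tile size, the names `transpose_tiled` and `transpose_on_gpu`, and the row-major layout are assumptions for the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // 32 x 32 = 1024 threads per block

// Transpose a row-major height x width matrix into a width x height one.
__global__ void transpose_tiled(const float *in, float *out, int width,
                                int height)
{
    // The extra column of padding is there to avoid shared-memory
    // bank conflicts when the tile is read column-wise.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();  // the whole tile must be loaded before it is read

    // Swap the block coordinates so the tile lands in its transposed spot.
    x = blockIdx.y * TILE + threadIdx.x;  // output column (input row)
    y = blockIdx.x * TILE + threadIdx.y;  // output row (input column)
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: launch one 2D block per tile.
std::vector<float> transpose_on_gpu(const std::vector<float> &host_in,
                                    int width, int height)
{
    size_t bytes = host_in.size() * sizeof(float);
    float *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(dev_in, dev_out, width, height);

    std::vector<float> host_out(host_in.size());
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}
```

Both the global read and the global write walk along `threadIdx.x`, so neighbouring threads touch neighbouring addresses; the reordering between rows and columns happens inside the shared-memory tile.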