If I start my kernel with a grid whose blocks have dimensions:
dim3 block_dims(16,16);
How are the grid blocks now split into warps? Do the first two rows of such a block form one warp, or the first two columns, or is this arbitrarily-ordered?
Assume a GPU Compute Capability of 2.0.
On the K20 GPUs on Stampede, each streaming multiprocessor (SM) executes threads in groups of 32 (a CUDA core executes one thread at a time; it is the warp scheduler that issues 32 threads together). Therefore, blocks are divided into warps of 32 threads for execution.
In an NVIDIA GPU, the basic unit of execution is the warp. A warp is a collection of threads, 32 in current implementations, that are executed simultaneously by an SM. Multiple warps can be executed on an SM at once.
Blocks can be organized into one-, two- or three-dimensional grids of up to 2^31 − 1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively.
Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works.
Threads are numbered in order within blocks so that threadIdx.x varies the fastest, threadIdx.y the second fastest, and threadIdx.z the slowest. This is functionally the same as column-major ordering in multidimensional arrays. Warps are sequentially constructed from threads in this ordering, so the calculation for a 2D block is
unsigned int tid = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int warpid = tid / warpSize;
This is covered both in the programming guide and the PTX guide.
To illustrate @talonmies's answer, here is the Visual Studio Warp Watch window for two consecutive warps (with dim3 block_dims(16,16); and warpSize = 32): [screenshot not reproduced]