
CUDA: thread execution order in a 3D block

Tags:

cuda

As the title says, I would like to know the correct execution order of threads in a 3D block.

I remember reading something about it some time ago, but I don't remember where, and the source did not look very reliable.

So I would like some confirmation.

Is the order as follows (divided into warps)?

[0, 0, 0]...[blockDim.x-1, 0, 0] - [0, 1, 0]...[blockDim.x-1, 1, 0] - (...) - [0, blockDim.y-1, 0]...[blockDim.x-1, blockDim.y-1, 0] - [0, 0, 1]...[blockDim.x-1, 0, 1] - (...) - [0, blockDim.y-1, 1]...[blockDim.x-1, blockDim.y-1, 1] - (...) - [blockDim.x-1, blockDim.y-1, blockDim.z-1]


asked Jul 16 '12 13:07

elect

1 Answer

Yes, that is the correct ordering; threads are ordered with the x dimension varying first, then y, then z (equivalent to column-major order) within a block. The calculation can be expressed as

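// Linear index of this thread within its block (x varies fastest, then y, then z)
int threadID = threadIdx.x + 
               blockDim.x * threadIdx.y + 
               (blockDim.x * blockDim.y) * threadIdx.z;

// warpSize is the built-in device variable (note the capital S)
int warpID = threadID / warpSize;
int laneID = threadID % warpSize;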

Here threadID is the thread number within the block, warpID is the warp within the block and laneID is the thread number within the warp.
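To make the numbering concrete, here is a minimal host-side sketch of the same calculation that can be run without a GPU. The names Dim3, ThreadPlace, placeThread, kAssumedWarpSize and HOST_DEVICE are hypothetical helpers introduced for illustration, not part of the CUDA API, and the warp size of 32 is an assumption (device code should read the built-in warpSize instead).

```cpp
#include <cassert>

// Lets the same helper compile as plain C++ or, under nvcc, as host/device code.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for threadIdx / blockDim (illustration only).
struct Dim3 { int x, y, z; };

struct ThreadPlace {
    int threadID;  // linear index of the thread within the block
    int warpID;    // warp within the block
    int laneID;    // position of the thread within its warp
};

// Assumption: 32 threads per warp. Real kernels should use warpSize.
constexpr int kAssumedWarpSize = 32;

HOST_DEVICE inline ThreadPlace placeThread(Dim3 idx, Dim3 dim) {
    // Valid thread indices run from 0 to dim - 1 in each dimension.
    assert(idx.x >= 0 && idx.x < dim.x);
    assert(idx.y >= 0 && idx.y < dim.y);
    assert(idx.z >= 0 && idx.z < dim.z);

    // x varies fastest, then y, then z.
    int threadID = idx.x + dim.x * idx.y + (dim.x * dim.y) * idx.z;
    return { threadID, threadID / kAssumedWarpSize, threadID % kAssumedWarpSize };
}
```

For an 8x4x2 block under that assumption, placeThread({7, 3, 0}, {8, 4, 2}) gives threadID 31, the last lane of warp 0, and placeThread({0, 0, 1}, {8, 4, 2}) gives threadID 32, lane 0 of warp 1, so each z-slice of this particular block fills exactly one warp.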

Note that threads are not necessarily executed in any sort of predictable order related to this numbering within a block. The execution model has traditionally run threads of the same warp in lock-step (on Volta and later GPUs, independent thread scheduling relaxes even that unless you synchronize explicitly, for example with __syncwarp()), but you can't infer any more than that from the thread numbering within a block.

answered Oct 09 '22 06:10

talonmies
