Theoretically, you can have 65535 blocks per dimension of the grid, up to 65535 * 65535 * 65535.
If you call a kernel like this:
kernel<<< BLOCKS,THREADS >>>()
(without dim3 objects), what is the maximum number available for BLOCKS?
In an application of mine, I set it to 192000 and it seemed to work fine. The problem is that the kernel modifies the contents of a huge array, so although the parts of the array I checked looked fine, I can't be sure the kernel behaved correctly everywhere else.
For the record, I have a compute capability 2.1 GPU, GTX 500 ti.
Section K.1, Features and Technical Specifications, of the programming guide states that the maximum number of threads per block and the maximum x- or y-dimension of a block are both 1024. Thus, the maximum value of block_size is 1024.
The CUDA architecture limits the number of threads per block (1024 threads per block). The dimensions of the thread block are accessible within the kernel through the built-in blockDim variable.
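Rather than relying on the numbers from the guide, the limits can be read from the device at run time. The following is a minimal sketch using cudaGetDeviceProperties from the CUDA runtime; the helper names blockShapeFits and printLaunchLimits are made up for illustration and are not part of any library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if `block` respects the per-block limits in `prop`.
bool blockShapeFits(const cudaDeviceProp& prop, dim3 block) {
    unsigned long long total =
        (unsigned long long)block.x * block.y * block.z;
    return block.x >= 1 && block.y >= 1 && block.z >= 1 &&
           block.x <= (unsigned)prop.maxThreadsDim[0] &&
           block.y <= (unsigned)prop.maxThreadsDim[1] &&
           block.z <= (unsigned)prop.maxThreadsDim[2] &&
           total <= (unsigned long long)prop.maxThreadsPerBlock;
}

// Hypothetical helper: prints the launch limits of device `dev`.
cudaError_t printLaunchLimits(int dev) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) return err;
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return cudaSuccess;
}
```

Calling printLaunchLimits(0) from main prints the limits for the first device. Note that a block such as dim3(32, 32, 2) stays within every per-dimension limit but still exceeds 1024 threads in total, which is why the helper checks the product as well.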
Each SM has a limited number of registers and a limited amount of shared memory, which bounds how many thread blocks can be resident on it at once. There is also a hard cap: for example, no more than 16 thread blocks can run simultaneously on a single SM with the Kepler microarchitecture.
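Each CUDA card has a maximum number of threads in a block (512 on compute capability 1.x, 1024 on 2.0 and later; 2048 is a per-SM limit on some architectures, not a per-block one). Each thread also has a thread id within its block: threadId = x + y Dx + z Dx Dy, where (x, y, z) is the thread index and (Dx, Dy, Dz) are the block dimensions. The threadId is like the 1D representation of a multidimensional array in memory. If you are working with 1D vectors, then Dy and Dz are both 1.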
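The thread id formula can be sketched in code as follows. The names flatThreadId and writeThreadIds are hypothetical, chosen only for this illustration; threadIdx and blockDim are the usual built-in variables.

```cuda
#include <cuda_runtime.h>

// threadId = x + y*Dx + z*Dx*Dy, usable on both host and device.
__host__ __device__ inline unsigned flatThreadId(unsigned x, unsigned y,
                                                 unsigned z, unsigned Dx,
                                                 unsigned Dy) {
    return x + y * Dx + z * Dx * Dy;
}

// Hypothetical kernel: each thread stores its flat id at out[flat id].
// Intended for a single-block launch; `out` must hold at least
// blockDim.x * blockDim.y * blockDim.z elements.
__global__ void writeThreadIds(unsigned* out) {
    unsigned id = flatThreadId(threadIdx.x, threadIdx.y, threadIdx.z,
                               blockDim.x, blockDim.y);
    out[id] = id;
}
```

For a 1D block, Dy is 1 and y and z are 0, so the formula reduces to threadId = x.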
With compute capability 3.0 or higher, you can have up to 2^31 - 1 blocks in the x-dimension, and at most 65535 blocks in the y and z dimensions. See Table H.1. Feature Support per Compute Capability of the CUDA C Programming Guide Version 9.1.
As Pavan pointed out, if you do not provide a dim3 for the grid configuration, you use only the x-dimension, hence the x-dimension limit applies here.
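On a compute capability 2.x device the x-dimension limit is 65535, so a launch with 192000 blocks would be expected to fail with an invalid-configuration error instead of running, and without error checking that failure goes unnoticed. Below is a minimal sketch of one way to handle this: clamp the block count to the device limit, use a grid-stride loop so the smaller grid still covers the whole array, and check the launch status. The kernel addOne and the helpers clampBlocks and launchAddOne are hypothetical names, not code from the question.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: the grid-stride loop lets any grid size cover n elements.
__global__ void addOne(int* data, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += stride) {
        data[i] += 1;
    }
}

// Clamp the requested block count to [1, maxGridX].
unsigned clampBlocks(unsigned long long wanted, int maxGridX) {
    if (wanted == 0) return 1;  // a grid of 0 blocks is not a valid launch
    if (wanted > (unsigned long long)maxGridX) return (unsigned)maxGridX;
    return (unsigned)wanted;
}

// Launch on the current device and return the first error encountered.
cudaError_t launchAddOne(int* d_data, size_t n,
                         unsigned long long wantedBlocks, unsigned threads) {
    int dev = 0;
    cudaError_t err = cudaGetDevice(&dev);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) return err;

    unsigned blocks = clampBlocks(wantedBlocks, prop.maxGridSize[0]);
    addOne<<<blocks, threads>>>(d_data, n);

    err = cudaGetLastError();        // reports an invalid launch configuration
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();  // reports errors raised during execution
}
```

If launchAddOne returns anything other than cudaSuccess, cudaGetErrorString turns the code into a readable message. The same two checks (cudaGetLastError right after the launch, then cudaDeviceSynchronize) can be added to an existing kernel call to confirm whether it actually ran.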