I am confused about the maximum number of threads that can be launched on a Fermi GPU.
My GTX 570 device query says the following.
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
I understand the above statements as follows:
For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.
Is this correct? What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?
Also, after writing and launching a CUDA kernel, how do I know that the number of threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk, or behave weirdly, if I have by chance requested more threads than are possible for that particular kernel.
Remember: a CUDA streaming multiprocessor executes threads in warps of 32 threads, and there is a maximum of 1024 threads per block (for our GPU).
The limit on the number of threads per block is 1024 (in your case), but the block size you can actually achieve also depends on the amount of shared memory and the number of registers each thread requires.
Theoretically, you can have up to 65535 blocks per grid dimension, i.e. a grid of up to 65535 * 65535 * 65535 blocks.
A few noticeable points: the CUDA architecture limits the number of threads per block (1024 threads per block on this hardware), and the dimensions of the thread block are accessible within the kernel through the built-in blockDim variable, as in the sketch below.
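For instance, here is a minimal kernel sketch (the kernel name fillIndex and the output array are hypothetical) that combines blockDim with the built-in blockIdx and threadIdx variables to compute a unique global index per thread:

__global__ void fillIndex(int *out, int n)
{
    // Global thread index: block offset plus position within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)    // guard: the grid may contain more threads than elements
        out[idx] = idx;
}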
For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.
No, this is not correct. You can launch a grid of up to 65535 x 65535 x 65535 blocks, and each block has a maximum of 1024 threads, although per-thread resource limitations may restrict the total number of threads per block to fewer than this maximum.
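As an illustration, here is a minimal host-side sketch of a legal execution configuration on a compute capability 2.x device. fillIndex is the hypothetical kernel sketched above, and d_out is assumed to be a device buffer already allocated with cudaMalloc:

int n = 1 << 20;
dim3 block(1024);                        // maximum threads per block on Fermi
dim3 grid((n + block.x - 1) / block.x);  // round up so the grid covers all n elements
// Each of grid.x, grid.y and grid.z may be at most 65535 on compute 2.x,
// so very large problems need a 2-D/3-D grid or a grid-stride loop.
fillIndex<<<grid, block>>>(d_out, n);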
What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?
No, you will not be able to reach the maximum threads per block in this case. Each release of the NVIDIA CUDA toolkit includes an Occupancy Calculator spreadsheet that you can use to see the effect of register pressure on the achievable block size.
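You can also query a compiled kernel's resource usage at runtime with the cudaFuncGetAttributes() runtime API. In the sketch below, heavyKernel is a hypothetical kernel standing in for your own; the attributes report its registers per thread and the largest block size it can actually be launched with once register (and other per-thread resource) limits are taken into account:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void heavyKernel(float *out)  // hypothetical kernel; body omitted
{
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, heavyKernel);
    // attr.maxThreadsPerBlock is the per-kernel limit, which can be
    // smaller than the device-wide limit of 1024 threads per block.
    printf("registers/thread: %d, max threads/block for this kernel: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);
    return 0;
}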
Also, after writing and launching the CUDA kernel, how do I know that the number of threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk, or behave weirdly, if I have by chance requested more threads than are possible for that particular kernel.
If you choose an illegal execution configuration (an incorrect block or grid size), the kernel will not launch and the runtime will raise a cudaErrorInvalidConfiguration error. You can use the standard cudaPeekAtLastError() and cudaGetLastError() calls to check the status of any kernel launch.
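For example, a minimal checking pattern might look like the following sketch, where myKernel, grid, block, d_out and n are placeholders for your own kernel and launch parameters (the usual cstdio and cuda_runtime.h includes are assumed):

myKernel<<<grid, block>>>(d_out, n);

cudaError_t err = cudaPeekAtLastError();    // catches launch-time errors,
if (err != cudaSuccess)                     // e.g. cudaErrorInvalidConfiguration
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();              // catches errors raised while
if (err != cudaSuccess)                     // the kernel was executing
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));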