 

Max number of threads which can be initiated in a single CUDA kernel

Tags:

cuda

gpu

thrust

I am confused about the maximum number of threads which can be launched in a Fermi GPU.

My GTX 570 device query says the following.

  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

I understand the above statement as follows:

For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.

Is this correct? What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?

After writing and launching the CUDA kernel, how do I know that the number of threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk, or behave weirdly, if I have by chance requested more threads than are possible for that particular kernel.

smilingbuddha asked Aug 22 '12 17:08


1 Answer

For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.

No, this is not correct. You can launch a grid of up to 65535 x 65535 x 65535 blocks, and each block can contain at most 1024 threads, although per-thread resource limits (registers and shared memory) may restrict the actual number of threads per block to fewer than this maximum.

What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?

No, you will not be able to reach the maximum threads per block in this case. Each release of the NVIDIA CUDA toolkit includes an occupancy calculator spreadsheet you can use to see the effect of register pressure on the achievable block size.
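Later toolkits also expose this programmatically. A minimal sketch, assuming a CUDA 6.5 or newer toolkit (so newer than this answer): `cudaOccupancyMaxPotentialBlockSize` asks the runtime for a suggested block size given the kernel's actual register and shared memory usage on the current device. The kernel itself is illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; its real register usage is what the occupancy
// API inspects at runtime.
__global__ void heavyKernel(float *out) {
    float acc = 0.0f;
    for (int i = 0; i < 64; ++i)
        acc += i * 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Reports the largest block size that still achieves maximum
    // occupancy for this kernel, accounting for register pressure.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       heavyKernel, 0, 0);
    printf("suggested block size: %d (min grid for full occupancy: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

You can also see per-kernel register counts at compile time by passing `--ptxas-options=-v` to nvcc.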

Also, after writing and launching the CUDA kernel, how do I know that the number of threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk, or behave weirdly, if I have by chance requested more threads than are possible for that particular kernel.

If you choose an illegal execution configuration (an invalid block or grid size), the kernel will not launch and the runtime will report a cudaErrorInvalidConfiguration error. You can use the standard cudaPeekAtLastError() and cudaGetLastError() calls to check the status of any kernel launch.
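A minimal sketch of that check, deliberately exceeding the 1024 threads-per-block limit from the device query above so the launch fails:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main() {
    // 2048 threads per block exceeds the 1024-thread limit, so this
    // launch is rejected before any GPU work happens.
    dummyKernel<<<1, 2048>>>();

    // cudaGetLastError() returns (and clears) the error from the most
    // recent runtime action, including a failed kernel launch;
    // cudaPeekAtLastError() does the same without clearing it.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Because kernel launches are asynchronous, errors that occur during execution (rather than at launch) only surface at the next synchronizing call, e.g. cudaDeviceSynchronize().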

talonmies answered Oct 20 '22 11:10