 

Max number of threads which can be initiated in a single CUDA kernel

Tags:

cuda

gpu

thrust

I am confused about the maximum number of threads which can be launched in a Fermi GPU.

My GTX 570 device query says the following.

  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

I understand the above statement as follows:

For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.

Is this correct? What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?

After writing and launching the CUDA kernel, how do I know that the number of threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk, or behave weirdly, if I have by chance requested more threads than are possible for that particular kernel.

smilingbuddha asked Aug 22 '12 17:08


1 Answer

For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.

No, this is not correct. You can launch a grid of up to 65535 x 65535 x 65535 blocks, and each block can contain at most 1024 threads, although per-thread resource limits (registers and shared memory) may restrict the actual number of threads per block to fewer than this maximum.

What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?

No, you will not be able to reach the maximum threads per block in this case. Each release of the NVIDIA CUDA toolkit includes an occupancy calculator spreadsheet you can use to see the effect of register pressure on the achievable block size.
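Later toolkits also expose this programmatically. A minimal sketch, assuming a CUDA 6.5 or newer toolkit (so newer than this answer): `cudaOccupancyMaxPotentialBlockSize` asks the runtime for a suggested block size given the kernel's actual register and shared memory usage on the current device. The kernel itself is illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; its real register usage is what the occupancy
// API inspects at runtime.
__global__ void heavyKernel(float *out) {
    float acc = 0.0f;
    for (int i = 0; i < 64; ++i)
        acc += i * 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Reports the largest block size that still achieves maximum
    // occupancy for this kernel, accounting for register pressure.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       heavyKernel, 0, 0);
    printf("suggested block size: %d (min grid for full occupancy: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

You can also see per-kernel register counts at compile time by passing `--ptxas-options=-v` to nvcc.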

Also, after writing and launching the CUDA kernel, how do I know that the number of threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk, or behave weirdly, if I have by chance requested more threads than are possible for that particular kernel.

If you choose an illegal execution configuration (an invalid block or grid size), the kernel will not launch and the runtime will report a cudaErrorInvalidConfiguration error. You can use the standard cudaPeekAtLastError() and cudaGetLastError() calls to check the status of any kernel launch.
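A minimal sketch of that check, deliberately exceeding the 1024 threads-per-block limit from the device query above so the launch fails:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main() {
    // 2048 threads per block exceeds the 1024-thread limit, so this
    // launch is rejected before any GPU work happens.
    dummyKernel<<<1, 2048>>>();

    // cudaGetLastError() returns (and clears) the error from the most
    // recent runtime action, including a failed kernel launch;
    // cudaPeekAtLastError() does the same without clearing it.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Because kernel launches are asynchronous, errors that occur during execution (rather than at launch) only surface at the next synchronizing call, e.g. cudaDeviceSynchronize().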

talonmies answered Oct 20 '22 11:10