Maximum number of resident threads per multiprocessor VS. Maximum number of resident blocks per multiprocessor

Tags: cuda, hpc

I'm running into an issue on my K20 involving resource limits with concurrent kernel execution. My streams only get a little overlap, and I thought this might be because of a resource limitation. So I referred to the manual and found this: the maximum number of resident blocks per multiprocessor is 16, and the maximum number of resident threads per multiprocessor is 2048.

So my question is: if I launch a kernel with 96 blocks of 1024 threads each, how many SMs will this kernel use in parallel?

Answer 1: 96/16 = 6

Answer 2: 96 × 1024 / 2048 = 48 (but the K20 only has 13 SMs, so how will this kernel behave?)

Or maybe you have another answer?

asked Feb 16 '23 by Archeosudoerus

2 Answers

The number of blocks resident per SM depends on the following:

  1. The hard limit on the number of blocks per SM.
  2. The number of threads per block.
  3. The amount of shared memory used per block.
  4. The number of registers used per block.

Assuming shared memory and registers are not limiting factors, let us look at a couple of cases.
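Under that assumption, the arithmetic in the cases below can be sketched in a few lines (a minimal Python sketch; the limits 16 and 2048 are the compute-capability-3.5 values from the question, and the floor division reflects that only whole blocks can be resident):

```python
import math

def blocks_per_sm(threads_per_block, max_blocks_per_sm=16, max_threads_per_sm=2048):
    """Resident blocks per SM: the tighter of the hard block cap (1)
    and the thread cap (2); whole blocks only, so round down."""
    by_threads = max_threads_per_sm // threads_per_block
    return min(max_blocks_per_sm, by_threads)

def sms_needed(num_blocks, threads_per_block):
    """SMs required for every block to be resident at once."""
    return math.ceil(num_blocks / blocks_per_sm(threads_per_block))

# Case 1 below: 32 threads per block, 64 blocks
print(blocks_per_sm(32))   # 16 -- the hard block cap wins over 2048/32 = 64
print(sms_needed(64, 32))  # 4
```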

Case 1: 32 threads per block, 64 blocks.

Looking only at the thread count, 2048 / 32 = 64 blocks would fit on a single SM. But there is a hard limit of 16 blocks per SM. In this case (2) is not the limiting constraint; (1) is. So you get 16 blocks per SM and 4 SMs used.

Case 2: 1024 threads per block, 32 blocks.

In this case (2) is the limiting factor. You can only have 2048 threads per SM, leaving you with 2 blocks per SM and 16 SMs needed (since the K20 has only 13 SMs, there will obviously be some block switching involved).

Case 3: 1024 threads per block, 96 blocks (as presented in the question).

As above, (2) is the limiting factor: you are only running 2 blocks per SM, so 48 SMs would be required in theory. Only 26 (13 × 2) blocks are "active" at any given point; CUDA takes care of swapping in the waiting blocks as resident ones finish.
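For the question's exact configuration, the same arithmetic against the K20's 13 SMs looks like this (a sketch under the same simplifying assumption that registers and shared memory are not limiting):

```python
import math

threads_per_block = 1024
num_blocks = 96
max_threads_per_sm = 2048
max_blocks_per_sm = 16
num_sms = 13  # K20

per_sm = min(max_blocks_per_sm, max_threads_per_sm // threads_per_block)  # 2
sms_required = math.ceil(num_blocks / per_sm)       # 48 -- more SMs than exist
resident = min(num_blocks, per_sm * num_sms)        # 26 blocks active at once
waves = math.ceil(num_blocks / (per_sm * num_sms))  # ~4 "waves" of blocks

print(per_sm, sms_required, resident, waves)  # 2 48 26 4
```

So all 13 SMs are used in parallel, but the 96 blocks drain through them in roughly four batches of 26.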

TL;DR: The constraint that gives you the fewest blocks per SM is the limiting one.

answered Feb 18 '23 by Pavan Yalamanchili


Quoting the CUDA C Programming Guide:

The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor.

There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.

So it is better to talk about the maximum number of resident blocks per multiprocessor, since the actual number depends on the amount of registers and shared memory used by the kernel, as the guide says.
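To illustrate the guide's point, the resident block count per SM can be sketched as the minimum over all four constraints (a simplified model: real hardware also rounds register and shared-memory allocations to granularity boundaries, which this ignores; the figures of 65536 registers and 48 KB shared memory per SM are the compute-capability-3.5 limits for the K20):

```python
def resident_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                           max_blocks=16, max_threads=2048,
                           regs_per_sm=65536, smem_per_sm=48 * 1024):
    """Resident blocks per SM: the tightest of the four limits wins."""
    limits = [
        max_blocks,                                            # hard block cap
        max_threads // threads_per_block,                      # thread cap
        regs_per_sm // (regs_per_thread * threads_per_block),  # register cap
        smem_per_sm // smem_per_block if smem_per_block else max_blocks,  # smem cap
    ]
    return min(limits)

# 1024 threads/block, 32 registers/thread, 8 KB shared memory per block:
# both the thread cap and the register cap allow only 2 blocks.
print(resident_blocks_per_sm(1024, 32, 8 * 1024))  # 2
```

The CUDA runtime exposes this calculation directly via `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which accounts for the granularity details this sketch omits.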

For the case you mentioned, I would say the kernel will use all the SMs simultaneously; at best, each will host 2 blocks, for a total of 26 blocks simultaneously resident on the card.

I recommend the following reference:

Shane Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, Chapter 5, and Chapter 9, Strategy 4 (register usage).

answered Feb 18 '23 by Vitality