This question arises from the difference between the theoretical and achieved occupancy observed for a kernel. I'm aware of "Different occupancy between calculator and nvprof" and also of "A question about the details about the distribution from blocks to SMs in CUDA".
Let's consider a GPU with compute capability 6.1 and 15 SMs (GTX TITAN, Pascal architecture, GP104 chipset), and a small problem size of 2304 elements.
If we configure the kernel with 512 threads per block, and each thread processes one element, we need 5 blocks to cover all the data. The kernel is small enough that it hits no resource limit regarding registers or shared memory.
The theoretical occupancy is therefore 1, because four concurrent blocks of 512 threads can be resident on one SM (2048 threads), giving 2048 / 32 = 64 active warps (the maximum).
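A minimal sketch of this launch configuration, assuming a hypothetical element-wise kernel (the names process, d_in and d_out are illustrative, not from the original project):

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // threads past the last element do nothing
        out[i] = in[i] * 2.0f;       // placeholder per-element work
}

int main()
{
    const int n = 2304;
    const int threadsPerBlock = 512;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 5

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    process<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```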
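The theoretical value can be checked with the runtime occupancy API. A short sketch, assuming the same illustrative process kernel as above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    const int blockSize = 512;

    // Maximum number of resident blocks of this kernel per SM (0 bytes of dynamic shared memory)
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, process, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps   = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Blocks per SM: %d, theoretical occupancy: %.2f\n",
           maxBlocksPerSM, (double)activeWarps / maxWarps);
    return 0;
}
```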
However, the achieved occupancy reported by the NVIDIA profiler is ~0.215, which is probably related to how blocks are mapped onto SMs. So, how are blocks scheduled onto SMs in CUDA when there are fewer blocks than available SMs?
Option 1: schedule 4 blocks of 512 threads onto one SM and 1 block of 512 onto another SM. In this case the occupancy would be (1 + 0.125) / 2 ≈ 0.56. I assume the last block has only 256 of its 512 threads active (covering the last 256 elements of the array) and is resident on the second SM, so only 8 warps are active there, considering warp granularity.
Option 2: schedule each block of 512 threads onto a different SM. Since we have 15 SMs, why saturate only one of them with many blocks? In this case we have 512 / 32 = 16 active warps per SM (except the last one, which has only 256 active threads). So the achieved occupancy is 0.25 on four SMs and 0.125 on the last one, giving (0.25 + 0.25 + 0.25 + 0.25 + 0.125) / 5 = 0.225.
Option 2 is closer to the occupancy reported by the Visual Profiler and, in our opinion, is what is happening behind the scenes. Still, it is worth asking: how are blocks scheduled onto SMs in CUDA when there are fewer blocks than available SMs? Is it documented?
-- Please note this is not homework. It's a real scenario in a project that uses several third-party libraries, where some steps of a multi-kernel pipeline process only a small number of elements.
As noted in comments posted over several years to this question, the behaviour of the block scheduler is undefined, and there is no guarantee that it is the same from hardware generation to hardware generation, driver/runtime version to driver/runtime version, or even platform to platform.
It is certainly possible to instrument code with assembly instructions to read the clock and SM IDs and run some cases to see what happens on your device (see the sketch below). As Greg Smith pointed out in comments, you will probably come to the conclusion that the scheduler works breadth first, filling SMs to maximum available occupancy as it goes, but it isn't necessarily always like that. Ultimately, any heuristics you try to build by exploiting your findings would be relying on undefined behaviour.
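A hedged sketch of that instrumentation idea: each block records the SM it ran on (read from the %smid special register via inline PTX) and a start timestamp from clock64(). The kernel and variable names are illustrative only, and clock values are only comparable between blocks that ran on the same SM:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));  // ID of the SM executing this thread
    return smid;
}

__global__ void probe(unsigned int *smids, unsigned long long *starts)
{
    if (threadIdx.x == 0) {                 // one record per block
        smids[blockIdx.x]  = get_smid();
        starts[blockIdx.x] = clock64();     // SM-local clock
    }
    // ... real per-element work would go here ...
}

int main()
{
    const int blocks = 5, threads = 512;
    unsigned int *d_smids;
    unsigned long long *d_starts;
    cudaMalloc(&d_smids, blocks * sizeof(unsigned int));
    cudaMalloc(&d_starts, blocks * sizeof(unsigned long long));

    probe<<<blocks, threads>>>(d_smids, d_starts);
    cudaDeviceSynchronize();

    unsigned int h_smids[blocks];
    unsigned long long h_starts[blocks];
    cudaMemcpy(h_smids, d_smids, sizeof(h_smids), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_starts, d_starts, sizeof(h_starts), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %d -> SM %u (start clock %llu)\n", b, h_smids[b], h_starts[b]);

    cudaFree(d_smids);
    cudaFree(d_starts);
    return 0;
}
```

Running this a few times shows where each block landed on your particular device and driver, but, as stated above, the mapping is not guaranteed to be stable across runs, hardware, or software versions.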
[Assembled from comments and added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]