Assume I have 8 thread blocks and my GPU has 8 SMs. How does the GPU assign these thread blocks to the SMs?
Some programs and articles I found suggest a breadth-first manner, that is, each SM runs one thread block in this example. However, according to a few documents, increasing occupancy may be a good idea if a kernel is latency-limited. From that, one might infer that the 8 thread blocks would run on 4 or fewer SMs if possible.
I wonder which of these is what actually happens. Thanks in advance.
It's hard to tell what the GPU is doing exactly. If you have a specific kernel you're interested in, you could try reading and storing the %smid register for each block.
An example of how to do this is given here.
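In case it helps, here is a minimal sketch of the %smid approach (my own illustration, not necessarily the same as the linked example). It assumes 8 blocks of 128 threads to match the question; the names `get_smid`, `record_smid`, and `block_to_sm` are made up for this sketch. Thread 0 of each block reads %smid through inline PTX and stores it, and the host prints the block-to-SM mapping.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the %smid special register: the ID of the SM this thread is
// running on at the moment of the read.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical kernel: one thread per block records where the block ran.
__global__ void record_smid(unsigned int *block_to_sm) {
    if (threadIdx.x == 0) {
        block_to_sm[blockIdx.x] = get_smid();
    }
}

int main() {
    const int num_blocks = 8;          // assumption: matches the question
    const int threads_per_block = 128; // assumption: arbitrary small block

    unsigned int host_ids[num_blocks];
    unsigned int *dev_ids = nullptr;

    cudaError_t err = cudaMalloc(&dev_ids, num_blocks * sizeof(unsigned int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    record_smid<<<num_blocks, threads_per_block>>>(dev_ids);

    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(host_ids, dev_ids, sizeof(host_ids),
                     cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel or copy failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev_ids);
        return 1;
    }

    for (int i = 0; i < num_blocks; ++i) {
        printf("block %d ran on SM %u\n", i, host_ids[i]);
    }

    cudaFree(dev_ids);
    return 0;
}
```

Compile with nvcc and run it a few times. If every block reports a different SM, the blocks were spread out; repeated IDs mean several blocks shared an SM. Keep in mind that this only shows what happened for this particular launch on your device and driver. The scheduling policy is not documented, and the PTX documentation notes that SM IDs are not guaranteed to be contiguous and that %smid only reflects where the thread was when it was read.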