How does Nvidia's Fermi GPU issue thread blocks to streaming multiprocessors?

Question

Assume I have 8 thread blocks and my GPU has 8 SMs. How does the GPU issue these thread blocks to the SMs?

Some programs and articles I found suggest a breadth-first assignment, that is, each SM runs one thread block in this example. However, a few documents say that increasing occupancy may be a good idea if a kernel is latency-limited. From that, one might infer that the 8 thread blocks would be packed onto 4 or fewer SMs if possible.

I wonder which of these reflects reality. Thanks in advance.

Pedro · Accepted Answer

It's hard to tell exactly what the GPU is doing. If you have a specific kernel you're interested in, you could try reading and storing the %smid register (the PTX special register that holds the ID of the SM a thread is running on) for each block.

An example of how to do this is given here.
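
As a minimal sketch of that idea (not necessarily the linked example), the following CUDA program has thread 0 of each block read %smid through inline PTX and store it in an array indexed by block. The names `get_smid`, `record_smid`, and `sm_of_block` are made up for illustration, and the launch of 8 blocks with 128 threads each is an assumption chosen to mirror the question. Note that the PTX documentation describes %smid as volatile, so treat the result as an observation for one run on one device rather than a guaranteed scheduling policy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: read the PTX special register %smid,
// i.e. the ID of the SM executing the calling thread.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block records which SM the block was placed on.
__global__ void record_smid(unsigned int *sm_of_block) {
    if (threadIdx.x == 0) {
        sm_of_block[blockIdx.x] = get_smid();
    }
}

int main() {
    const int num_blocks = 8;          // assumption: matches the question
    const int threads_per_block = 128; // assumption: arbitrary block size

    unsigned int host_ids[num_blocks];
    unsigned int *dev_ids = nullptr;

    cudaError_t err = cudaMalloc(&dev_ids, sizeof(host_ids));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    record_smid<<<num_blocks, threads_per_block>>>(dev_ids);

    // cudaMemcpy waits for the kernel to finish and also reports launch or
    // execution errors from it.
    err = cudaMemcpy(host_ids, dev_ids, sizeof(host_ids), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel or copy failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev_ids);
        return 1;
    }

    for (int b = 0; b < num_blocks; ++b) {
        std::printf("block %d ran on SM %u\n", b, host_ids[b]);
    }

    cudaFree(dev_ids);
    return 0;
}
```

If the printed SM IDs are all distinct, the blocks were spread one per SM for that launch. Repeated IDs would mean several blocks shared an SM. Varying the block size, shared memory use, or number of blocks and rerunning shows how placement changes for your own kernel.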

Tags: cuda, gpu, multiprocessor

Antony Yu
