As stated in this work:
If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, an average of three instructions can be executed before a hazard), and the instruction pipeline depth is 22 stages, as few as eight active warps (22 / 3) may be sufficient to completely hide instruction latency and achieve max arithmetic throughput.
I don't understand why this is sufficient.
If the scheduler can successfully issue an instruction from the same warp at every instruction issue cycle for 22 consecutive cycles, then the scheduler has no reason to schedule another warp in its place and that single warp is enough to fill the pipeline. That would correspond to an ILP of at least 22.
But Real-World Code™ never exhibits such high ILP: for example, some instructions depend on the results of previous instructions or on memory requests. When the scheduler can no longer find an independent instruction to issue, the execution of that warp stalls. The scheduler then picks another warp which is ready to execute, issues as many instructions as it can until that warp also stalls, and so on.
So if warp #1 successfully issues 3 instructions and then stalls, the scheduler picks warp #2, issues 3 instructions... etc. By the time the scheduler gets to warp #8, there are already 21 instructions in the pipeline from the 7 stalled warps. Issuing a single instruction from that warp is then enough to fill the pipeline entirely. By the time the pipeline starts to drain, warp #1 is ready again, which is why 8 warps with an ILP of 3 are enough to keep a 22-stage pipeline full.
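You can sketch this argument with a toy simulation. Note this is an illustrative model, not the real hardware scheduler: the round-robin order, the `issue_stalls` helper, and the assumption that each group of ILP instructions depends only on the first instruction of the previous group are all simplifications I'm introducing here.

```python
def issue_stalls(num_warps, ilp=3, depth=22, rounds=10):
    """Count idle issue cycles for a round-robin scheduler over `rounds` rounds.

    Toy model: each warp issues `ilp` independent instructions back-to-back,
    then stalls until the result of the first instruction of that group is
    available, `depth` cycles after it was issued.
    """
    ready = [0] * num_warps  # earliest cycle each warp can issue again
    cycle = 0
    stalls = 0
    for _ in range(rounds):
        for w in range(num_warps):
            if cycle < ready[w]:           # warp not ready: pipeline bubble
                stalls += ready[w] - cycle
                cycle = ready[w]
            ready[w] = cycle + depth       # dependency resolves `depth` cycles later
            cycle += ilp                   # issue `ilp` back-to-back instructions
    return stalls

print(issue_stalls(8))               # 8 warps, ILP 3, depth 22 → 0 (no bubbles)
print(issue_stalls(7))               # 7 warps → bubbles every round
print(issue_stalls(1, ilp=22))       # single warp with ILP 22 → 0 (no bubbles)
```

With 8 warps the issue slots are never idle (8 × 3 = 24 ≥ 22, so warp #1's result is back before its turn comes around again), while 7 warps only cover 21 of the 22 cycles and leave a bubble in every scheduling round. The last line corresponds to the single-warp, ILP ≥ 22 case described above.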