
Why does CUDA GPU only need 8 active warps?

Tags: cuda, gpu

As stated in this work:

If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, an average of three instructions can be executed before a hazard), and the instruction pipeline depth is 22 stages, as few as eight active warps (22 / 3) may be sufficient to completely hide instruction latency and achieve max arithmetic throughput.

I don't understand why eight warps are sufficient.

asked Apr 27 '26 13:04 by 9__


1 Answer

If the scheduler can successfully issue an instruction from the same warp at every instruction issue cycle for 22 consecutive cycles, then the scheduler has no reason to schedule another warp in its place and that single warp is enough to fill the pipeline. That would correspond to an ILP of at least 22.
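As a rough sketch of that relationship (not taken from the quoted work beyond its 22 / 3 arithmetic), the number of warps needed can be estimated as the pipeline depth divided by the ILP, rounded up. The function name `warps_needed` is hypothetical, and the model assumes one instruction issued per cycle and nothing else limiting throughput.

```python
import math

def warps_needed(pipeline_depth: int, ilp: float) -> int:
    """Estimate how many warps are needed to keep the pipeline full.

    Assumption: each warp can issue `ilp` independent instructions
    before it stalls, and one instruction is issued per cycle.
    """
    return math.ceil(pipeline_depth / ilp)

print(warps_needed(22, 3))   # 8  (22 / 3 = 7.33..., rounded up)
print(warps_needed(22, 22))  # 1  (a single warp fills the pipeline)
```

With an ILP of 22 the estimate is a single warp, which is the case described above.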

But Real-World Code™ never exhibits that much ILP: some instructions depend on the results of earlier instructions or on outstanding memory requests. When the scheduler can no longer find an independent instruction to issue from a warp, that warp stalls. The scheduler then picks another warp which is ready to execute and issues as many instructions from it as it can until that warp also stalls, and so on.

So if warp #1 issues 3 instructions and then stalls, the scheduler picks warp #2, issues 3 instructions from it, and so on. When the scheduler gets to warp #8, there are already 21 instructions in the pipeline from the 7 stalled warps (7 × 3). Issuing a single instruction from warp #8 is then enough to fill the pipeline entirely. By the time the pipeline starts to drain, warp #1 is ready again, which is why 8 warps with an ILP of 3 (22 / 3 ≈ 7.3, rounded up) are enough to keep a 22-stage pipeline full.
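The walkthrough above can be checked with a toy simulation. This is a deliberately simplified model and not how real hardware schedules: it assumes one issue slot per cycle, that every warp issues exactly `ilp` instructions per batch, and that a stalled warp becomes ready again once the first instruction of its batch has left the pipeline (`depth` cycles after it was issued). The function name `issue_utilization` and its parameters are hypothetical.

```python
def issue_utilization(num_warps, ilp=3, depth=22, cycles=10_000):
    """Fraction of cycles in which the scheduler managed to issue an instruction."""
    ready_at = [0] * num_warps     # first cycle at which each warp may issue again
    left = [ilp] * num_warps       # independent instructions left in the current batch
    batch_start = [0] * num_warps  # cycle at which the current batch started
    current = 0                    # warp the scheduler tries first
    issued = 0

    for cycle in range(cycles):
        # Prefer the current warp, otherwise take the next ready one in order.
        chosen = None
        for offset in range(num_warps):
            w = (current + offset) % num_warps
            if ready_at[w] <= cycle:
                chosen = w
                break
        if chosen is None:
            continue  # every warp is stalled: a bubble enters the pipeline

        if left[chosen] == ilp:
            batch_start[chosen] = cycle
        left[chosen] -= 1
        issued += 1

        if left[chosen] == 0:
            # Hazard: the warp stalls until its batch starts leaving the pipeline.
            ready_at[chosen] = batch_start[chosen] + depth
            left[chosen] = ilp
            current = (chosen + 1) % num_warps
        else:
            current = chosen

    return issued / cycles

print(issue_utilization(8))  # 1.0: an instruction is issued every cycle
print(issue_utilization(7))  # below 1.0: one bubble per trip around the warps
```

Under these assumptions, 7 warps put only 21 instructions in flight before warp #1 is ready again, so one issue slot per round goes unused, while 8 warps leave none.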

answered Apr 30 '26 03:04 by user703016


