As stated in this work:
If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, an average of three instructions can be executed before a hazard), and the instruction pipeline depth is 22 stages, as few as eight active warps (22 / 3) may be sufficient to completely hide instruction latency and achieve max arithmetic throughput.
I don't understand why this is sufficient.
If the scheduler can successfully issue an instruction from the same warp at every instruction issue cycle for 22 consecutive cycles, then the scheduler has no reason to schedule another warp in its place and that single warp is enough to fill the pipeline. That would correspond to an ILP of at least 22.
But Real-World Code™ never exhibits such high ILP: for example, some instructions depend on the results of previous instructions or on memory requests. When the scheduler can no longer find an independent instruction to issue, the execution of that warp stalls. The scheduler then picks another warp which is ready to execute, issues as many instructions as it can until that warp also stalls, and so on.
So if warp #1 successfully issues 3 instructions and then stalls, the scheduler picks warp #2, issues 3 instructions... etc. By the time the scheduler gets to warp #8, there are already 21 instructions in the pipeline from the 7 stalled warps. Issuing a single instruction from that warp is then enough to fill the pipeline entirely. By the time the pipeline starts to drain, warp #1 is ready again, which is why 8 warps with an ILP of 3 are enough to keep a 22-stage pipeline full.
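You can sketch this argument with a toy simulation. Note this is an illustrative model, not the real hardware scheduler: the round-robin order, the `issue_stalls` helper, and the assumption that each group of ILP instructions depends only on the first instruction of the previous group are all simplifications I'm introducing here.

```python
def issue_stalls(num_warps, ilp=3, depth=22, rounds=10):
    """Count idle issue cycles for a round-robin scheduler over `rounds` rounds.

    Toy model: each warp issues `ilp` independent instructions back-to-back,
    then stalls until the result of the first instruction of that group is
    available, `depth` cycles after it was issued.
    """
    ready = [0] * num_warps  # earliest cycle each warp can issue again
    cycle = 0
    stalls = 0
    for _ in range(rounds):
        for w in range(num_warps):
            if cycle < ready[w]:           # warp not ready: pipeline bubble
                stalls += ready[w] - cycle
                cycle = ready[w]
            ready[w] = cycle + depth       # dependency resolves `depth` cycles later
            cycle += ilp                   # issue `ilp` back-to-back instructions
    return stalls

print(issue_stalls(8))               # 8 warps, ILP 3, depth 22 → 0 (no bubbles)
print(issue_stalls(7))               # 7 warps → bubbles every round
print(issue_stalls(1, ilp=22))       # single warp with ILP 22 → 0 (no bubbles)
```

With 8 warps the issue slots are never idle (8 × 3 = 24 ≥ 22, so warp #1's result is back before its turn comes around again), while 7 warps only cover 21 of the 22 cycles and leave a bubble in every scheduling round. The last line corresponds to the single-warp, ILP ≥ 22 case described above.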