
What is static and dynamic scheduling on GPUs?

Tags: cuda, gpu, nvidia

The GTX 4xx and 5xx series (Fermi) used dynamic scheduling, while the GTX 6xx series (Kepler) switched to static scheduling.

  • What is static and dynamic scheduling in the context of GPUs?
  • How does the design choice of static vs. dynamic affect the performance of real-world compute workloads?
  • Is there anything that can be done in code to optimize an algorithm for static or dynamic scheduling?
Asked Nov 28 '25 by Roger Dahl


1 Answer

I assume you're referring to static/dynamic instruction scheduling in hardware.

Dynamic instruction scheduling means that the processor may re-order individual instructions at runtime. This usually involves dedicated hardware that tries to determine the best order for whatever is in the instruction pipeline. On the GPUs you mentioned, this refers to the re-ordering of instructions within each individual warp.
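To get a feel for what that hardware does, here is a toy Python model of dynamic scheduling (not NVIDIA's actual design; all instruction names and latencies below are made up): a scoreboard records which destination registers are still in flight, and each cycle the scheduler issues the first instruction whose source registers are ready, even if an earlier instruction is still blocked.

```python
# Toy model of dynamic (hardware) instruction scheduling.
# A scoreboard tracks registers produced by in-flight instructions;
# each cycle, the first instruction with no RAW hazard is issued.
# Instruction names and latencies are purely illustrative.

def dynamic_issue(program, latencies):
    """program: list of (dest, srcs) tuples; returns the issue order."""
    pending = {}              # register -> cycle at which its value is ready
    waiting = list(program)
    order, cycle = [], 0
    while waiting:
        # drop registers whose producing instruction has completed
        pending = {r: c for r, c in pending.items() if c > cycle}
        for instr in waiting:
            dest, srcs = instr
            if not any(s in pending for s in srcs):   # sources ready?
                order.append(dest)
                pending[dest] = cycle + latencies[dest]
                waiting.remove(instr)
                break
        cycle += 1
    return order

# "b" depends on the slow "a", so the independent "c" is issued ahead of it:
print(dynamic_issue([("a", []), ("b", ["a"]), ("c", [])],
                    {"a": 3, "b": 1, "c": 1}))        # ['a', 'c', 'b']
```

The point of the sketch is the runtime cost: the scoreboard lookups and hazard checks happen in hardware on every cycle, which is exactly the logic the whitepaper quote below says Kepler removed.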

The reason for switching from a dynamic scheduler back to a static scheduler is described in the GK110 Architecture Whitepaper as follows:

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

  • Register scoreboarding for long latency operations (texture and load)

  • Inter‐warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)

  • Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi‐port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.

For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power‐expensive blocks with a simple hardware block that extracts the pre‐determined latency information and uses it to mask out warps from eligibility at the inter‐warp scheduler stage.

So essentially, they traded scheduling flexibility for a simpler, more power-efficient scheduler. The efficiency that would otherwise be lost is recovered by the compiler, which, because the math pipeline latencies are fixed and known, can determine the best instruction order up front.
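To make the whitepaper's description concrete, here is a toy Python sketch of the static approach (again, names and latencies are made up, and this is not NVIDIA's actual encoding): a "compiler" pass walks the instruction stream once, computes how many cycles each instruction must stall on its operands, and encodes that count into the instruction itself. The "hardware" side then needs no scoreboard at all; it simply honors the pre-computed stall counts.

```python
# Toy model of Kepler-style static scheduling.
# The compiler knows every latency at compile time, so it can attach a
# stall count to each instruction; the hardware just counts down.
# Instruction names and latencies are purely illustrative.

def compile_stalls(program, latencies):
    """Annotate each (dest, srcs) instruction with a precomputed stall count."""
    ready_at, cycle, annotated = {}, 0, []
    for dest, srcs in program:
        issue = max([ready_at.get(s, 0) for s in srcs] + [cycle])
        annotated.append((dest, issue - cycle))   # stall cycles to encode
        ready_at[dest] = issue + latencies[dest]
        cycle = issue + 1
    return annotated

def run_static(annotated):
    """Hardware side: no dependency checking, just honor the encoded stalls."""
    cycle, trace = 0, []
    for dest, stall in annotated:
        cycle += stall
        trace.append((dest, cycle))               # cycle at which it issues
        cycle += 1
    return trace

# "b" depends on "a" (latency 3), so the compiler encodes a 2-cycle stall
# and the hardware issues "b" at cycle 3 without ever checking registers:
ann = compile_stalls([("a", []), ("b", ["a"])], {"a": 3, "b": 1})
print(ann)                # [('a', 0), ('b', 2)]
print(run_static(ann))    # [('a', 0), ('b', 3)]
```

The design choice this illustrates: all the dependency analysis moved from a per-cycle hardware loop into a one-time compiler pass, while the runtime logic collapsed to a counter, which is where the power saving comes from.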

As for your final question, what can be done in code to optimize an algorithm for static or dynamic scheduling: my personal recommendation is to avoid inline assembly and simply let the compiler and scheduler do their thing.

Answered Nov 30 '25 by Pedro


