GTX 4xx, 5xx (Fermi) had dynamic scheduling and GTX 6xx (Kepler) switched to static scheduling.
I assume you're referring to static/dynamic instruction scheduling in hardware.
Dynamic instruction scheduling means that the processor may re-order individual instructions at runtime. This usually involves dedicated hardware that tries to pick the best order for whatever is currently in the instruction pipeline. On the GPUs you mentioned, this refers to the re-ordering of instructions within each individual warp.
The reason for switching from a dynamic to a static scheduler is described in the GK110 Architecture Whitepaper as follows:
We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:
- Register scoreboarding for long latency operations (texture and load)
- Inter‐warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
- Thread block level scheduling (e.g., the GigaThread engine)
However, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi‐port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.
For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power‐expensive blocks with a simple hardware block that extracts the pre‐determined latency information and uses it to mask out warps from eligibility at the inter‐warp scheduler stage.
So essentially they're trading scheduling hardware for power efficiency: the Kepler scheduler is simpler, and the work it no longer does is picked up by the compiler, which can determine the best issue order up front, at least for the fixed-latency math pipeline.
As for your final question, i.e. what can be done in code to optimize an algorithm for static or dynamic scheduling, my personal recommendation would be to not use any inline assembler and simply let the compiler and scheduler do their thing.