Do NVIDIA GPUs support out-of-order execution?
My first guess is that they don't contain such expensive hardware. However, when reading the CUDA programming guide, the guide recommends using Instruction-Level Parallelism (ILP) to improve performance.
Isn't ILP a feature that hardware with out-of-order execution can take advantage of? Or does NVIDIA's ILP simply mean compiler-level reordering of instructions, so the order is still fixed at runtime? In other words, does the compiler and/or programmer just have to arrange the instructions in such a way that ILP can be achieved at runtime through in-order execution?
A typical ILP implementation allows multiple-cycle operations to be pipelined. For example, suppose four operations can be issued in a single clock cycle. Then the ILP execution hardware has four functional units, each attached to one of the operations, plus a branch unit and a common register file.
With ILP there is still a single thread of execution within a process. Concurrency, on the other hand, involves assigning multiple threads to a CPU core in strict alternation, or running them in true parallelism if there are enough cores, ideally one core per runnable thread.
To exploit ILP, parallelism or pipelining is used. The basic idea of pipelining is to work on multiple instructions in one clock cycle, which is only possible if there are no dependencies between the instructions.
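To make the dependency point concrete, here is a minimal sketch in CUDA device code (hypothetical function and variable names) contrasting a serial dependency chain with independent operations:

```cuda
// Serial dependency chain: each multiply consumes the previous
// result, so an in-order pipeline must wait at every step -- no ILP.
__device__ float dependent_chain(float x, float y, float z, float w)
{
    float a = x * y;   // a depends on x, y
    float b = a * z;   // b must wait for a
    float c = b * w;   // c must wait for b
    return c;
}

// Independent operations: none of the three multiplies reads
// another's result, so an in-order pipeline can overlap their
// execution -- ILP without any reordering hardware.
__device__ float independent_ops(float x, float y, float z,
                                 float w, float u, float v)
{
    float p = x * y;
    float q = z * w;   // independent of p
    float r = u * v;   // independent of p and q
    return p + q + r;
}
```

The compiler (or the programmer, by writing code this way) produces the second pattern, and the in-order pipeline does the rest.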
Pipelining is a common ILP technique and is certainly implemented on NVIDIA's GPUs. I think you will agree that pipelining doesn't rely on out-of-order execution. In addition, NVIDIA GPUs have multiple warp schedulers per SM from compute capability 2.0 onward (2 or 4). If your code has two (or more) consecutive, independent instructions in a thread (or the compiler reorders it that way), the schedulers exploit this ILP as well.
Here is a well-explained question on how a 2-wide warp scheduler and pipelining work together: How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?
Also check out Vasily Volkov's presentation from GTC 2010. He experimentally showed how ILP improves CUDA code performance. http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
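The kind of ILP Volkov measured can be sketched as follows: a hypothetical SAXPY-style kernel in which each thread processes several independent elements, so the loads and multiply-adds for different elements can overlap in the in-order pipeline (the kernel name and the factor of 4 are illustrative, not from the presentation):

```cuda
#define ILP 4  // independent elements per thread (illustrative choice)

// Hypothetical kernel computing y[i] = a * x[i] + y[i].
// Each thread handles ILP independent elements, so the compiler can
// interleave their loads and FMAs -- the instructions are simply
// independent, and no reordering hardware is needed.
__global__ void saxpy_ilp(int n, float a, const float *x, float *y)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
    if (i + ILP <= n) {
        float x0 = x[i],     y0 = y[i];      // the four pairs of loads
        float x1 = x[i + 1], y1 = y[i + 1];  // are independent, so
        float x2 = x[i + 2], y2 = y[i + 2];  // their latencies overlap
        float x3 = x[i + 3], y3 = y[i + 3];
        y[i]     = a * x0 + y0;  // four independent multiply-adds
        y[i + 1] = a * x1 + y1;
        y[i + 2] = a * x2 + y2;
        y[i + 3] = a * x3 + y3;
    }
}
```

With fewer threads each doing more independent work per thread, the pipeline stays busy even when occupancy is low, which is the effect Volkov measured.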
As for out-of-order execution on GPUs, I don't think so. Hardware instruction reordering, speculative execution, and all that kind of machinery are too expensive to implement per SM, as you note. And thread-level parallelism fills the gap left by the lack of out-of-order execution: when a true dependency is encountered, other warps can kick in and fill the pipeline.