As I understand it, GPUs switch between warps to hide memory latency. But I wonder under which conditions a warp will be switched out. For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds? Thanks
A GPU context is described here. It represents all the state (data, variables, conditions, etc.) that is collectively required and instantiated to perform certain tasks (e.g. CUDA compute, graphics, H.264 encode, etc.).
Context switching involves storing the context or state of a process so that it can be reloaded when required and execution can resume from the same point as before. This is a feature of a multitasking operating system that allows a single CPU to be shared by multiple processes.
Example of context switching: one process is running on the CPU, executing its task. While it is running, another process with a higher priority arrives in the ready queue and needs the CPU to complete its own task.
A context switch captures the CPU state (the context) of the currently running thread and pauses it, then swaps in the state of another thread so that it can resume running where it previously left off.
First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
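If you want to see how register and shared-memory usage constrains residency in practice, the CUDA runtime exposes an occupancy query. Below is a minimal sketch (the kernel `myKernel` is just a placeholder) that prints a kernel's per-thread register count, its static shared memory, and how many blocks of a given size can be resident on one SM at a time. If that count is 0 for your block size, the block simply cannot fit on an SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; its register/shared-memory footprint is what the queries report on.
__global__ void myKernel(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

int main()
{
    const int blockSize = 256;

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);   // per-thread registers, static shared memory, ...

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic shared mem */);

    printf("registers/thread: %d, static shared mem: %zu bytes, "
           "resident blocks per SM at blockDim=%d: %d\n",
           attr.numRegs, attr.sharedSizeBytes, blockSize, blocksPerSM);
    return 0;
}
```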
So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.
The SM does, however, choose which instructions to issue from among all resident warps. In fact, the SM is more likely to issue two consecutive instructions from different warps than from the same warp, regardless of instruction type or how much ILP (instruction-level parallelism) is available. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not only memory latency.
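To see the arithmetic-latency point for yourself, here is a rough sketch (not a rigorous microbenchmark; exact numbers vary by architecture and clock behavior) that runs a long chain of dependent adds in a single block, first with one warp and then with 32 warps. Because a single block runs on a single SM, the elapsed times tend to come out roughly comparable: the extra warps' instructions fill the issue slots that one warp leaves idle while waiting on its own dependent results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread executes a long chain of *dependent* adds, so one warp on its own
// cannot keep the arithmetic pipeline busy: every add must wait for the previous
// result. Extra warps give the scheduler independent instructions to issue in
// those otherwise idle cycles.
__global__ void dependentAdds(float *out, int iters, float inc)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x + inc;                      // serial dependency chain
    out[threadIdx.x] = x;
}

int main()
{
    const int iters = 1 << 20;
    const int sizes[] = {32, 1024};       // 1 warp vs. 32 warps; one block -> one SM

    float *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));

    for (int blockSize : sizes) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        dependentAdds<<<1, blockSize>>>(d_out, iters, 1.0f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%2d warp(s): %.3f ms\n", blockSize / 32, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    cudaFree(d_out);
    return 0;
}
```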
In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.