Note: This question is specific to NVIDIA Compute Capability 2.1 devices. The following information is taken from the CUDA Programming Guide v4.1:
In compute capability 2.1 devices, each SM has 48 SPs (cores) for integer and floating-point operations. Each warp is composed of 32 consecutive threads. Each SM has 2 warp schedulers. At every instruction issue time, each warp scheduler picks a ready warp and issues 2 instructions for that warp on the cores.
My doubts:
The Streaming Multiprocessors (SMs) of a Graphics Processing Unit (GPU) execute instructions from groups of consecutive threads called warps. At each cycle, an SM schedules a warp from a pool of active warps and can switch among the active warps to hide various stalls.
x GPUs issue one instruction per warp every 4 cycles, and since the latency of the arithmetic pipeline is 24 cycles, it can be completely hidden by having 24 / 4 = 6 active warps at any one time.
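The arithmetic behind the six-warp figure can be sketched as follows. This is a back-of-the-envelope model only: the function name warps_to_hide_latency is made up for illustration, and it assumes each warp has a single instruction in flight at a time (no instruction-level parallelism).

```python
import math

def warps_to_hide_latency(pipeline_latency_cycles, issue_interval_cycles):
    """Active warps needed so the scheduler always has a ready warp.

    Assumption: a warp issues one instruction, then waits the full
    pipeline latency before its next (dependent) instruction can issue.
    """
    return math.ceil(pipeline_latency_cycles / issue_interval_cycles)

# Figures quoted in the text: 24-cycle latency, one issue every 4 cycles.
print(warps_to_hide_latency(24, 4))  # 6
```

If warps can issue independent instructions back to back, fewer warps may be enough, so the result is better read as a rough upper estimate than as a hard requirement.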
NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths.
In CUDA, groups of threads with consecutive thread indexes are bundled into warps; the threads of a warp execute the same instruction together, spread across the CUDA cores of an SM rather than on a single core. At runtime, a thread block is divided into a number of warps for execution on the cores of an SM. The warp size is a hardware property; it has been 32 threads on all NVIDIA GPUs to date.
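How a thread block is split into warps can be illustrated with a short sketch. It assumes a one-dimensional block (so the thread index is already linear) and a warp size of 32; the helper names warp_and_lane and warps_per_block are hypothetical, not CUDA API calls.

```python
WARP_SIZE = 32  # assumed here; on a real device, query the warpSize property

def warp_and_lane(thread_index, warp_size=WARP_SIZE):
    """Map a linear thread index to (warp index, lane within the warp)."""
    return divmod(thread_index, warp_size)

def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps a block occupies; a partial last warp still counts."""
    return -(-threads_per_block // warp_size)  # ceiling division

print(warp_and_lane(31))     # (0, 31): last lane of warp 0
print(warp_and_lane(32))     # (1, 0): first lane of warp 1
print(warps_per_block(100))  # 4: the fourth warp has only 4 active threads
```

A block size that is not a multiple of the warp size leaves lanes idle in the last warp, which is one reason block sizes are usually chosen as multiples of 32.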
This is instruction-level parallelism (ILP). The instructions issued from a warp simultaneously must be independent of each other. They are issued by the SM instruction scheduler to separate functional units in the SM.
For example, if there are two independent FMAD instructions in the warp's instruction stream that are ready to issue and the SM has two available sets of FMAD units on which to issue them, they can both be issued in the same cycle. (Instructions can be issued together in various combinations, but I have not memorized them so I won't provide details here.)
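The independence requirement can be made concrete with a toy check. This is only a sketch of register-level dependencies: the Instr tuple, the register names, and the independent helper are all invented for illustration, and the hardware's actual pairing rules are not modelled.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, several sources.
Instr = namedtuple("Instr", ["name", "dest", "srcs"])

def independent(first, second):
    """True if `second` has no RAW, WAW or WAR dependency on `first`."""
    raw = first.dest in second.srcs   # second reads what first writes
    waw = first.dest == second.dest   # both write the same register
    war = second.dest in first.srcs   # second overwrites an input of first
    return not (raw or waw or war)

a = Instr("fmad_1", "r0", ("r1", "r2", "r3"))  # r0 = r1 * r2 + r3
b = Instr("fmad_2", "r4", ("r5", "r6", "r7"))  # r4 = r5 * r6 + r7
c = Instr("fmad_3", "r8", ("r0", "r6", "r7"))  # reads r0, written by fmad_1

print(independent(a, b))  # True: candidates for issue in the same cycle
print(independent(a, c))  # False: fmad_3 must wait for fmad_1's result
```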
The FMAD/IMAD execution units in SM 2.1 are 16 SPs wide. This means that it takes 2 cycles to issue a warp (32-thread) instruction to one of the 16-wide execution units. There are multiple (3) of these 16-wide execution units (48 SPs total) per SM, plus special function units. Each warp scheduler can issue to two of them per cycle.
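The numbers in the paragraph above fit together as in this small calculation. The constants are taken from the text; the variable names are just labels chosen for the sketch.

```python
WARP_SIZE = 32     # threads per warp
UNIT_WIDTH = 16    # SPs per FMAD/IMAD execution unit on SM 2.1, per the text
UNITS_PER_SM = 3   # number of 16-wide execution units per SM, per the text

# A 32-thread warp instruction is fed to a 16-wide unit as two half warps.
cycles_per_warp_instruction = WARP_SIZE // UNIT_WIDTH
sps_per_sm = UNIT_WIDTH * UNITS_PER_SM

print(cycles_per_warp_instruction)  # 2
print(sps_per_sm)                   # 48
```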
Assume the FMAD execution units are pipe_A, pipe_B and pipe_C. Let us say that at cycle 135, there are two independent FMAD instructions fmad_1 and fmad_2 that are waiting:
At cycle 135, the warp scheduler issues the first half warp of fmad_1 to FMAD pipe_A, and the first half warp of fmad_2 to FMAD pipe_B.
At cycle 136, the first half warp of fmad_1 will have moved to the next stage in FMAD pipe_A, and similarly the first half warp of fmad_2 will have moved to the next stage in FMAD pipe_B. The warp scheduler now issues the second half warp of fmad_1 to FMAD pipe_A, and the second half warp of fmad_2 to FMAD pipe_B.
So it takes 2 cycles to issue 2 instructions from the same warp. But as the OP mentions, there are two warp schedulers, which means this whole process can be done simultaneously for instructions from another warp (assuming there are sufficient functional units). Hence the maximum issue rate is 2 warp instructions per cycle. Note that this is an abstracted view from a programmer's perspective; the actual low-level architectural details may be different.
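The half-warp issue sequence in this example can be sketched as a small timeline. This is a programmer-level abstraction, not a description of the real pipeline; the function issue_timeline is hypothetical, and it simply assumes that every co-issued instruction sends one half warp per cycle to its own 16-wide pipe.

```python
WARP_SIZE = 32
UNIT_WIDTH = 16  # lanes per FMAD pipe, per the text

def issue_timeline(start_cycle, assignments):
    """Return (cycle, instruction, half_warp, pipe) tuples.

    `assignments` lists (instruction, pipe) pairs issued together
    from one warp; each pair occupies its pipe for two cycles.
    """
    halves = WARP_SIZE // UNIT_WIDTH
    timeline = []
    for half in range(halves):
        for instr, pipe in assignments:
            timeline.append((start_cycle + half, instr, half, pipe))
    return timeline

events = issue_timeline(135, [("fmad_1", "pipe_A"), ("fmad_2", "pipe_B")])
for cycle, instr, half, pipe in events:
    print(f"cycle {cycle}: half warp {half} of {instr} -> {pipe}")
# cycle 135: half warp 0 of fmad_1 -> pipe_A
# cycle 135: half warp 0 of fmad_2 -> pipe_B
# cycle 136: half warp 1 of fmad_1 -> pipe_A
# cycle 136: half warp 1 of fmad_2 -> pipe_B
```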
As for your question about when the warp will be ready next: if there are more instructions that don't depend on any outstanding (already issued but not retired) instructions, they can be issued in the very next cycle. But as soon as the only available instructions depend on in-flight instructions, the warp will not be able to issue. However, that is where other warps come in -- the SM can issue instructions for any resident warp that has available (non-blocked) instructions. This arbitrary switching between warps is what provides the "latency hiding" that GPUs depend on for high throughput.
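The effect of switching between warps can be seen in a toy simulation. The model is deliberately crude and entirely an assumption of this sketch: a single issue slot per cycle, every instruction dependent on the previous one from the same warp, a fixed latency, and a scheduler that picks the lowest-numbered ready warp. The function issue_utilization is hypothetical.

```python
def issue_utilization(num_warps, instrs_per_warp, latency):
    """Fraction of cycles in which the single issue slot is used.

    Toy model: a warp becomes ready again `latency` cycles after it
    issues, because its next instruction depends on the previous one.
    """
    ready_at = [0] * num_warps              # cycle at which each warp may issue
    remaining = [instrs_per_warp] * num_warps
    total = num_warps * instrs_per_warp
    issued = 0
    cycle = 0
    while issued < total:
        for w in range(num_warps):
            if remaining[w] > 0 and ready_at[w] <= cycle:
                ready_at[w] = cycle + latency
                remaining[w] -= 1
                issued += 1
                break                       # at most one issue per cycle
        cycle += 1
    return issued / cycle

print(issue_utilization(1, 100, 6))  # a lone warp leaves most cycles idle
print(issue_utilization(6, 100, 6))  # 1.0: six warps cover a 6-cycle latency
```

In this model the issue slot stays busy once the number of resident warps reaches the latency measured in issue slots; with fewer warps, the scheduler has cycles in which no warp is ready.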