
How are warps scheduled on CUDA SMs?

Tags:

cuda

As the answer to this question shows, when an SM contains 8 CUDA cores (Compute Capability 1.3), a single warp of 32 threads takes 4 clock cycles to execute a single instruction for the whole warp.

That is, lanes 1 to 8 of the warp run concurrently on the 8 cores, then lanes 9 to 16, then lanes 17 to 24, and finally lanes 25 to 32.

Do I understand this correctly?

My question is about newer devices: there are 32 (Compute Capability 2.0), 48 (2.1), or 192 (3.0, Kepler) CUDA cores per SM, but the warp size is still 32.

  • How are warps scheduled on these newer SMs?
  • Do lanes 1 to 32 run together, or in groups (lanes 1 to 8, lanes 9 to 16, ...) as on the older SMs described above?
Danny Zhu asked May 08 '14 02:05


1 Answer

The number of CUDA cores is the number of single-precision floating-point units in the SM. The SM also has other execution units, including special function units (RSQRT, COS, SIN, ...), double-precision units, load/store units, texture units, a branch unit, etc.

The Fermi, Kepler-gk10x, Kepler-gk110 and Maxwell whitepapers contain additional information on the type and number of execution units in the SMs.

The throughput of arithmetic instructions is listed in the CUDA Programming Guide, in the table of throughput of arithmetic instructions.

As a developer, you want to understand the rate at which an SM can issue instructions, which is documented in the throughput table. The rate is determined by the throughput of the warp schedulers as well as the throughput of the execution units (again, not just the CUDA cores).
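As a rough illustration of why both factors matter, the sketch below takes the smaller of an issue-side limit and an execution-side limit. It is a simplified model, not a description of the hardware: it assumes each execution unit handles one lane per cycle and ignores latency, dependencies, and the other unit types. The function names are hypothetical; the 192-core figure comes from the question and the 4 schedulers with up to 2 instructions each come from the Kepler list in this answer.

```python
WARP_SIZE = 32  # threads per warp


def unit_limited_rate(units_per_sm, warp_size=WARP_SIZE):
    """Warp instructions per cycle that a pool of execution units can absorb.

    Assumption: each unit processes one lane per cycle.
    """
    return units_per_sm / warp_size


def sustained_rate(issue_rate, units_per_sm):
    """Simplified bound: whichever of issue and execution is slower wins."""
    return min(issue_rate, unit_limited_rate(units_per_sm))


# Kepler-style numbers: 4 schedulers, up to 2 instructions each, 192 CUDA cores.
kepler_issue_rate = 4 * 2
print(sustained_rate(kepler_issue_rate, 192))
```

Under these assumptions the 192 CUDA cores can absorb 6 single-precision warp instructions per cycle, fewer than the 8 the schedulers could issue at most, so the remaining issue slots are only useful for instructions that go to other execution units.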

CC1.x Tesla

  • 1 warp scheduler per SM
  • Each warp scheduler selects 1 eligible warp and issues 1 instruction per 4 cycles.
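The 4-cycle figure matches the arithmetic in the question: 32 lanes spread over 8 CUDA cores. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the grouping into consecutive lanes is the question's assumption rather than something documented here.

```python
WARP_SIZE = 32  # threads per warp


def cycles_per_warp_instruction(lanes_per_cycle, warp_size=WARP_SIZE):
    """Cycles needed to push one warp instruction through `lanes_per_cycle` units."""
    # Ceiling division: a partially filled last group still costs a full cycle.
    return -(-warp_size // lanes_per_cycle)


def lane_groups(lanes_per_cycle, warp_size=WARP_SIZE):
    """Assumed grouping: consecutive lanes, numbered from 1 as in the question."""
    return [
        (first, min(first + lanes_per_cycle - 1, warp_size))
        for first in range(1, warp_size + 1, lanes_per_cycle)
    ]


print(cycles_per_warp_instruction(8))  # 4
print(lane_groups(8))  # [(1, 8), (9, 16), (17, 24), (25, 32)]
```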

CC2.x Fermi

  • 2 warp schedulers per SM
  • CC2.0: Each warp scheduler selects 1 eligible warp per tepid clock and issues 1 instruction.
  • CC2.1: Each warp scheduler selects 1 eligible warp per tepid clock and issues up to 2 independent instructions.
  • The math pipes run at hot clock (2x tepid clock). This often results in people stating that instructions are issued over 2 clock cycles. It is easier to think in terms of tepid clock.

CC3.x Kepler, CC5.0 Maxwell

  • 4 warp schedulers per SM
  • Each warp scheduler selects 1 eligible warp and issues up to 2 independent instructions.
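Putting the scheduler figures side by side gives an upper bound on how many warp instructions an SM can issue per scheduler cycle. The sketch below only multiplies out the numbers listed above. The table name, keys, and function name are hypothetical, "cycle" means whichever clock the list uses for that architecture (tepid clock on Fermi), and real throughput is further limited by the execution units.

```python
from fractions import Fraction

# (warp schedulers per SM, max instructions per scheduler per issue slot,
#  cycles per issue slot), copied from the lists in this answer.
ISSUE_MODEL = {
    "Tesla CC1.x": (1, 1, 4),
    "Fermi, single issue": (2, 1, 1),
    "Fermi, dual issue": (2, 2, 1),
    "Kepler CC3.x": (4, 2, 1),
    "Maxwell CC5.0": (4, 2, 1),
}


def peak_warp_instructions_per_cycle(arch):
    """Upper bound on warp instructions issued per cycle by one SM."""
    schedulers, max_issue, cycles_per_slot = ISSUE_MODEL[arch]
    return Fraction(schedulers * max_issue, cycles_per_slot)


for arch in ISSUE_MODEL:
    print(arch, peak_warp_instructions_per_cycle(arch))
```

These are best-case issue rates: the "up to 2" cases require two independent instructions from the same warp, so a dependent instruction stream issues at half that rate.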
Greg Smith answered Oct 12 '22 10:10