As the answer to this question shows, when an SM contains 8 CUDA cores (Compute Capability 1.3), a single warp of 32 threads takes 4 clock cycles to execute one instruction for the whole warp.
That is, lanes 1 to 8 of the warp run concurrently on the 8 cores, then lanes 9 to 16, then lanes 17 to 24, and finally lanes 25 to 32.
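The lane-group schedule described above can be sketched as a toy model (this is illustrative Python, not real hardware behavior; the constants are taken from the question):

```python
# Toy model of one warp instruction issuing on an SM with 8 CUDA cores
# (Compute Capability 1.3). Each clock cycle, one group of 8 lanes runs.
WARP_SIZE = 32
CORES_PER_SM = 8  # CC 1.3

cycles = WARP_SIZE // CORES_PER_SM  # -> 4 clock cycles per warp instruction
for cycle in range(cycles):
    first_lane = cycle * CORES_PER_SM + 1
    last_lane = (cycle + 1) * CORES_PER_SM
    print(f"cycle {cycle + 1}: lanes {first_lane}-{last_lane}")
```

Running it prints the four lane groups (1-8, 9-16, 17-24, 25-32), one per cycle, matching the description above.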
Do I understand this correctly?
So my question is: on newer devices there are 32 (Compute Capability 2.0), 48 (CC 2.1), or 192 (CC 3.0, Kepler) CUDA cores per SM, yet the warp size is still 32. How does a warp's execution map onto that many cores?
The CUDA core count is the number of single-precision floating-point units in the SM. The SM has other execution units as well, including special function units (RSQRT, COS, SIN, ...), double-precision units, load/store units, texture units, the branch unit, etc.
The Fermi, Kepler GK10x, Kepler GK110, and Maxwell whitepapers contain additional information on the type and number of execution units in the SMs.
The instruction throughput of arithmetic instructions can be found in the CUDA Programming Guide, in the table "Throughput of Arithmetic Instructions".
As a developer, you want to understand the rate at which an SM can issue instructions, which is documented in that throughput table. The rate is determined by the throughput of the warp schedulers as well as the throughput of the execution units (again, not just the CUDA cores).
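To make the arithmetic concrete, here is a small sketch that derives the cycles an SM needs to retire one warp instruction from the throughput table's results-per-clock figures. The numbers below are the 32-bit floating-point add/multiply throughputs from the CUDA Programming Guide for the compute capabilities mentioned in the question; treat the function as a back-of-the-envelope model, not a precise simulator (real issue rates also depend on scheduler counts and dual issue):

```python
# Back-of-the-envelope model: cycles for one warp's FP32 add/mul
# instruction, given results-per-clock-per-SM from the throughput table.
WARP_SIZE = 32

# Compute capability -> FP32 add/mul results per clock per SM
# (values from the CUDA Programming Guide throughput table).
fp32_results_per_clock = {
    "1.3": 8,
    "2.0": 32,
    "2.1": 48,
    "3.0": 192,
}

def cycles_per_warp_instruction(cc: str) -> float:
    # A warp produces 32 results, so cycles = warp size / throughput.
    return WARP_SIZE / fp32_results_per_clock[cc]

print(cycles_per_warp_instruction("1.3"))  # 4.0 (matches the CC 1.3 case above)
print(cycles_per_warp_instruction("2.0"))  # 1.0
```

Note that on CC 3.0 this naive division gives less than one cycle per warp instruction, which is why Kepler SMs rely on multiple warp schedulers issuing several warps per clock.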
CC 1.x Tesla
CC 2.x Fermi
CC 3.x Kepler
CC 5.0 Maxwell