Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can 1 CUDA-core to process more than 1 float-point-instruction per clock (Maxwell)?

Tags:

cuda

List of Nvidia GPU - GeForce 900 Series - there is written that:

4 Single precision performance is calculated as 2 times the number of shaders multiplied by the base core clock speed.

I.e. for example for GeForce GTX 970 we can calculate performance:

1664 Cores * 1050 MHz * 2 = 3 494 GFlops peak (3 494 400 MFlops)

This value we can see in column - Processing Power (peak) GFLOPS - Single Precision.

But why we must multiple by 2?

There is written: http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/

SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock.

Ok, nVidia Maxwell is superscalar architecture and dispatching two instructions per clock, but can 1 CUDA-core(FP32-ALU) process more than 1 instruction per clock?

We know that 1 CUDA-Core contain two units: FP32-unit and INT-unit. But INT-unit is irrelevant to GFlops (FLoating-point Operations Per Second).

I.e. one SMM contain:

  • 128 FP32-unit
  • 128 INT-unit
  • 32 SFU-unit
  • 32 LD/ST-unit

To get preformance in GFlops we should to use only: 128 FP32-units and 32 SFU-units.

I.e. if we use both 128 FP32-units and 32 SFU-units simultaneously, then we can get 160 instructions with float-point operations per clock per 1 SM.

I.e. we must multiple by 1,2 =(160/132) instad of 2.

1664 Cores * 1050 MHz * 1,2 = 2 096 GFlops peak

Why has write in wiki that we must multiple Cores*MHz by 2?

enter image description here

enter image description here

like image 252
Alex Avatar asked Dec 14 '22 11:12

Alex


1 Answers

Summary: One FMA counts as 2 FLOPs in the standard accounting of FP throughput, even on machines that do it in a single instruction for a single execution unit (which is how it avoids intermediate rounding, the fused part of FMA).


A CUDA "core" (also called SP - streaming processor) is most commonly referring to the single-precision floating point units in an SM (streaming multiprocessor). A CUDA core can initiate one single precision floating point instruction per clock cycle. (The unit is pipelined, so it can initiate one instruction per clock, and it can retire one instruction per clock, but it cannot fully process a given instruction in a given clock cycle.)

If that instruction is for example, a single-precision add or single precision multiply, then that core can contribute one floating point operation per clock, since an add or multiply counts as one floating point operation. If, on the other hand, the instruction is an FMA instruction (floating point multiply-add) then the core will perform both a floating point multiply AND a floating point add operation in the same time period. This means that effectively two operations are performed by a single instruction. This usage of FMA gives rise to the 2 multiplier when computing peak theoretical throughput.

So a core can only process (i.e. initiate, retire) a single instruction per clock, but if that instruction is an FMA, it counts as two floating point operations.

like image 126
3 revs, 2 users 58% Avatar answered May 18 '23 20:05

3 revs, 2 users 58%