Compute Workload Analysis displays the utilization of different compute pipelines. I know that in a modern GPU, integer and floating point pipelines are different hardware units and can execute in parallel. However, it is not very clear which pipeline represents which hardware unit for the other pipelines. I also couldn't find any documentation online about abbreviations and interpretations of the pipelines.
My questions are:
1) What are the full names of ADU, CBU, TEX, XU? How do they map to the hardware?
2) Which of the pipelines use the same hardware unit (e.g., do FP16, FMA, and FP64 all use the floating-point unit)?
3) A warp scheduler in a modern GPU can schedule 2 instructions per cycle (using different pipelines). Which pipelines can be used at the same time (e.g., FMA-ALU, FMA-SFU, ALU-Tensor, etc.)?
P.S.: I am adding a screenshot for those who are not familiar with Nsight Compute.
The Volta (CC 7.0) and Turing (CC 7.5) SM consists of 4 sub-partitions (SMSP). Each sub-partition contains:
- warp scheduler
- register file
- immediate constant cache
- execution units
  - ALU, FMA, FP16, UDP (7.5+), and XU
  - FP64 on compute-centric parts (GV100)
  - Tensor units
The SM contains several other partitions that contain execution units and resources shared by the 4 sub-partitions, including:
- instruction cache
- index constant cache
- L1 data cache that is partitioned into tagged RAM and shared memory
- execution units
  - ADU, LSU, TEX
  - On non-compute parts, FP64 and Tensor may be implemented as a shared execution unit
In Volta (CC 7.0, 7.2) and Turing (CC 7.5), each SM sub-partition can issue 1 instruction per cycle. The instruction can be issued to a local execution unit or to one of the SM's shared execution units.
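As a rough illustration of what the 1-instruction-per-cycle issue rate implies, here is a minimal Python sketch that computes a lower bound on the cycles one SM needs to issue a given mix of warp instructions. The function name `min_issue_cycles` and the instruction counts are made up for the example. It assumes work is perfectly balanced across the 4 sub-partitions and ignores per-pipe throughput limits, dependencies, and stalls, so real kernels will take longer.

```python
import math

SUBPARTITIONS_PER_SM = 4       # from the text: 4 SMSPs per Volta/Turing SM
ISSUE_PER_SMSP_PER_CYCLE = 1   # from the text: 1 instruction per cycle per SMSP


def min_issue_cycles(warp_instructions_per_pipe):
    """Lower bound on cycles to issue the given warp instructions on one SM.

    Assumption: instructions are spread evenly over the sub-partitions.
    Per-pipe throughput, dependencies, and stalls are ignored.
    """
    total = sum(warp_instructions_per_pipe.values())
    issue_slots_per_cycle = SUBPARTITIONS_PER_SM * ISSUE_PER_SMSP_PER_CYCLE
    return math.ceil(total / issue_slots_per_cycle)


# Hypothetical instruction mix (warp-level instruction counts per pipe).
mix = {"FMA": 600, "ALU": 300, "XU": 100}
print(min_issue_cycles(mix))  # 1000 instructions / 4 issue slots = 250
```

In this simplified model the pipe an instruction targets does not change the bound: with a single issue slot per sub-partition per cycle, an FMA and an ALU instruction from the same sub-partition still take two issue cycles.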
- ADU - Address Divergence Unit. The ADU is responsible for per-thread address divergence handling for branches/jumps and indexed constant loads before instructions are forwarded to other execution units.
- ALU - Arithmetic Logic Unit. The ALU is responsible for execution of most integer instructions, bit manipulation instructions, and logic instructions.
- CBU - Convergence Barrier Unit. The CBU is responsible for barrier, convergence, and branch instructions.
- FMA - Floating point Multiply and Accumulate Unit. The FMA is responsible for most FP32 instructions, integer multiply and accumulate instructions, and integer dot product.
- FP16 - Paired half-precision floating point unit. The FP16 unit is responsible for execution of paired half-precision floating point instructions.
- FP64 - Double precision floating point unit. The FP64 unit is responsible for all FP64 instructions. FP64 is often implemented as several different pipes on NVIDIA GPUs. The throughput varies greatly per chip.
- LSU - Load Store Unit. The LSU is responsible for load, store, and atomic instructions to global, local, and shared memory.
- Tensor (FP16) - Half-precision floating point matrix multiply and accumulate unit.
- Tensor (INT) - Integer matrix multiply and accumulate unit.
- TEX - Texture Unit. The texture unit is responsible for sampling, load, and filtering instructions on textures and surfaces.
- UDP (Uniform) - Uniform Data Path. A scalar unit used to execute instructions whose inputs and outputs are identical for all threads in a warp.
- XU - Transcendental and Data Type Conversion Unit. The XU is responsible for special functions such as sin, cos, and reciprocal square root, as well as data type conversions.
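To tie the glossary back to the SM layout described above, the pipes can be put in a small lookup table. This is only a sketch: the `PIPES` table and the `group_by_location` helper are hypothetical names, and the locations are taken from the lists in this answer. The CBU's location is not stated there, so it is marked as unspecified rather than guessed.

```python
# Pipe name -> (full name, where the unit lives according to the lists above).
# "chip-dependent" means local on compute parts, possibly shared elsewhere.
PIPES = {
    "ADU": ("Address Divergence Unit", "shared"),
    "ALU": ("Arithmetic Logic Unit", "sub-partition"),
    "CBU": ("Convergence Barrier Unit", "unspecified"),  # not stated above
    "FMA": ("Floating point Multiply and Accumulate Unit", "sub-partition"),
    "FP16": ("Paired half-precision floating point unit", "sub-partition"),
    "FP64": ("Double precision floating point unit", "chip-dependent"),
    "LSU": ("Load Store Unit", "shared"),
    "Tensor (FP16)": ("Half-precision matrix multiply and accumulate unit", "chip-dependent"),
    "Tensor (INT)": ("Integer matrix multiply and accumulate unit", "chip-dependent"),
    "TEX": ("Texture Unit", "shared"),
    "UDP": ("Uniform Data Path", "sub-partition"),  # CC 7.5+ only
    "XU": ("Transcendental and Data Type Conversion Unit", "sub-partition"),
}


def group_by_location(pipe_names):
    """Group pipe names (as shown in the profiler) by where the unit lives."""
    groups = {}
    for name in pipe_names:
        _, location = PIPES[name]
        groups.setdefault(location, []).append(name)
    return groups


print(group_by_location(["FMA", "ALU", "LSU", "TEX", "XU"]))
# {'sub-partition': ['FMA', 'ALU', 'XU'], 'shared': ['LSU', 'TEX']}
```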