With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states.
Two of the items in this taxonomy are "Short Scoreboard" and "Long Scoreboard", where, I presume, "scoreboard" is used in the sense of out-of-order execution data dependency tracking (see e.g. here).
My questions:
NVIDIA GPUs classify instructions into two groups: fixed latency and variable latency.
The Short Scoreboard and Long Scoreboard stall reasons are reported on instructions that depend on data returned from a variable latency instruction. Short scoreboard stalls are reported for dependencies on variable latency instructions that do not leave the SM, such as slow math (e.g. reciprocal sqrt) or shared memory accesses. Long scoreboard stalls are reported for dependencies on operations that may leave the SM, such as global/local memory accesses and texture fetches.
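The rule above can be written down as a small lookup. This is only an illustrative sketch: the operation labels and the function name `scoreboard_for` are made up for this example and are not profiler or hardware identifiers. The grouping itself follows the description above.

```python
# Hypothetical labels for the producing (variable latency) instruction.
# The grouping mirrors the description above; the names are invented here.
SHORT_SCOREBOARD_PRODUCERS = {"slow_math", "shared_memory"}       # stay inside the SM
LONG_SCOREBOARD_PRODUCERS = {"global_memory", "local_memory",
                             "texture_fetch"}                      # may leave the SM


def scoreboard_for(producer):
    """Return the stall reason reported on an instruction that waits for
    data from `producer`, or None if the text above does not cover it."""
    if producer in SHORT_SCOREBOARD_PRODUCERS:
        return "short_scoreboard"
    if producer in LONG_SCOREBOARD_PRODUCERS:
        return "long_scoreboard"
    return None


if __name__ == "__main__":
    for op in ("slow_math", "shared_memory", "global_memory", "texture_fetch"):
        print(op, "->", scoreboard_for(op))
```

Note that the stall is attributed to the dependent (consuming) instruction, not to the load or math instruction that produces the data.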
Detailed descriptions from the Nsight Compute v2020.3.1 Kernel Profiling Guide:
Long Scoreboard
Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, tex) operation. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality, or by changing the cache configuration, and consider moving frequently used data to shared memory.
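To make "memory access patterns are optimal" a bit more concrete, here is a rough back-of-the-envelope model of how many memory sectors one warp-wide load touches for a given stride. It assumes a warp of 32 threads and 32-byte sectors; the function name `sectors_touched` and its parameters are made up for this sketch, and it ignores caching and everything else the real hardware does.

```python
WARP_SIZE = 32      # threads per warp
SECTOR_BYTES = 32   # assumed sector granularity for this sketch


def sectors_touched(stride_elems, elem_bytes=4, base=0):
    """Count distinct sectors touched when lane i reads `elem_bytes` bytes
    at byte address base + i * stride_elems * elem_bytes."""
    sectors = set()
    for lane in range(WARP_SIZE):
        addr = base + lane * stride_elems * elem_bytes
        first = addr // SECTOR_BYTES
        last = (addr + elem_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print("stride", stride, "->", sectors_touched(stride), "sectors")
```

In this model a unit-stride access by 4-byte elements touches 4 sectors, while a stride of 8 elements or more touches 32, one per thread. More sectors per request generally means more L1TEX traffic for the same amount of useful data, which is one way a kernel ends up waiting longer on the long scoreboard.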
Short Scoreboard
Warp was stalled waiting for a scoreboard dependency on a MIO (memory input/output) operation (not to L1TEX). The primary reason for a high number of stalls due to short scoreboards is typically memory operations to shared memory. Other reasons include frequent execution of special math instructions (e.g. MUFU) or dynamic branching (e.g. BRX, JMX). Verify if there are shared memory operations and reduce bank conflicts, if applicable.
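For the bank conflict part, a quick way to reason about a strided shared memory access is to count how many distinct words land in the same bank. The sketch below assumes 32 banks of 4-byte words, with lanes that read the same word being served by a broadcast; the function name `conflict_degree` is made up for this example.

```python
from collections import defaultdict

WARP_SIZE = 32
NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide


def conflict_degree(stride_words):
    """Return the worst-case number of distinct words mapped to one bank
    when lane i accesses word index i * stride_words (1 = conflict-free)."""
    words_per_bank = defaultdict(set)
    for lane in range(WARP_SIZE):
        word = lane * stride_words
        # Lanes hitting the same word share one access, so store words in a set.
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print("stride", stride, "->", conflict_degree(stride), "-way")
```

Under these assumptions a stride of 32 words gives a 32-way conflict, while a stride of 33 words is conflict-free. That is the reasoning behind the common trick of padding a shared memory tile by one column.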
MIO vs. L1TEX
MIO and L1TEX are partitions in the NVIDIA SM. The MIO unit is responsible for shared execution units (shared by one or more SM sub-partitions), including lower-rate math units (e.g. double precision on a GeForce chip) and memory input/output. The memory subsystem contains the L1, the TEX unit, the shared memory unit, and other domain-specific (e.g. graphics) interfaces to the SM. The implementation of the MIO subsystem, including L1, TEX, and shared memory, varies greatly between Kepler, Maxwell-Pascal, and Volta-Ampere. SM sub-partitions (warp schedulers) issue instructions to shared execution units through instruction queues rather than by direct dispatch. For SM 7.0+ there are stall reasons (mio_throttle, lg_throttle, and tex_throttle) that occur if the instruction queues for those units are full.
What is included in the definition of MIO varies by architecture. L1TEX is technically in the MIO partition. L1TEX is complicated because it has two input interfaces: one for local/global operations and one for texture/surface operations.
The term MIO can be confusing. The term L1TEX can also be confusing given its two different interfaces. Although there are two interfaces, local/global and texture/surface operations share the same cache lookup stages, the same cache RAM, and the same SM-to-L2 interface, so for many metrics the term L1TEX is used to refer to the whole unit.