On the Intel Intrinsics Guide, most instructions also list values for latency and throughput. Example:
__m128i _mm_min_epi32
Performance
Architecture   Latency   Throughput
Haswell        1         0.5
Ivy Bridge     1         0.5
Sandy Bridge   1         0.5
Westmere       1         1
Nehalem        1         1
What exactly do these numbers mean? I guess higher latency means the instruction takes longer to execute, but does a throughput of 1 for Nehalem versus 0.5 for Ivy Bridge mean the instruction is faster on Nehalem?
Instruction throughput and latency: throughput is the number of cycles after issue that another instruction can begin execution; latency is the number of cycles after which the data is available for another operation.
On the other hand, Agner Fog uses the following definition for (reciprocal) throughput: "The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread."
Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. The latest versions of these manuals are always available from www.agner.org/optimize.
The "latency" for an instruction is how many clock cycles it takes the perform one instruction (how long does it take for the result to be ready for a dependent instruction to use it as an input). If you have a loop-carried dependency chain, you can add up the latency of the operations to find the length of the critical path.
If you have independent work in each loop iteration, out-of-order exec can overlap it. The length of each dependency chain (in latency cycles) tells you how hard out-of-order exec has to work to overlap multiple instances of that chain.
Normally throughput would be the number of instructions per clock cycle, but this is actually reciprocal throughput: the number of clock cycles per independent instruction start. So 0.5 clock cycles means that 2 independent instructions can start in one clock cycle (and with a latency of 1, each result is ready on the next clock cycle). Lower is better: Ivy Bridge's 0.5 is twice the throughput of Nehalem's 1, the opposite of what the question guessed.
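To make that concrete, here is a minimal C sketch (function names and array layout are my own, assuming a Haswell-like CPU where _mm_add_epi32 / paddd has 1c latency and 0.5c reciprocal throughput) contrasting a latency-bound loop with a throughput-bound one:

#include <immintrin.h>
#include <stddef.h>

/* Latency-bound: every _mm_add_epi32 depends on the previous one
   through "acc" (a loop-carried dependency chain), so at best one
   add can start per cycle; the 0.5c throughput goes unused. */
__m128i sum_one_chain(const __m128i *v, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++)
        acc = _mm_add_epi32(acc, v[i]);   /* critical path: ~n cycles */
    return acc;
}

/* Throughput-bound: two independent accumulators give out-of-order
   exec two separate chains to overlap, so both vector-add ports
   (port 1 and port 5 on Haswell) can stay busy: ~2 adds per cycle. */
__m128i sum_two_chains(const __m128i *v, size_t n)   /* assumes n even */
{
    __m128i a0 = _mm_setzero_si128();
    __m128i a1 = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 2) {
        a0 = _mm_add_epi32(a0, v[i]);     /* chain 0 */
        a1 = _mm_add_epi32(a1, v[i + 1]); /* chain 1 */
    }
    return _mm_add_epi32(a0, a1);         /* combine once at the end */
}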
Note that execution units are pipelined, all but the divider being fully pipelined (able to start a new instruction every clock cycle); that is why latency is separate from throughput (how often an independent operation can start). Many instructions are single-uop, so their reciprocal throughput is usually 1/n, where n is a small integer (the number of ports with an execution unit that can run that instruction).
Intel documents that here: https://software.intel.com/en-us/articles/measuring-instruction-latency-and-throughput
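A rough sketch of that measurement idea (my own code, not Intel's; a real benchmark would add serializing instructions, warm-up, and correction for the TSC not ticking at the core clock): time a long chain of dependent adds and divide by the chain length to approximate latency.

#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc with GCC/Clang */
#include <stdio.h>

int main(void)
{
    enum { N = 100000000 };
    __m128i x = _mm_set1_epi32(1);
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        x = _mm_add_epi32(x, x);   /* each add depends on the previous: one long latency chain */
    unsigned long long t1 = __rdtsc();
    volatile int sink = _mm_cvtsi128_si32(x);   /* keep the loop from being optimized away */
    (void)sink;
    printf("~%.2f reference ticks per dependent paddd\n", (double)(t1 - t0) / N);
    return 0;
}

To measure throughput instead, run several independent chains (x0 = add(x0,x0); x1 = add(x1,x1); ...) in the loop body and divide by the total number of adds.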
To find out whether two different instructions compete with each other for the same throughput resource, you need to consult a more detailed guide. For example, https://agner.org/optimize/ has instruction tables and a microarch guide. These go into detail about execution ports, and break down instructions into the three dimensions that matter: front-end cost in uops, which back-end ports, and latency.
For example, _mm_shuffle_epi8 and _mm_cvtsi32_si128 both run on port 5 on most Intel CPUs, so they compete for the same 1/clock throughput. But _mm_add_epi32 runs on port 1 or port 5 on Haswell, so its 0.5c throughput only partially competes with shuffles.
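A hypothetical loop where that port-5 competition bites (uop-to-port assignments as on Haswell; compile with SSSE3 enabled):

#include <immintrin.h>
#include <stddef.h>

/* Both intrinsics below become port-5-only uops on Haswell
   (movd r32->xmm and pshufb), so although each is 1/clock on its
   own, every iteration needs 2 port-5 cycles: the loop runs at
   ~2 cycles per iteration, bottlenecked on that single port. */
void widen_and_shuffle(__m128i *out, const int *in, __m128i ctrl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m128i s = _mm_cvtsi32_si128(in[i]);  /* movd: port 5 */
        out[i] = _mm_shuffle_epi8(s, ctrl);    /* pshufb: port 5 */
    }
}

Swap the pshufb for a paddd, which can also run on port 1, and the port-5 bottleneck eases; that is what "only partially competes" means above.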
https://uops.info/ has very detailed instruction tables from automated testing, including latency from each input separately to the output.
Agner Fog's tables are nice (compact and readable) but sometimes have typos or mistakes, and they give only a single latency number, so you don't always know which input formed the dependency chain.
See also "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?"