On the Intel Intrinsics Guide, most instructions also list values for latency and throughput. Example:
__m128i _mm_min_epi32
Performance
Architecture   Latency   Throughput
Haswell        1         0.5
Ivy Bridge     1         0.5
Sandy Bridge   1         0.5
Westmere       1         1
Nehalem        1         1
What exactly do these numbers mean? I guess higher latency means the instruction takes longer to execute, but does a throughput of 1 for Nehalem versus 0.5 for Ivy Bridge mean the instruction is faster on Nehalem?
Instruction throughput and latency: throughput is the number of cycles after issue that another instruction can begin execution; latency is the number of cycles after which the data is available for another operation.
On the other hand, Agner Fog uses the following definition for (reciprocal) throughput: "The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread."
Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. The latest versions of these manuals are always available from www.agner.org/optimize.
The "latency" for an instruction is how many clock cycles it takes the perform one instruction (how long does it take for the result to be ready for a dependent instruction to use it as an input). If you have a loop-carried dependency chain, you can add up the latency of the operations to find the length of the critical path.
If you have independent work in each loop iteration, out-of-order exec can overlap it. The length of each dependency chain (in latency cycles) tells you how hard out-of-order exec has to work to overlap multiple instances of that chain.
Normally throughput would be the number of instructions per clock cycle, but this is actually reciprocal throughput: the number of clock cycles per independent instruction start. So 0.5 clock cycles means that 2 independent instructions can start in one clock cycle (and with a latency of 1, each result is ready on the next clock cycle). Lower is better: Ivy Bridge's 0.5 is twice the throughput of Nehalem's 1, the opposite of what the question guessed.
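To make that concrete, here is a minimal C sketch (function names and array layout are my own, assuming a Haswell-like CPU where _mm_add_epi32 / paddd has 1c latency and 0.5c reciprocal throughput) contrasting a latency-bound loop with a throughput-bound one:

#include <immintrin.h>
#include <stddef.h>

/* Latency-bound: every _mm_add_epi32 depends on the previous one
   through "acc" (a loop-carried dependency chain), so at best one
   add can start per cycle; the 0.5c throughput goes unused. */
__m128i sum_one_chain(const __m128i *v, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++)
        acc = _mm_add_epi32(acc, v[i]);   /* critical path: ~n cycles */
    return acc;
}

/* Throughput-bound: two independent accumulators give out-of-order
   exec two separate chains to overlap, so both vector-add ports
   (port 1 and port 5 on Haswell) can stay busy: ~2 adds per cycle. */
__m128i sum_two_chains(const __m128i *v, size_t n)   /* assumes n even */
{
    __m128i a0 = _mm_setzero_si128();
    __m128i a1 = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 2) {
        a0 = _mm_add_epi32(a0, v[i]);     /* chain 0 */
        a1 = _mm_add_epi32(a1, v[i + 1]); /* chain 1 */
    }
    return _mm_add_epi32(a0, a1);         /* combine once at the end */
}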
Note that execution units are pipelined, all but the divider being fully pipelined (able to start a new instruction every clock cycle); that is why latency is separate from throughput (how often an independent operation can start). Many instructions are single-uop, so their reciprocal throughput is usually 1/n, where n is a small integer (the number of ports with an execution unit that can run that instruction).
Intel documents that here: https://software.intel.com/en-us/articles/measuring-instruction-latency-and-throughput
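A rough sketch of that measurement idea (my own code, not Intel's; a real benchmark would add serializing instructions, warm-up, and correction for the TSC not ticking at the core clock): time a long chain of dependent adds and divide by the chain length to approximate latency.

#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc with GCC/Clang */
#include <stdio.h>

int main(void)
{
    enum { N = 100000000 };
    __m128i x = _mm_set1_epi32(1);
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        x = _mm_add_epi32(x, x);   /* each add depends on the previous: one long latency chain */
    unsigned long long t1 = __rdtsc();
    volatile int sink = _mm_cvtsi128_si32(x);   /* keep the loop from being optimized away */
    (void)sink;
    printf("~%.2f reference ticks per dependent paddd\n", (double)(t1 - t0) / N);
    return 0;
}

To measure throughput instead, run several independent chains (x0 = add(x0,x0); x1 = add(x1,x1); ...) in the loop body and divide by the total number of adds.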
To find out whether two different instructions compete with each other for the same throughput resource, you need to consult a more detailed guide. For example, https://agner.org/optimize/ has instruction tables and a microarch guide. These go into detail about execution ports, and break down instructions into the three dimensions that matter: front-end cost in uops, which back-end ports, and latency.
For example, _mm_shuffle_epi8 and _mm_cvtsi32_si128 both run on port 5 on most Intel CPUs, so they compete for the same 1/clock throughput. But _mm_add_epi32 runs on port 1 or port 5 on Haswell, so its 0.5c throughput only partially competes with shuffles.
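A hypothetical loop where that port-5 competition bites (uop-to-port assignments as on Haswell; compile with SSSE3 enabled):

#include <immintrin.h>
#include <stddef.h>

/* Both intrinsics below become port-5-only uops on Haswell
   (movd r32->xmm and pshufb), so although each is 1/clock on its
   own, every iteration needs 2 port-5 cycles: the loop runs at
   ~2 cycles per iteration, bottlenecked on that single port. */
void widen_and_shuffle(__m128i *out, const int *in, __m128i ctrl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m128i s = _mm_cvtsi32_si128(in[i]);  /* movd: port 5 */
        out[i] = _mm_shuffle_epi8(s, ctrl);    /* pshufb: port 5 */
    }
}

Swap the pshufb for a paddd, which can also run on port 1, and the port-5 bottleneck eases; that is what "only partially competes" means above.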
https://uops.info/ has very detailed instruction tables from automated testing, including latency from each input separately to the output.
Agner Fog's tables are nice (compact and readable) but sometimes have typos or mistakes, and they give only a single latency number, so you don't always know which input formed the dependency chain.
See also "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?"