In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following:
Performance
Architecture Latency Throughput
Haswell 3 -
Ivy Bridge 1 -
Sandy Bridge 1 -
I found this difference puzzling. Is this difference because there are new instructions that replace these ones or something that compensates for it (which ones)? Does anyone know if Skylake changes this model further?
TL:DR: all lane-crossing shuffles / inserts / extracts have 3c latency on Haswell/Skylake, but 2c latency on SnB/IvB, according to Agner Fog's testing.
This is probably 1c in the execution unit + an unavoidable bypass delay of some sort, because the actual execution units in SnB through Broadwell have standardized latencies of 1, 3, or 5 cycles, never 2 or 4 cycles. (SKL makes some uops uops 4c, including FMA/ADDPS/MULPS).
(Note that on AMD CPUs that do AVX1 with 128b ALUs (e.g. Bulldozer/Piledriver/Steamroller), insert128/extract128 are much faster than shuffles like VPERM2F128.)
The intrinsics guide has bogus data sometimes. I assume it's meant to be for the reg-reg form of instructions, except in the case of the load intrinsics. Even when it's correct, the intrinsics guide doesn't give a very detailed picture of performance; see below for discussion of Agner Fog's tables/guides.
(One of my pet peeves with intrinsics is that it's hard to use PMOVZX
/ PMOVSX
as a load, because the only intrinsics provided take a __m128i
source, even though pmovzxbd
only loads 4B or 8B (ymm). It and/or broadcast-loads (_mm_set1_*
with AVX1/2) are great way to compress constants in memory. There should be intrinsics that take a const char*
(because that's allowed to alias anything)).
In this case, Agner Fog's measurements show that SnB/IvB have 2c latency for reg-reg vinsertf128
/vextractf128
, while his measurements for Haswell (3c latency, one per 1c tput) agree with Intel's table. So it's another case where the numbers in Intel's intrinsics guide are wrong. It's great for finding the right intrinsic, but not a good source for reliable performance numbers. It doesn't tell you anything about execution ports or total uops, and often omits even the throughput numbers. Latency is often not the limiting factor in vector integer code anyway. This is probably why Intel let the latencies increase for Haswell.
The reg-mem form is significantly different. vinsertf128 y,y,m,i
has lat/recip-tput of: IvB:4/1, Haswell/BDW:4/2, SKL:5/0.5. It's always a 2-uop instruction (fused domain), using one ALU uop. IDK why the throughput is so different. Maybe Agner tested slightly differently?
Interestingly, vextractf128 mem,reg, i
doesn't use any ALU uops. It's a 2-fused-domain-uop instruction that only uses the store-data and store-address ports, not the shuffle unit. (Agner Fog's table lists it as using one p015 uop on SnB, 0 on IvB. But even on SnB, doesn't have a mark in any specific column, so IDK which one is right.)
It's silly that vextractf128
wastes a byte on an immediate operand. I guess they didn't know they were going to use EVEX for the next vector length extension, and were preparing for the immediate to go from 0..3. But for AVX1/2, you should never use that instruction with the immediate = 0. Instead, just movups mem, xmm
or movaps xmm,xmm
. (I think compilers know this, and do that when you use the intrinsic with index = 0, like they do for _mm_extract_epi32
and so on (movd
).)
Latency is more often a factor in FP code, and Skylake is a monster for FP ALUs. They managed to drop the latency for FMA to 4 cycles, so mulps/addps/fma...ps are all 4c latency with one per 0.5c throughput. (Broadwell was mulps/addps = 3c latency, fma = 5c latency. Haswell was addps=3c latency, mul/fma=5c). Skylake dropped the separate add unit, so addps actually worsened from 3c to 4c, but with twice the throughput. (Haswell/BDW only did addps with one per 1c throughput, half that of mul/fma.) So using many vector accumulators is essential in most FP algorithms for keeping 8 or 10 FMAs in flight at once to saturate the throughput, if there's a loop-carried dependency. Otherwise if the loop body is small enough, out-of-order execution will have multiple iterations in flight at once.
Integer in-lane ops are typically only 1c latency, so you need a much smaller amount of parallelism to max out the throughput (and not be limited by latency).
vperm2f128
or AVX2 vpermps
are more expensive. Going through memory will cause a store-forwarding failure -> big latency for insert (2 narrow stores -> wide load), so it's obviously bad. Don't try to avoid vinsertf128
in cases where it's useful.
As always, try to use the cheapest instruction sequences possible. e.g. for a horizontal sum or other reduction, always reduce down to a 128b vector first, because cross-lane shuffles are slow. Usually it's just vextractf128
/ addps xmm
, then the usual horizontal 128b.
As Mysticial alluded to, Haswell and later have half the in-lane vector shuffle throughput of SnB/IvB for 128b vectors. SnB/IvB can pshufb
/ pshufd
with one per 0.5c throughput, but only one per 1c for shufps
(even the 128b version); same for other shuffles that have a ymm version in AVX1 (e.g. vpermilps
, which apparently exists only so FP load-and-shuffle can be done in one instruction). Haswell got rid of the 128b shuffle unit on port1 altogether, instead of widening it for AVX2.
Agner Fog's guides/insn tables were updated in December to include Skylake. See also the x86 tag wiki for more links. The reg,reg form has the same performance as in Haswell/Broadwell.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With