Why both? vperm2f128 (avx) vs vperm2i128 (avx2)

Tags: avx, intel, simd, avx2

AVX introduced the instruction vperm2f128 (exposed via the intrinsic _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256).

They both seem to do exactly the same thing, and their respective latencies and throughputs also appear to be identical.

So why do both instructions exist? There has to be some reasoning behind that. Is there maybe something I have overlooked? Given that AVX2 operates on data types introduced with AVX, I cannot imagine that a processor will ever exist that supports AVX2 but not AVX.

asked Dec 07 '18 by mSSM



1 Answer

There's a bit of a disconnect between the intrinsics and the actual instructions that are underneath.

AVX:

All 3 of these generate exactly the same instruction, vperm2f128:

  • _mm256_permute2f128_pd()
  • _mm256_permute2f128_ps()
  • _mm256_permute2f128_si256()

The only difference is the types - which don't exist at the instruction level.

vperm2f128 is a 256-bit floating-point instruction. In AVX, there are no "real" 256-bit integer SIMD instructions. So even though _mm256_permute2f128_si256() is an "integer" intrinsic, it's really just syntax sugar for this:

_mm256_castpd_si256(
    _mm256_permute2f128_pd(
        _mm256_castsi256_pd(x),
        _mm256_castsi256_pd(y),
        imm
    )
);

This does a round trip from the integer domain to the FP domain - thus incurring bypass delays. As ugly as this looks, it is the only way to do it in AVX-only land.

vperm2f128 isn't the only instruction to get this treatment; I can find at least 3 of them:

  • vperm2f128 / _mm256_permute2f128_si256()
  • vextractf128 / _mm256_extractf128_si256()
  • vinsertf128 / _mm256_insertf128_si256()

Together, it seems that the use case of these intrinsics is to load data as 256-bit integer vectors and shuffle them into multiple 128-bit integer vectors for integer computation. Likewise the reverse, where you store as 256-bit vectors.

Without these "hack" intrinsics, you would need to use a lot of cast intrinsics.

Either way, a competent compiler will try to optimize the types as well. Thus it will generate floating-point loads/stores and shuffles even if you are using 256-bit integer loads. This reduces the number of bypass delays to only one layer (when you go from the FP shuffle to 128-bit integer computation).


AVX2:

AVX2 cleans up this madness by adding proper 256-bit integer SIMD support for everything - including the shuffles.

The vperm2i128 instruction is new along with a new intrinsic for it, _mm256_permute2x128_si256().

This, along with _mm256_extracti128_si256() and _mm256_inserti128_si256() lets you do 256-bit integer SIMD and actually stay completely in the integer domain.


The distinction between the integer and FP versions of the same instructions has to do with bypass delays. In older processors, there were delays to move data between the int and FP domains. While the SIMD registers themselves are type-agnostic, the hardware implementation isn't, and there is extra latency to feed the output of an FP instruction into the input of an integer instruction (and vice versa).

Thus it was important (from a performance standpoint) to use the correct instruction type to match the actual datatype that was being operated on.

On the newest processors (Skylake and later?), there no longer seem to be any int/FP bypass delays for the shuffle instructions. While the instruction set still has this distinction, shuffle instructions that do the same thing with different "types" probably map to the same uop now.

answered Oct 03 '22 by Mysticial