I am porting SSE SIMD code to use the 256-bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.
The backing story:
What I really want is VHADDPS / _mm256_hadd_ps to act like HADDPS / _mm_hadd_ps, only with 256-bit words. Unfortunately, it acts like two calls to HADDPS operating independently on the low and high 128-bit halves.
Using VPERM2F128, one can swap the low 128 and the high 128 bits (as well as perform other permutations). The intrinsic usage looks like
x = _mm256_permute2f128_ps(x, x, 1);
The third argument is a control word which gives the user a lot of flexibility. See the Intel Intrinsics Guide for details.
x = _mm256_permute4x64_epi64(x, 0b01'00'11'10);
Note: This instruction needs AVX2 (not just AVX1).
As @PeterCordes commented, speed-wise on Zen 2 / Zen 3 CPUs _mm256_permute2x128_si256(x, x, i) is the best option, even though it takes 3 arguments versus the 2 of the _mm256_permute4x64_epi64(x, i) I suggested. On Zen 1 and KNL/KNM (and Bulldozer-family Excavator), _mm256_permute4x64_epi64(x, i) is more efficient. On other CPUs (including mainstream Intel), both choices are equal.
As already said, both _mm256_permute2x128_si256(x, y, i) and _mm256_permute4x64_epi64(x, i) need AVX2, while _mm256_permute2f128_si256(x, y, i) needs just AVX1.