How can I exchange the low 128 bits and high 128 bits in a 256 bit AVX (YMM) register

I am porting SSE SIMD code to use the 256 bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.

The back story:

What I really want is for VHADDPS/_mm256_hadd_ps to act like HADDPS/_mm_hadd_ps, only on 256-bit operands. Unfortunately, it behaves like two HADDPS operations acting independently on the low and high 128-bit halves.

Mark Borgerding asked Aug 26 '11 20:08


2 Answers

Using VPERM2F128, one can swap the low 128 bits and the high 128 bits (as well as perform other permutations). The intrinsic function usage looks like

x = _mm256_permute2f128_ps(x, x, 1);

The third argument is a control word which gives the user a lot of flexibility. See the Intel Intrinsics Guide for details.
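
As a minimal sketch of the permute described above, the following C code swaps the two 128-bit halves and, assuming the goal from the question is a horizontal add whose output is ordered like a 256-bit-wide HADDPS, regroups the halves before calling _mm256_hadd_ps. The function names swap_halves_ps and hadd_full_ps are hypothetical, and the target attribute is a GCC/Clang extension used here only so the file builds without -mavx; the CPU must still support AVX at run time.

```c
#include <immintrin.h>

/* Hypothetical helper: swap the low and high 128-bit halves of 8 floats.
   Control word 0x01:
     bits [1:0] = 1 -> low half of the result  = high half of the first source
     bits [5:4] = 0 -> high half of the result = low half of the first source */
__attribute__((target("avx")))
void swap_halves_ps(const float in[8], float out[8])
{
    __m256 x = _mm256_loadu_ps(in);
    x = _mm256_permute2f128_ps(x, x, 0x01);
    _mm256_storeu_ps(out, x);
}

/* Hypothetical helper: horizontal add ordered like a 256-bit-wide HADDPS,
   i.e. out = { a0+a1, a2+a3, a4+a5, a6+a7, b0+b1, b2+b3, b4+b5, b6+b7 }.
   _mm256_hadd_ps works on each 128-bit half separately, so the inputs are
   regrouped first:
     0x20 -> { low half of a,  low half of b  }
     0x31 -> { high half of a, high half of b } */
__attribute__((target("avx")))
void hadd_full_ps(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 lo = _mm256_permute2f128_ps(va, vb, 0x20);
    __m256 hi = _mm256_permute2f128_ps(va, vb, 0x31);
    _mm256_storeu_ps(out, _mm256_hadd_ps(lo, hi));
}
```

With in = {0, 1, 2, 3, 4, 5, 6, 7}, swap_halves_ps writes {4, 5, 6, 7, 0, 1, 2, 3}. The regrouping in hadd_full_ps costs two extra permutes per call; whether that is acceptable depends on the surrounding code.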

Mark Borgerding answered Oct 04 '22 14:10


x = _mm256_permute4x64_epi64(x, 0b01'00'11'10); 

Read about it here, and try it online!

Note: This instruction needs AVX2 (not just AVX1).

As commented by @PeterCordes, speed-wise on Zen2 / Zen3 CPUs _mm256_permute2x128_si256(x, x, i) is the best option, even though it takes three arguments compared with the two of _mm256_permute4x64_epi64(x, i), which I suggested. On Zen1 and KNL/KNM (and Bulldozer-family Excavator), _mm256_permute4x64_epi64(x, i) is more efficient. On other CPUs (including mainstream Intel), the two choices perform equally.

As already said, both _mm256_permute2x128_si256(x, y, i) and _mm256_permute4x64_epi64(x, i) need AVX2, while _mm256_permute2f128_si256(x, y, i) needs only AVX1.
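
To make the comparison concrete, here is a small sketch in C that performs the same half swap with each of the three intrinsics named above. The function names are hypothetical, the binary control word from the snippet is written as 0x4E (the same value) so it compiles as plain C, and the target attributes are a GCC/Clang extension assumed here so that no -mavx/-mavx2 flag is needed; the CPU must still support the instruction set at run time.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX2: 0x4E == 0b01001110, so result qwords 0..3 are taken from
   source qwords 2, 3, 0, 1 (two bits of the control word per result qword). */
__attribute__((target("avx2")))
void swap_halves_permute4x64(const int64_t in[4], int64_t out[4])
{
    __m256i x = _mm256_loadu_si256((const __m256i *)in);
    x = _mm256_permute4x64_epi64(x, 0x4E);
    _mm256_storeu_si256((__m256i *)out, x);
}

/* AVX2: control word 0x01 puts the high half of the first source in the
   low half of the result and the low half of the first source in the high half. */
__attribute__((target("avx2")))
void swap_halves_permute2x128(const int64_t in[4], int64_t out[4])
{
    __m256i x = _mm256_loadu_si256((const __m256i *)in);
    x = _mm256_permute2x128_si256(x, x, 0x01);
    _mm256_storeu_si256((__m256i *)out, x);
}

/* AVX1 only: same control word, using the VPERM2F128 form on integer data. */
__attribute__((target("avx")))
void swap_halves_permute2f128(const int64_t in[4], int64_t out[4])
{
    __m256i x = _mm256_loadu_si256((const __m256i *)in);
    x = _mm256_permute2f128_si256(x, x, 0x01);
    _mm256_storeu_si256((__m256i *)out, x);
}
```

All three functions map {10, 11, 12, 13} to {12, 13, 10, 11}; they differ only in the instruction set required and in per-CPU cost, as discussed above.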

Arty answered Oct 04 '22 14:10