
Reverse an AVX register containing doubles using a single AVX intrinsic

If I have an AVX register with 4 doubles in it and I want to store the reverse of it in another register, is it possible to do this with a single intrinsic?

For example: If I had 4 floats in an SSE register, I could use:

_mm_shuffle_ps(A,A,_MM_SHUFFLE(0,1,2,3));

Can I do this with, say, _mm256_permute2f128_pd()? I don't think that intrinsic can address each individual double.

user1715122 asked Dec 07 '22 11:12



2 Answers

With AVX alone, you actually need 2 permutes to do this:

  • _mm256_permute2f128_pd() only permutes in 128-bit chunks.
  • _mm256_permute_pd() does not permute across 128-bit boundaries.

So you need to use both:

inline __m256d reverse(__m256d x){
    x = _mm256_permute2f128_pd(x,x,1);  // swap the two 128-bit halves
    x = _mm256_permute_pd(x,5);         // 0b0101: swap the two doubles inside each half
    return x;
}

Test:

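#include <immintrin.h>
#include <iostream>
using namespace std;

// reverse() from above goes here, after the includes.
int main(){
    __m256d x = _mm256_set_pd(13,12,11,10);  // highest element first: element 0 is 10

    // m256d_f64 is an MSVC-specific member; with GCC/Clang, store to an array with _mm256_storeu_pd instead.
    cout << x.m256d_f64[0] << "  " << x.m256d_f64[1] << "  " << x.m256d_f64[2] << "  " << x.m256d_f64[3] << endl;
    x = reverse(x);
    cout << x.m256d_f64[0] << "  " << x.m256d_f64[1] << "  " << x.m256d_f64[2] << "  " << x.m256d_f64[3] << endl;
}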

Output:

10  11  12  13
13  12  11  10
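
To see why the immediates 1 and 5 work, here is a rough scalar model of the two permutes in plain C++. It is only a sketch for illustration: the names Vec4, model_permute2f128, model_permute_pd and model_reverse are made up here and are not part of any intrinsics header, and the model ignores the zeroing bits of vperm2f128.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<double, 4>;   // hypothetical type; element 0 is the lowest double

// Scalar model of _mm256_permute2f128_pd(a, b, imm).
// imm[1:0] selects the low 128-bit half of the result, imm[5:4] the high half:
// 0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high.
// The zeroing bits (imm bits 3 and 7) are not modelled and must be clear.
Vec4 model_permute2f128(const Vec4& a, const Vec4& b, int imm) {
    assert((imm & 0x88) == 0);
    auto half = [&](int sel, int i) {
        const Vec4& src = (sel & 2) ? b : a;
        return src[(sel & 1) * 2 + i];
    };
    const int lo = imm & 3;
    const int hi = (imm >> 4) & 3;
    return { half(lo, 0), half(lo, 1), half(hi, 0), half(hi, 1) };
}

// Scalar model of _mm256_permute_pd(a, imm).
// Bit i of imm picks the low (0) or high (1) double of the SAME 128-bit lane
// for result element i, so data never crosses the lane boundary.
Vec4 model_permute_pd(const Vec4& a, int imm) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) {
        const int lane = i / 2;
        r[i] = a[lane * 2 + ((imm >> i) & 1)];
    }
    return r;
}

// The two-step reverse: {x0,x1,x2,x3} -> {x2,x3,x0,x1} -> {x3,x2,x1,x0}.
Vec4 model_reverse(const Vec4& x) {
    const Vec4 swapped = model_permute2f128(x, x, 1);
    return model_permute_pd(swapped, 5);
}
```

Calling model_reverse on {10, 11, 12, 13} gives {13, 12, 11, 10}, matching the output above. Neither step alone can do it: the first only moves whole 128-bit halves, the second only swaps within a half.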
Mysticial answered Dec 09 '22 00:12



Support for lane-crossing shuffles with granularity finer than 128 bits was new with AVX2:

_mm256_permute4x64_pd(vec, _MM_SHUFFLE(0,1,2,3));  // i.e. 0b00011011

VPERMPD ymm1, ymm2/m256, imm8 runs with the same throughput and latency as other lane-crossing shuffles (like VPERM2F128) on Intel CPUs. The intrinsic is also listed in the intrinsics finder.
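
As a rough illustration of how that immediate is decoded, here is a scalar model of vpermpd in plain C++. The names Vec4, shuffle_imm and model_permute4x64 are hypothetical helpers for this sketch; shuffle_imm is assumed to mirror the encoding of the _MM_SHUFFLE macro.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<double, 4>;   // hypothetical type; element 0 is the lowest double

// Same bit layout as _MM_SHUFFLE(z, y, x, w): (z << 6) | (y << 4) | (x << 2) | w.
constexpr int shuffle_imm(int z, int y, int x, int w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Scalar model of _mm256_permute4x64_pd(a, imm):
// result element i is a[(imm >> (2 * i)) & 3], and the index may name
// any of the four doubles, so the shuffle can cross the 128-bit lanes.
Vec4 model_permute4x64(const Vec4& a, int imm) {
    assert(imm >= 0 && imm <= 0xFF);
    Vec4 r{};
    for (int i = 0; i < 4; ++i) {
        r[i] = a[(imm >> (2 * i)) & 3];
    }
    return r;
}

// _MM_SHUFFLE(0,1,2,3) encodes to 0b00011011, i.e. 0x1B.
static_assert(shuffle_imm(0, 1, 2, 3) == 0x1B, "reverse immediate");
```

With imm = 0x1B the four 2-bit fields read 3, 2, 1, 0 from the lowest bits up, so element 0 of the result comes from element 3 of the source, and so on: a full reverse in one shuffle.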

On AMD Zen1 (and Excavator), vpermpd is faster than the 2-input vperm2f128. Their vector ALUs are internally only 128 bits wide; 256-bit vector instructions are decoded into at least 2 uops, and lane-crossing operations take more, especially one that can read any of 4 total lanes. (Unfortunately, the decoders don't just look at the immediate bits when picking uops for vperm2f128.) Manual vextractf128 / vinsertf128 would be better than vperm2f128 on Bulldozer-family and Zen1, but that would be quite bad everywhere else; see https://uops.info/. I think vpermpd is optimal on Excavator / Zen1: 3 uops vs. at least 4 to reverse in-lane and then swap halves with vextracti128 / vinserti128.


There are a few CPUs with FMA3 but not AVX2, e.g. AMD Piledriver and Steamroller. On Intel, AVX2 and FMA were both new with Haswell. AMD Bulldozer-family is obsolete but still around in home computers. So if your function takes advantage of AVX1 + FMA, your options are either to also require AVX2 and have those few CPUs fall back to something even worse (e.g. AVX1 without FMA), or to make yet another version of your function.
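
One way to keep several versions side by side is runtime dispatch. The following is a minimal sketch, assuming GCC or Clang on x86 (it relies on the target function attribute and on __builtin_cpu_supports, neither of which exists in MSVC). The function names reverse4, reverse4_avx2, reverse4_avx and reverse4_scalar are hypothetical; the kernels take pointers so that no __m256d value crosses a function boundary between code compiled for different targets.

```cpp
#include <immintrin.h>
#include <cassert>

// AVX2 version: a single lane-crossing shuffle (vpermpd).
__attribute__((target("avx2")))
static void reverse4_avx2(const double* in, double* out) {
    __m256d v = _mm256_loadu_pd(in);
    _mm256_storeu_pd(out, _mm256_permute4x64_pd(v, 0x1B));
}

// AVX1 version: swap the 128-bit halves, then swap within each half.
__attribute__((target("avx")))
static void reverse4_avx(const double* in, double* out) {
    __m256d v = _mm256_loadu_pd(in);
    v = _mm256_permute2f128_pd(v, v, 1);
    v = _mm256_permute_pd(v, 5);
    _mm256_storeu_pd(out, v);
}

// Scalar fallback for CPUs (or OSes) without usable AVX.
static void reverse4_scalar(const double* in, double* out) {
    const double tmp[4] = { in[3], in[2], in[1], in[0] };  // copy first so in == out is safe
    for (int i = 0; i < 4; ++i) out[i] = tmp[i];
}

// Picks the best available version at run time.
void reverse4(const double* in, double* out) {
    assert(in != nullptr && out != nullptr);
    if (__builtin_cpu_supports("avx2")) {
        reverse4_avx2(in, out);
    } else if (__builtin_cpu_supports("avx")) {
        reverse4_avx(in, out);
    } else {
        reverse4_scalar(in, out);
    }
}
```

Checking CPU features on every 4-element shuffle would cost far more than the shuffle itself, so in real code the dispatch would sit around a whole loop or function (for example, one that also uses FMA), not around a single reverse.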

Peter Cordes answered Dec 08 '22 23:12
