If I have an AVX register with 4 doubles in it and I want to store the reverse of this in another register, is it possible to do this with a single intrinsic?
For example: If I had 4 floats in a SSE register, I could use:
_mm_shuffle_ps(A,A,_MM_SHUFFLE(0,1,2,3));
Can I do this using, maybe, _mm256_permute2f128_pd()? I don't think you can address each individual double using the above intrinsic.
You actually need 2 permutes to do this:
_mm256_permute2f128_pd() only permutes in 128-bit chunks.
_mm256_permute_pd() does not permute across 128-bit boundaries.
So you need to use both:
#include <immintrin.h>

inline __m256d reverse(__m256d x){
    x = _mm256_permute2f128_pd(x,x,1);   // swap the two 128-bit halves
    x = _mm256_permute_pd(x,5);          // swap the two doubles within each half
    return x;
}
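Test (the m256d_f64 member is MSVC-specific; with GCC or Clang, store to a double[4] with _mm256_storeu_pd instead):
#include <iostream>
using namespace std;
int main(){
    __m256d x = _mm256_set_pd(13,12,11,10);   // arguments run from the highest element to the lowest
    cout << x.m256d_f64[0] << " " << x.m256d_f64[1] << " " << x.m256d_f64[2] << " " << x.m256d_f64[3] << endl;
    x = reverse(x);
    cout << x.m256d_f64[0] << " " << x.m256d_f64[1] << " " << x.m256d_f64[2] << " " << x.m256d_f64[3] << endl;
}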
Output:
10 11 12 13
13 12 11 10
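To see why the immediates 1 and 5 give a full reverse, it can help to model the two instructions in plain scalar C++. The following is only a sketch of the documented immediate decoding, not a replacement for the intrinsics; the names Vec4, model_permute2f128, model_permute_pd and model_reverse_avx1 are hypothetical helpers introduced for illustration.

```cpp
#include <array>

// Hypothetical scalar stand-in for a __m256d; index 0 is the lowest 64 bits.
using Vec4 = std::array<double, 4>;

// Model of _mm256_permute2f128_pd(a, b, imm): each 128-bit half of the result
// is chosen by a 4-bit control field (bits 3:0 for the low half, 7:4 for the high half).
Vec4 model_permute2f128(const Vec4& a, const Vec4& b, unsigned imm) {
    auto pick_half = [&](unsigned ctrl, double* dst) {
        if (ctrl & 0x8) {                        // bit 3 of the field zeroes the half
            dst[0] = dst[1] = 0.0;
            return;
        }
        const Vec4& src = (ctrl & 0x2) ? b : a;  // bit 1 selects the source register
        unsigned base = (ctrl & 0x1) ? 2 : 0;    // bit 0 selects its high or low half
        dst[0] = src[base];
        dst[1] = src[base + 1];
    };
    Vec4 r{};
    pick_half(imm & 0xF, &r[0]);
    pick_half((imm >> 4) & 0xF, &r[2]);
    return r;
}

// Model of _mm256_permute_pd(a, imm): bit i of imm picks the low (0) or high (1)
// double of the 128-bit lane that result element i lives in.
Vec4 model_permute_pd(const Vec4& a, unsigned imm) {
    Vec4 r{};
    for (unsigned i = 0; i < 4; ++i) {
        unsigned lane_base = i & ~1u;            // 0 for elements 0-1, 2 for elements 2-3
        r[i] = a[lane_base + ((imm >> i) & 1u)];
    }
    return r;
}

// The two-step AVX1 reverse from the answer, expressed with the models.
Vec4 model_reverse_avx1(const Vec4& x) {
    Vec4 swapped = model_permute2f128(x, x, 1);  // {x2, x3, x0, x1}
    return model_permute_pd(swapped, 5);         // {x3, x2, x1, x0}
}
```

Feeding {10, 11, 12, 13} through model_reverse_avx1 goes via the intermediate {12, 13, 10, 11}: the first step moves whole 128-bit halves, and the second swaps neighbours inside each half.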
Support for lane-crossing shuffles with granularity finer than 128 bits was new with AVX2:
_mm256_permute4x64_pd(vec, _MM_SHUFFLE(0,1,2,3)); // i.e. 0b00011011
VPERMPD ymm1, ymm2/m256, imm8 runs with the same throughput and latency as other lane-crossing shuffles (like VPERM2F128) on Intel CPUs. It is also listed in the intrinsics finder.
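The immediate for this shuffle packs four 2-bit source indices, one per result element, lowest element first. As a rough scalar sketch of that decoding (make_shuffle_imm and model_permute4x64 are hypothetical names that mirror what _MM_SHUFFLE and _mm256_permute4x64_pd do; they are not library functions):

```cpp
#include <array>

// Hypothetical scalar stand-in for a __m256d; index 0 is the lowest 64 bits.
using Vec4 = std::array<double, 4>;

// Same bit layout as the _MM_SHUFFLE(z, y, x, w) macro:
// z selects the source for result element 3, w for result element 0.
constexpr unsigned make_shuffle_imm(unsigned z, unsigned y, unsigned x, unsigned w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Model of _mm256_permute4x64_pd(a, imm): result element i is copied from
// the source element named by bits [2*i+1 : 2*i] of the immediate.
Vec4 model_permute4x64(const Vec4& a, unsigned imm) {
    Vec4 r{};
    for (unsigned i = 0; i < 4; ++i) {
        r[i] = a[(imm >> (2 * i)) & 3u];
    }
    return r;
}
```

With make_shuffle_imm(0, 1, 2, 3), which equals 0b00011011, result element 0 reads source element 3, element 1 reads element 2, and so on, so a single instruction performs the whole reverse.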
On AMD Zen1 (and Excavator), vpermpd is faster than the 2-input vperm2f128. Their vector ALUs are internally only 128 bits wide; 256-bit vector instructions are decoded into at least 2 uops, and lane-crossing operations take more, especially one that can read any of 4 total lanes. (Unfortunately, the decoders don't just look at the immediate bits when picking uops for vperm2f128.) Manual vextractf128 / vinsertf128 would be better than vperm2f128 on Bulldozer-family and Zen1, but that would be quite bad everywhere else (see https://uops.info/). I think vpermpd is optimal on Excavator / Zen1: 3 uops vs. at least 4 to reverse in-lane and then swap halves with vextracti128 / vinserti128.
There are a few CPUs with FMA3 but not AVX2, e.g. AMD Piledriver and Steamroller. On Intel, AVX2 and FMA were both new with Haswell. AMD Bulldozer-family is obsolete but still around in home computers, so if your function otherwise needs only AVX1 + FMA, your options are either to also require AVX2 and have those few CPUs fall back to something even worse (e.g. AVX1 without FMA), or to make yet another version of your function.
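One way to keep both variants around is to choose between them at run time. The sketch below assumes GCC or Clang on x86, where __attribute__((target(...))) and __builtin_cpu_supports are available; on other compilers or architectures it simply uses a scalar loop. The function names (reverse4, reverse4_avx2, reverse4_avx1, reverse4_scalar) are hypothetical, and it handles only the reverse itself, not a full FMA kernel.

```cpp
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1

// AVX2 path: a single lane-crossing vpermpd.
__attribute__((target("avx2")))
static void reverse4_avx2(const double* in, double* out) {
    __m256d v = _mm256_loadu_pd(in);
    v = _mm256_permute4x64_pd(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm256_storeu_pd(out, v);
}

// AVX1 path: swap the 128-bit halves, then swap within each half.
__attribute__((target("avx")))
static void reverse4_avx1(const double* in, double* out) {
    __m256d v = _mm256_loadu_pd(in);
    v = _mm256_permute2f128_pd(v, v, 1);
    v = _mm256_permute_pd(v, 5);
    _mm256_storeu_pd(out, v);
}
#endif

// Portable fallback; copies first so that in and out may be the same array.
static void reverse4_scalar(const double* in, double* out) {
    const double t[4] = {in[0], in[1], in[2], in[3]};
    for (int i = 0; i < 4; ++i) {
        out[i] = t[3 - i];
    }
}

// Reverses 4 doubles, picking the best path the running CPU supports.
void reverse4(const double* in, double* out) {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) { reverse4_avx2(in, out); return; }
    if (__builtin_cpu_supports("avx"))  { reverse4_avx1(in, out); return; }
#endif
    reverse4_scalar(in, out);
}
```

In real code the feature check would normally be done once per large function or via a function pointer, not once per 4-element shuffle; checking at this granularity is only to keep the example short.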