Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is ok.
So far I'm using the following code, but the profiler says it's slow on Ryzen 1800X:
// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
// ...
// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(
    _mm256_permutevar8x32_epi32(x, gHigh32Permute)); // This seems to take 3 cycles
On Intel, your code would be optimal. One 1-uop instruction is the best you will get. (Except you might want to use vpermps to avoid any risk of an int / FP bypass delay, if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to integer instructions is usually fine on Intel, but I'm less sure about feeding the result of an FP instruction to an integer shuffle.)
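For reference, a minimal sketch of that FP-domain variant, reusing the question's gHigh32Permute constant (the casts are free; they only reinterpret the register type):

__m256 xf = _mm256_castsi256_ps(x); // reinterpret as packed float, no instruction emitted
// Same lane-crossing shuffle as the question, but done with vpermps instead of vpermd
__m128 high32f = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(xf, gHigh32Permute));
__m128i high32 = _mm_castps_si128(high32f); // back to the integer domain if needed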
Although if tuning for Intel, you might try changing the surrounding code so you can shuffle into the bottom 64 bits of each 128b lane, to avoid using a lane-crossing shuffle. (Then you could just use vshufps ymm, or, if tuning for KNL, vpermilps since 2-input vshufps is slower.)
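A sketch of that in-lane approach, assuming you can arrange for two __m256d sources x and y (y is hypothetical here) and the consumer can accept per-lane ordering:

// One 2-input in-lane shuffle: lane 0 of the result holds the odd (high-half)
// elements from lane 0 of x and y, lane 1 holds those from lane 1. No lane crossing.
__m256 a = _mm256_castpd_ps(x);
__m256 b = _mm256_castpd_ps(y);
__m256 odd_in_lanes = _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3,1,3,1));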
With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd), which packs elements across lanes, with truncation.
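A sketch, assuming AVX-512VL is available and x is the __m256i from the question (getting the high halves via a shift is my addition, not part of the single-instruction pack):

// vpmovqd: truncate each 64-bit element to 32 bits and pack the results into xmm
__m128i low32  = _mm256_cvtepi64_epi32(x);                        // low 32-bit halves
__m128i high32 = _mm256_cvtepi64_epi32(_mm256_srli_epi64(x, 32)); // high 32-bit halves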
On Ryzen, lane-crossing shuffles are slow. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput.
vextractf128 xmm, ymm, 1 is very efficient on Ryzen (1c latency, 0.33c throughput), which is not surprising since it already tracks 256b registers as two 128b halves. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.
This also saves you the 2 registers for the 2 vpermps shuffle masks you don't need anymore.
So I'd suggest:
__m256d x = /* computed here */;
// Tuned for Ryzen. Sub-optimal on Intel
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));
__m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1)); // high 32-bit halves
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0)); // low 32-bit halves
On Intel, using 3 shuffles instead of 2 gives you 2/3rds of the optimal throughput, with 1c extra latency for the first result.
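If your data is actually in a __m256i as in the question, the same idea carries over through free casts; a sketch (variable names are mine, and the usual caveat about mixing integer data with FP shuffles applies):

__m256i xi = /* computed here */;
// vextracti128 grabs the high half; the casts are free; the two shufps do the packing
__m128 lo = _mm_castsi128_ps(_mm256_castsi256_si128(xi));
__m128 hi = _mm_castsi128_ps(_mm256_extracti128_si256(xi, 1));
__m128i high32 = _mm_castps_si128(_mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1)));
__m128i low32  = _mm_castps_si128(_mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0)));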