
Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Is there an intrinsic or another efficient way to repack the high/low 32-bit halves of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is ok.

So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:

// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

// ...

// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(
  _mm256_permutevar8x32_epi32(x, gHigh32Permute)); // This seems to take 3 cycles
asked Aug 24 '17 by Serge Rogatch

1 Answer

On Intel, your code would be optimal. A single 1-uop instruction is the best you will get. (Except you might want to use vpermps to avoid any risk of an int / FP bypass delay, if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to integer instructions is usually fine on Intel, but I'm less sure about feeding the result of an FP instruction to an integer shuffle.)
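
A minimal sketch of that FP-domain variant (my illustration, not code from the original post), assuming the vector comes from a packed-float/double computation:

#include <immintrin.h>

// Same permute as the question's code, but via vpermps so the shuffle stays
// in the FP domain.  Requires AVX2.  Helper name is hypothetical.
static inline __m128 high32_fp(__m256 x)
{
    const __m256i idx = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    return _mm256_castps256_ps128(_mm256_permutevar8x32_ps(x, idx)); // vpermps
}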

Although if tuning for Intel, you might try changing the surrounding code so you can shuffle into the bottom 64 bits of each 128-bit lane, to avoid using a lane-crossing shuffle. (Then you could just use vshufps ymm or, if tuning for KNL, vpermilps, since 2-input vshufps is slower there.)
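
As a rough sketch of that idea (again my illustration, under the assumption that the surrounding code only needs the odd elements in the low 64 bits of each lane):

#include <immintrin.h>

// In-lane only: after this, lane 0 holds { x1, x3, x1, x3 } and lane 1 holds
// { x5, x7, x5, x7 }, so the odd elements sit in the low 64 bits of each
// 128-bit lane.  Compiles to a single vshufps ymm.
static inline __m256 odd_in_low64_of_each_lane(__m256 x)
{
    return _mm256_shuffle_ps(x, x, _MM_SHUFFLE(3, 1, 3, 1));
}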

With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd) which packs elements across lanes, with truncation.
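
A hedged sketch of how that could look (requires AVX512F + AVX512VL; the shift to get the odd elements is my addition, not part of the original answer):

#include <immintrin.h>

// vpmovqd keeps the low 32 bits of each 64-bit element (the even elements);
// shifting right by 32 first yields the high halves (the odd elements).
static inline __m128i even32_avx512(__m256i x) { return _mm256_cvtepi64_epi32(x); }
static inline __m128i odd32_avx512(__m256i x)  { return _mm256_cvtepi64_epi32(_mm256_srli_epi64(x, 32)); }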


On Ryzen, lane-crossing shuffles are slow. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput.

vextractf128 xmm, ymm, 1 is very efficient on Ryzen (1c latency, 0.33c throughput), not surprising since it tracks 256b registers as two 128b halves already. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.

This also saves you 2 registers for the 2 vpermps shuffle masks you don't need anymore.

So I'd suggest:

__m256d x = /* computed here */;

// Tuned for Ryzen.  Sub-optimal on Intel (3 shuffles instead of 2).
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));     // upper 128-bit half of x
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));       // lower 128-bit half of x
__m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1)); // high 32-bit halves: elements 1,3,5,7
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0)); // low 32-bit halves: elements 0,2,4,6

On Intel, using 3 shuffles instead of 2 gives you 2/3rds of the optimal throughput, with 1c extra latency for the first result.
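
For reference, a small self-contained test of the suggested sequence (my harness, not from the answer); with input floats 0..7 it should print the odd elements 1 3 5 7 and the even elements 0 2 4 6:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    const float in[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    __m256d x = _mm256_castps_pd(_mm256_loadu_ps(in)); // stand-in for the computed vector

    __m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
    __m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));
    __m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1));
    __m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0));

    float o[4], e[4];
    _mm_storeu_ps(o, odd);
    _mm_storeu_ps(e, even);
    printf("odd:  %g %g %g %g\n", o[0], o[1], o[2], o[3]);   // expect 1 3 5 7
    printf("even: %g %g %g %g\n", e[0], e[1], e[2], e[3]);   // expect 0 2 4 6
    return 0;
}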

answered by Peter Cordes