 

What is the inverse of "_mm256_cvtepi16_epi32"

I want an AVX2 (or earlier) intrinsic that will convert an 8-wide 32-bit integer vector (256 bits total) into an 8-wide 16-bit integer vector (128 bits total), discarding the upper 16 bits of each element. This should be the inverse of "_mm256_cvtepi16_epi32". If there is no direct instruction, how should I best do this with a sequence of instructions?

asked Apr 08 '18 by Steve Burns

1 Answer

There is no single-instruction inverse until AVX512F. __m128i _mm256_cvtepi32_epi16(__m256i a) (VPMOVDW), also available for 512->256 or 128->low_half_of_128. (The versions with inputs smaller than a 512-bit ZMM register also require AVX512VL, so only Skylake-X, not Xeon Phi KNL).

There are signed/unsigned saturation versions of that AVX512 instruction, but only AVX512 has a pack instruction that truncates (discarding the upper bytes of each element) instead of saturating.
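With AVX512VL available, the whole job is one intrinsic. A minimal sketch (the helper name is mine; the `target` attribute is only there so it compiles without global `-mavx512vl` flags):

```c
#include <immintrin.h>

// AVX512F + AVX512VL: vpmovdw ymm -> xmm, truncating each 32-bit element to 16 bits
__attribute__((target("avx512f,avx512vl")))
static __m128i trunc32to16_avx512(__m256i v)
{
    return _mm256_cvtepi32_epi16(v);
}
```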

Or with AVX512BW, you could emulate a lane-crossing 2-input pack using vpermi2w to produce a 512-bit result from two 512-bit input vectors. On Skylake-AVX512, it decodes to multiple shuffle uops, but so does VPMOVDW, which is also a lane-crossing shuffle with granularity less than dword (32-bit). http://instlatx64.atw.hu/ has a spreadsheet of SKX uops / ports.


The SSE2/AVX2 pack instructions like _mm256_packus_epi32 (vpackusdw) do signed or unsigned saturation, as well as operating within each 128-bit lane. This is unlike the lane-crossing behaviour of vpmovzxwd.

You could use _mm256_and_si256 to clear the upper 16 bits of each element before packing, though, so the unsigned saturation never kicks in. That could be good if you have multiple input vectors, because vpackusdw takes 2 input vectors and produces a 256-bit output.

a = H G F E | D C B A    32-bit signed elements, shown from high element to low element, low 128-bit lane on the right
b = P O N M | L K J I

_mm256_packus_epi32(a, b)   16-bit unsigned elements
    P O N M H G F E  |  L K J I D C B A
      elements from first operand go to the low half of each lane

If you can make efficient use of 2x vpand / vpackuswd ymm / vpermq ymm to get a 256-bit vector with all the elements in the right order, then that's probably best on Intel CPUs. Only 2 shuffle uops (4 total uops) per 256 bits of results, and you get them in a single vector.
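The vpand / vpackusdw / vpermq sequence above can be sketched like this (the helper name is mine; assumes AVX2, with the `target` attribute just so it compiles without global `-mavx2` flags):

```c
#include <immintrin.h>

// Truncate 2x 8 32-bit ints to 16x 16-bit ints (a's elements first, then b's).
__attribute__((target("avx2")))
static __m256i pack32to16_trunc(__m256i a, __m256i b)
{
    const __m256i mask = _mm256_set1_epi32(0x0000FFFF);
    a = _mm256_and_si256(a, mask);              // clear high 16 bits so packus can't saturate
    b = _mm256_and_si256(b, mask);
    __m256i packed = _mm256_packus_epi32(a, b); // per lane: a's elems in low 64, b's in high 64
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0)); // 0xD8: fix qword order
}
```

The final vpermq selects source qwords 0, 2, 1, 3 so the elements come out in order across the lane boundary.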


Or you can use SSSE3 / AVX2 vpshufb (_mm256_shuffle_epi8) to extract the bytes you want from a single input, and zero the other half of each 128-bit lane (by setting the shuffle-control value for that element to have the sign bit set). Then use AVX2 vpermq to shuffle data from the two lanes into just the low 128.

const __m256i shuffle_mask_32_to_16 = _mm256_setr_epi8(   // per lane: keep bytes 0,1,4,5,8,9,12,13; -1 zeros the output byte
        0,1,4,5,8,9,12,13, -1,-1,-1,-1,-1,-1,-1,-1,  0,1,4,5,8,9,12,13, -1,-1,-1,-1,-1,-1,-1,-1);
__m256i trunc_elements = _mm256_shuffle_epi8(res256, shuffle_mask_32_to_16);
__m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);  // low qword of each lane -> low 128
__m128i result  = _mm256_castsi256_si128(ordered);   // the cast is free: no asm instructions

So this is 2 uops per 128 bits of results, but both of the uops are shuffles that run only on port 5 on mainstream Intel CPUs that support AVX2. That's fine as part of a loop that does plenty of work that can keep port0 / port1 busy, or if you need each 128-bit chunk separately anyway.


For Ryzen/Excavator, lane-crossing vpermq is expensive (because they split 256-bit instructions into multiple 128-bit uops, and don't have a real lane-crossing shuffle unit: http://agner.org/optimize/). So you'd want to vextracti128 / vpor to combine. Or maybe vpunpcklqdq so you can load the same shuffle mask with a set1_epi64 instead of needing a full 256-bit vector constant to shuffle elements in the upper lane to the upper 64 bits of that lane.
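That vpunpcklqdq variant can be sketched as follows (helper name is mine; assumes AVX2). Because vpshufb writes the same packed pattern into both halves of each lane, the control constant is the same 64-bit value in every qword, so _mm256_set1_epi64x (a broadcast load in asm) is enough:

```c
#include <immintrin.h>

// AMD-friendly truncation: no lane-crossing vpermq. Each lane's 16-bit results land
// in the low qword of that lane; vpunpcklqdq then merges the two low qwords.
__attribute__((target("avx2")))
static __m128i trunc32to16_noperm(__m256i v)
{
    // byte-shuffle control 0,1,4,5,8,9,12,13 broadcast to all four qwords
    const __m256i shuf = _mm256_set1_epi64x(0x0D0C090805040100LL);
    __m256i t  = _mm256_shuffle_epi8(v, shuf);
    __m128i lo = _mm256_castsi256_si128(t);
    __m128i hi = _mm256_extracti128_si256(t, 1);
    return _mm_unpacklo_epi64(lo, hi);  // vpunpcklqdq: low qword of each lane
}
```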

answered Dec 15 '22 by Peter Cordes