I want an AVX2 (or earlier) intrinsic that will convert an 8-wide 32-bit integer vector (256 bits total) into 8-wide 16-bit integer vector (128 bits total) [discarding the upper 16-bits of each element]. This should be the inverse of "_mm256_cvtepi16_epi32". If there is not a direct instruction, how should I best do this with a sequence of instructions?
There is no single-instruction inverse until AVX512F: __m128i _mm256_cvtepi32_epi16(__m256i a) (VPMOVDW), also available for 512->256 or 128->low_half_of_128. (The versions with inputs smaller than a 512-bit ZMM register also require AVX512VL, so only Skylake-X, not Xeon Phi KNL.)
There are signed/unsigned saturation versions of that AVX512 instruction, but only AVX512 has a pack instruction that truncates (discarding the upper bytes of each element) instead of saturating.
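If you can target AVX512, the whole thing is one intrinsic. A minimal sketch (the function name is mine; needs <immintrin.h> and AVX512F + AVX512VL, e.g. -march=skylake-avx512):

#include <immintrin.h>

// vpmovdw: truncate each 32-bit element to its low 16 bits.
__m128i narrow32to16_avx512(__m256i v) {
    return _mm256_cvtepi32_epi16(v);
}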
Or with AVX512BW, you could emulate a lane-crossing 2-input pack using vpermi2w to produce a 512-bit result from two 512-bit input vectors. On Skylake-AVX512 it decodes to multiple shuffle uops, but so does VPMOVDW, which is also a lane-crossing shuffle with granularity smaller than dword (32-bit). http://instlatx64.atw.hu/ has a spreadsheet of SKX uops / ports.
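As a sketch of that idea (my own choice of index vector, assuming AVX512BW and a compiler that provides _mm512_set_epi16): indices 0..31 select words from the first source, 32..63 from the second, so selecting every even word gives a truncating 2-input pack with the results in source order.

// Emulated lane-crossing 2-input truncating pack: 32 dwords in, 32 words out.
__m512i pack_truncate_2in(__m512i a, __m512i b) {
    const __m512i idx = _mm512_set_epi16(
        62, 60, 58, 56, 54, 52, 50, 48, 46, 44, 42, 40, 38, 36, 34, 32,  // low words of b's dwords
        30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10,  8,  6,  4,  2,  0); // low words of a's dwords
    return _mm512_permutex2var_epi16(a, idx, b);   // vpermi2w / vpermt2w
}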
The SSE2/AVX2 pack instructions like _mm256_packus_epi32 (vpackusdw) do signed or unsigned saturation, as well as operating within each 128-bit lane, unlike the lane-crossing behaviour of vpmovzxwd.
You could _mm256_and_si256 to clear the upper 16 bits of each element before packing, though, so the unsigned saturation never kicks in. That could be good if you have multiple input vectors, because _mm256_packus_epi32 takes 2 input vectors and produces a 256-bit output.
a = H G F E | D C B A   32-bit signed elements, shown from high element to low element, low 128-bit lane on the right
b = P O N M | L K J I
_mm256_packus_epi32(a, b) gives 16-bit unsigned elements:
  = P O N M H G F E | L K J I D C B A
    (elements from the first operand go to the low half of each lane)
If you can make efficient use of 2x vpand / vpackusdw ymm / vpermq ymm to get a 256-bit vector with all the elements in the right order, then that's probably best on Intel CPUs: only 2 shuffle uops (4 total uops) per 256 bits of results, and you get them in a single vector.
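A sketch of that two-input version (the function name is mine): mask away the high halves so unsigned saturation can't change anything, pack within lanes, then fix up the qword order with one lane-crossing vpermq.

// Truncate two vectors of 8x int32 into one vector of 16x uint16, in source order.
__m256i narrow32to16_2in(__m256i a, __m256i b) {
    const __m256i keep_low16 = _mm256_set1_epi32(0x0000FFFF);
    __m256i a_lo = _mm256_and_si256(a, keep_low16);        // vpand: clear upper 16 bits
    __m256i b_lo = _mm256_and_si256(b, keep_low16);        // vpand
    __m256i packed = _mm256_packus_epi32(a_lo, b_lo);      // vpackusdw: in-lane pack
    // qwords are now a0-3, b0-3, a4-7, b4-7; reorder to a0-3, a4-7, b0-3, b4-7
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));  // vpermq
}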
Or you can use SSSE3 / AVX2 vpshufb (_mm256_shuffle_epi8) to extract the bytes you want from a single input, and zero the other half of each 128-bit lane (by setting the sign bit in the shuffle-control byte for each element you want zeroed). Then use AVX2 vpermq to shuffle data from the two lanes into just the low 128 bits.
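The shuffle-control constant used in the snippet below isn't spelled out here, so this is my guess at its layout: for each 128-bit lane, gather the low 2 bytes of every dword into the lane's low 8 bytes and zero the upper 8 bytes with -1 (sign bit set).

const __m256i shuffle_mask_32_to_16 = _mm256_set_epi8(
    -1, -1, -1, -1, -1, -1, -1, -1, 13, 12, 9, 8, 5, 4, 1, 0,   // upper lane: zero top 8 bytes, keep the low word of each dword
    -1, -1, -1, -1, -1, -1, -1, -1, 13, 12, 9, 8, 5, 4, 1, 0);  // lower lane: same pattern (vpshufb indexes within each lane)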
__m256i trunc_elements = _mm256_shuffle_epi8(res256, shuffle_mask_32_to_16);
__m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);
__m128i result = _mm256_castsi256_si128(ordered); // no asm instructions
So this is 2 uops per 128 bits of results, but both of the uops are shuffles that run only on port 5 on mainstream Intel CPUs that support AVX2. That's fine as part of a loop that does plenty of work that can keep port0 / port1 busy, or if you need each 128-bit chunk separately anyway.
For Ryzen/Excavator, lane-crossing vpermq is expensive (because they split 256-bit instructions into multiple 128-bit uops, and don't have a real lane-crossing shuffle unit: http://agner.org/optimize/). So you'd want to vextracti128 / vpor to combine. Or maybe vpunpcklqdq, so you can load the same shuffle mask with a set1_epi64 instead of needing a full 256-bit vector constant to shuffle elements in the upper lane to the upper 64 bits of that lane.
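A sketch of that broadcast-mask variant (the function name is mine): every qword of the shuffle control is identical, so set1_epi64x works, and vpunpcklqdq glues the two lanes' low qwords together.

// One-input truncate that avoids vpermq (friendlier on Ryzen / Excavator).
__m128i narrow32to16_no_vpermq(__m256i v) {
    // bytes 0,1, 4,5, 8,9, 12,13 of each lane, i.e. the low word of each dword
    const __m256i ctrl = _mm256_set1_epi64x(0x0D0C090805040100);
    __m256i t = _mm256_shuffle_epi8(v, ctrl);       // vpshufb (in-lane)
    __m128i lo = _mm256_castsi256_si128(t);         // low lane: truncated words in its low 64 bits
    __m128i hi = _mm256_extracti128_si256(t, 1);    // vextracti128: high lane
    return _mm_unpacklo_epi64(lo, hi);              // vpunpcklqdq: combine the two qwords
}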