I have some code using the AVX2 intrinsic _mm256_permutevar8x32_epi32 (aka vpermd) to select integers from an input vector by an index vector. Now I need the same thing but for 4x32 instead of 8x32. _mm_permutevar_ps does it for floating point, but I'm using integers.

One idea is _mm_shuffle_epi32, but I'd first need to convert my 4x32 index values to a single immediate integer, that is:
imm[1:0] := idx[31:0]
imm[3:2] := idx[63:32]
imm[5:4] := idx[95:64]
imm[7:6] := idx[127:96]
I'm not sure of the best way to do that, and moreover I'm not sure it's the best way to proceed. I'm looking for the most efficient method on Broadwell/Haswell to emulate the "missing" _mm_permutevar_epi32(__m128i a, __m128i idx). I'd rather use 128-bit instructions than 256-bit ones if possible (i.e. I don't want to widen the 128-bit inputs and then narrow the result).
It's useless to generate an immediate at run-time, unless you're JITing new code. An immediate is a byte that's literally part of the machine-code instruction encoding. That's great if you have a compile-time-constant shuffle (after inlining + template expansion), otherwise forget about those shuffles that take the control operand as an integer (see footnote 1).
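For illustration only (not part of the original answer): in the compile-time-constant case, the _MM_SHUFFLE macro packs the four 2-bit indices into the imm8 that _mm_shuffle_epi32 wants, matching the imm layout shown in the question. The function name here is made up.

#include <immintrin.h>

/* Hypothetical example: reverse the four dwords with a compile-time control.
   _MM_SHUFFLE(d,c,b,a) builds (d<<6)|(c<<4)|(b<<2)|a, i.e. imm[1:0] selects
   element 0, imm[3:2] selects element 1, and so on. */
static inline __m128i reverse_dwords(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}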
Before AVX, the only variable-control shuffle was SSSE3 pshufb (_mm_shuffle_epi8). That's still the only 128-bit (or in-lane) variable-control integer shuffle instruction in AVX2, and I think AVX512.
AVX1 added some in-lane 32-bit variable shuffles, like vpermilps (_mm_permutevar_ps). AVX2 added lane-crossing integer and FP shuffles, but somewhat strangely no 128-bit version of vpermd. Perhaps that's because Intel microarchitectures have no penalty for using FP shuffles on integer data. (That's true on Sandybridge-family; I just don't know whether it was part of the reasoning for the ISA design.) But then you'd think they would have added __m128i intrinsics for vpermilps if that's what you were "supposed" to do. Or maybe the compiler / intrinsics design people didn't agree with the asm instruction-set people?
If you have a runtime-variable vector of 32-bit indices and want to do a shuffle with 32-bit granularity, by far your best bet is to just use AVX _mm_permutevar_ps:

_mm_castps_si128( _mm_permutevar_ps(_mm_castsi128_ps(a), idx) )
On Intel at least, it won't even introduce any extra bypass latency when used between integer instructions like paddd; i.e. FP shuffles specifically (not blends) have no penalty for use on integer data on Sandybridge-family CPUs. If there's any penalty on AMD Bulldozer or Ryzen, it's minor and definitely cheaper than the cost of calculating a shuffle-control vector for (v)pshufb.
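A minimal sketch (my naming, not from the answer) of that pattern, with the FP-domain shuffle sandwiched between integer adds:

#include <immintrin.h>

/* Hypothetical example: vpermilps between two paddd instructions. On
   Sandybridge-family CPUs the FP shuffle adds no bypass latency here. */
static inline __m128i add_shuffle_add(__m128i a, __m128i b, __m128i idx) {
    __m128i t = _mm_add_epi32(a, b);                          /* paddd     */
    t = _mm_castps_si128(
            _mm_permutevar_ps(_mm_castsi128_ps(t), idx));     /* vpermilps */
    return _mm_add_epi32(t, b);                               /* paddd     */
}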
Using vpermd ymm and ignoring the upper 128 bits of input and output (i.e. by using cast intrinsics) would be much slower on AMD (because its 128-bit SIMD design has to split lane-crossing 256-bit shuffles into several uops), and it's also worse on Intel, where it becomes 3-cycle latency instead of 1 cycle.
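For comparison, this is roughly what that discouraged widen/permute/narrow approach would look like (a sketch with my naming, assuming the indices are 0..3):

#include <immintrin.h>

/* Emulation via 256-bit vpermd, keeping only the low 128 bits of the result.
   Slower than vpermilps: lane-crossing, so 3c latency on Intel and several
   uops on AMD. Requires AVX2 anyway. */
static inline __m128i vperm_via_ymm(__m128i a, __m128i idx) {
    __m256i a256   = _mm256_castsi128_si256(a);     /* upper lane undefined       */
    __m256i idx256 = _mm256_castsi128_si256(idx);   /* indices assumed to be 0..3 */
    __m256i r      = _mm256_permutevar8x32_epi32(a256, idx256);  /* vpermd ymm   */
    return _mm256_castsi256_si128(r);
}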
@Iwill's answer shows a way to calculate a shuffle-control vector of byte indices for pshufb from a vector of 4x 32-bit dword indices. But it uses SSE4.1 pmulld, which is 2 uops on most CPUs and could easily be a worse bottleneck than the shuffles themselves. (See the discussion in comments under that answer.) That's especially true on older CPUs without AVX, some of which can do 2 pshufb per clock, unlike modern Intel (Haswell and later have only 1 shuffle port and easily bottleneck on shuffles; Ice Lake will add another shuffle port, according to Intel's Sunny Cove presentation).
If you do have to write an SSSE3 or SSE4.1 version of this, it's probably best to still use only SSSE3 and use pshufb plus a left shift to duplicate a byte within a dword before ORing in the 0,1,2,3 into the low bits, rather than pmulld. SSE4.1 pmulld is multiple uops and even worse than pshufb on some CPUs with slow pshufb. (You might not benefit from vectorizing at all on CPUs with only SSSE3 and not SSE4.1, i.e. first-gen Core 2, because it has slow-ish pshufb.)
On 2nd-gen Core 2 and on Goldmont, pshufb is a single-uop instruction with 1-cycle latency. On Silvermont and first-gen Core 2 it's not so good. But overall I'd recommend pshufb + pslld + por to calculate a control vector for another pshufb if AVX isn't available; a sketch of that sequence is below.
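A minimal sketch of that SSSE3-only sequence (my naming and constants, assuming the indices are already in the range 0..3): broadcast each dword's low index byte with pshufb, scale by 4 with pslld, OR in the 0,1,2,3 byte offsets, then do the real pshufb.

#include <tmmintrin.h>   /* SSSE3 */

static inline __m128i vperm_ssse3(__m128i a, __m128i idx) {
    /* Broadcast the low byte of each dword index to all 4 bytes of that dword. */
    __m128i ctrl = _mm_shuffle_epi8(idx,
            _mm_set_epi8(12,12,12,12, 8,8,8,8, 4,4,4,4, 0,0,0,0));
    /* Scale each index by 4 (safe: the bytes are 0..3, so no carry across bytes). */
    ctrl = _mm_slli_epi32(ctrl, 2);
    /* OR in the byte offsets 0,1,2,3 within each selected dword. */
    ctrl = _mm_or_si128(ctrl, _mm_set1_epi32(0x03020100));
    return _mm_shuffle_epi8(a, ctrl);
}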
An extra shuffle to prepare for a shuffle is far worse than just using vpermilps on any CPU that supports AVX.
Footnote 1:
You'd have to use a switch or something to select a code path with the right compile-time-constant integer, and that's horrible; only consider it if you don't even have SSSE3 available. (A rough sketch of what that dispatch looks like is below.) It may be worse than scalar unless the jump-table branch predicts perfectly.
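Purely to illustrate the footnote (a hypothetical sketch, not a recommendation): the run-time indices have to be packed into a byte and then dispatched to a path whose immediate is a compile-time constant, one case per possible control byte.

#include <emmintrin.h>   /* SSE2 */

/* imm8 packs the four 2-bit indices exactly as laid out in the question. */
static __m128i vperm_switch(__m128i a, unsigned imm8) {
    switch (imm8 & 0xFF) {
    case 0x00: return _mm_shuffle_epi32(a, 0x00);
    case 0x01: return _mm_shuffle_epi32(a, 0x01);
    /* ... 256 cases in total, one per possible control byte ... */
    case 0xFF: return _mm_shuffle_epi32(a, 0xFF);
    default:   return a;   /* unreachable once all 256 cases are written out */
    }
}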
Although Peter Cordes is correct in saying that the AVX instruction vpermilps and its intrinsic _mm_permutevar_ps() will probably do the job, if you're working on machines older than Sandy Bridge, an SSE4.1 variant using pshufb works quite well too.

Credits to @PeterCordes for the AVX version:
#include <stdio.h>
#include <immintrin.h>

/* Emulate the missing _mm_permutevar_epi32 with the FP shuffle vpermilps. */
__m128i vperm(__m128i a, __m128i idx){
    return _mm_castps_si128(_mm_permutevar_ps(_mm_castsi128_ps(a), idx));
}

int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1, 0, 3, 2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}
#include <stdio.h>
#include <immintrin.h>

/* SSE4.1 version: build a byte shuffle-control vector for pshufb from the
   4x 32-bit dword indices. */
__m128i vperm(__m128i a, __m128i idx){
    idx = _mm_and_si128  (idx, _mm_set1_epi32(0x00000003));  /* clamp indices to 0..3     */
    idx = _mm_mullo_epi32(idx, _mm_set1_epi32(0x04040404));  /* 4*idx in every byte       */
    idx = _mm_or_si128   (idx, _mm_set1_epi32(0x03020100));  /* add byte offsets 0,1,2,3  */
    return _mm_shuffle_epi8(a, idx);
}

int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1, 0, 3, 2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}
The vperm function of the SSE4.1 version compiles down to this crisp sequence:
0000000000400550 <vperm>:
400550: c5 f1 db 0d b8 00 00 00 vpand 0xb8(%rip),%xmm1,%xmm1 # 400610 <_IO_stdin_used+0x20>
400558: c4 e2 71 40 0d bf 00 00 00 vpmulld 0xbf(%rip),%xmm1,%xmm1 # 400620 <_IO_stdin_used+0x30>
400561: c5 f1 eb 0d c7 00 00 00 vpor 0xc7(%rip),%xmm1,%xmm1 # 400630 <_IO_stdin_used+0x40>
400569: c4 e2 79 00 c1 vpshufb %xmm1,%xmm0,%xmm0
40056e: c3 retq
The AND-masking is optional if you can guarantee that the control indices will always be the 32-bit integers 0, 1, 2 or 3.