 

SSE: shuffle (permutevar) 4x32 integers

I have some code using the AVX2 intrinsic _mm256_permutevar8x32_epi32 aka vpermd to select integers from an input vector by an index vector. Now I need the same thing but for 4x32 instead of 8x32. _mm_permutevar_ps does it for floating point, but I'm using integers.

One idea is _mm_shuffle_epi32, but I'd first need to convert my 4x32 index values to a single integer, that is:

imm[1:0] := idx[31:0]
imm[3:2] := idx[63:32]
imm[5:4] := idx[95:64]
imm[7:6] := idx[127:96]

I'm not sure what's the best way to do that, and moreover I'm not sure it's the best way to proceed. I'm looking for the most efficient method on Broadwell/Haswell to emulate the "missing" _mm_permutevar_epi32(__m128i a, __m128i idx). I'd rather use 128-bit instructions than 256-bit ones if possible (i.e. I don't want to widen the 128-bit inputs then narrow the result).
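For concreteness, the conversion I have in mind would look something like this (just a sketch; pack_imm8 is a made-up helper name, and _mm_extract_epi32 needs SSE4.1):

unsigned pack_imm8(__m128i idx){
    // gather the low 2 bits of each dword into one 8-bit value
    return  ( _mm_cvtsi128_si32(idx)       & 3)
          | ((_mm_extract_epi32(idx, 1) & 3) << 2)
          | ((_mm_extract_epi32(idx, 2) & 3) << 4)
          | ((_mm_extract_epi32(idx, 3) & 3) << 6);
}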

asked May 08 '19 by John Zwinck

2 Answers

It's useless to generate an immediate at run-time, unless you're JITing new code. An immediate is a byte that's literally part of the machine-code instruction encoding. That's great if you have a compile-time-constant shuffle (after inlining + template expansion), otherwise forget about those shuffles that take the control operand as an integer¹.
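For reference, a compile-time-constant shuffle looks like the sketch below (function name mine); the _MM_SHUFFLE macro just packs four 2-bit source indices into the immediate byte:

__m128i reverse_dwords(__m128i a){
    // the control byte 0x1B is baked into the pshufd instruction encoding;
    // _MM_SHUFFLE(3,2,1,0) would be the identity
    return _mm_shuffle_epi32(a, _MM_SHUFFLE(0,1,2,3));
}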


Before AVX, the only variable-control shuffle was SSSE3 pshufb (_mm_shuffle_epi8). That's still the only 128-bit (or in-lane) variable-control integer shuffle instruction in AVX2, and I think in AVX-512 as well.

AVX1 added some in-lane 32-bit variable shuffles, like vpermilps (_mm_permutevar_ps). AVX2 added lane-crossing integer and FP shuffles, but somewhat strangely no 128-bit version of vpermd. Perhaps because Intel microarchitectures have no penalty for using FP shuffles on integer data. (Which is true on Sandybridge-family; I just don't know whether that was part of the reasoning for the ISA design.) But you'd think they would have added __m128i intrinsics for vpermilps if that's what you were "supposed" to do. Or maybe the compiler / intrinsics design people didn't agree with the asm instruction-set people?


If you have a runtime-variable vector of 32-bit indices and want to do a shuffle with 32-bit granularity, by far your best bet is to just use AVX _mm_permutevar_ps.

_mm_castps_si128( _mm_permutevar_ps (_mm_castsi128_ps(a), idx) )

On Intel at least, it won't even introduce any extra bypass latency when used between integer instructions like paddd; i.e. FP shuffles specifically (not blends) have no penalty for use on integer data in Sandybridge-family CPUs.

If there's any penalty on AMD Bulldozer or Ryzen, it's minor and definitely cheaper than the cost of calculating a shuffle-control vector for (v)pshufb.

Using vpermd ymm and ignoring the upper 128 bits of input and output (i.e. by using cast intrinsics) would be much slower on AMD (because its 128-bit SIMD design has to split lane-crossing 256-bit shuffles into several uops), and also worse on Intel, where the lane-crossing shuffle has 3-cycle latency instead of 1.
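For illustration only (my sketch, not a recommendation), that widen-and-ignore version would be:

__m128i vperm_via_ymm(__m128i a, __m128i idx){
    // do the shuffle at 256-bit width and discard the upper halves;
    // safe only if every index is 0..3, so vpermd never selects from
    // the undefined upper lane
    return _mm256_castsi256_si128(
        _mm256_permutevar8x32_epi32(_mm256_castsi128_si256(a),
                                    _mm256_castsi128_si256(idx)));
}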


@Iwill's answer shows a way to calculate a shuffle-control vector of byte indices for pshufb from a vector of 4x32-bit dword indices. But it uses SSE4.1 pmulld which is 2 uops on most CPUs, and could easily be a worse bottleneck than shuffles. (See discussion in comments under that answer.) Especially on older CPUs without AVX, some of which can do 2 pshufb per clock unlike modern Intel (Haswell and later only have 1 shuffle port and easily bottleneck on shuffles. IceLake will add another shuffle port, according to Intel's Sunny Cove presentation.)

If you do have to write an SSSE3 or SSE4.1 version of this, it's probably best to stick to plain SSSE3: use a left shift plus pshufb to duplicate a byte within each dword before ORing in the 0,1,2,3 byte offsets, rather than using pmulld. SSE4.1 pmulld is multiple uops and even worse than pshufb on some CPUs with slow pshufb. (You might not benefit from vectorizing at all on CPUs with only SSSE3 and not SSE4.1, i.e. first-gen Core 2, which has slow-ish pshufb.)

On 2nd-gen Core 2 and on Goldmont, pshufb is a single-uop instruction with 1-cycle latency. On Silvermont and first-gen Core 2 it's not so good. But overall I'd recommend pshufb + pslld + por to calculate a control vector for another pshufb if AVX isn't available, along the lines of the sketch below.
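A minimal sketch of that SSSE3-only control-vector calculation (function name mine; drop the AND if the indices are guaranteed to be 0..3):

__m128i vperm_ssse3(__m128i a, __m128i idx){
    idx = _mm_and_si128(idx, _mm_set1_epi32(3));            // clamp indices to 0..3
    idx = _mm_slli_epi32(idx, 2);                           // 4*idx in each dword's low byte
    idx = _mm_shuffle_epi8(idx, _mm_set_epi8(12,12,12,12,   // broadcast each dword's low
                                              8, 8, 8, 8,   // byte to all 4 bytes of
                                              4, 4, 4, 4,   // that dword
                                              0, 0, 0, 0));
    idx = _mm_or_si128(idx, _mm_set1_epi32(0x03020100));    // byte offsets 0,1,2,3
    return _mm_shuffle_epi8(a, idx);                        // the actual data shuffle
}

That's pslld + pshufb + por to build the control vector, then one more pshufb for the data: all single-uop instructions on CPUs with fast pshufb.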

An extra shuffle to prepare for a shuffle is far worse than just using vpermilps on any CPU that supports AVX.


Footnote 1:

You'd have to use a switch or something to select a code path with the right compile-time-constant integer, and that's horrible; only consider that if you don't even have SSSE3 available. It may be worse than scalar unless the jump-table branch predicts perfectly.
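A hypothetical sketch of that dispatch (names mine), just to show the shape of the thing:

__m128i shuffle_by_runtime_control(__m128i a, unsigned imm8){
    switch (imm8) {                                     // jump table on the control byte
        case 0x1B: return _mm_shuffle_epi32(a, 0x1B);   // reverse dwords
        case 0xB1: return _mm_shuffle_epi32(a, 0xB1);   // swap within pairs
        case 0xE4: return a;                            // identity
        // ... 253 more cases to cover every control byte ...
        default:   return a;  // unreachable once all cases are written out
    }
}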

answered by Peter Cordes

Although Peter Cordes is correct in saying that the AVX instruction vpermilps and its intrinsic _mm_permutevar_ps() will probably do the job, if you're working on machines older than Sandy Bridge, an SSE4.1 variant using pshufb works quite well too.

AVX variant

Credits to @PeterCordes

#include <stdio.h>
#include <immintrin.h>


__m128i vperm(__m128i a, __m128i idx){
    // vpermilps only looks at the low 2 bits of each dword index, and the
    // casts are free: they just reinterpret the integer data as floats and back
    return _mm_castps_si128(_mm_permutevar_ps(_mm_castsi128_ps(a), idx));
}


int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1,0,3,2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}

SSE4.1 variant

#include <stdio.h>
#include <immintrin.h>


__m128i vperm(__m128i a, __m128i idx){
    idx = _mm_and_si128  (idx, _mm_set1_epi32(0x00000003)); // clamp each index to 0..3
    idx = _mm_mullo_epi32(idx, _mm_set1_epi32(0x04040404)); // replicate 4*idx into all 4 bytes of each dword
    idx = _mm_or_si128   (idx, _mm_set1_epi32(0x03020100)); // add byte offsets 0,1,2,3 within each dword
    return _mm_shuffle_epi8(a, idx);                        // pshufb with the computed byte control
}


int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1,0,3,2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}

This compiles down to this crisp sequence:

0000000000400550 <vperm>:
  400550:       c5 f1 db 0d b8 00 00 00         vpand  0xb8(%rip),%xmm1,%xmm1        # 400610 <_IO_stdin_used+0x20>
  400558:       c4 e2 71 40 0d bf 00 00 00      vpmulld 0xbf(%rip),%xmm1,%xmm1        # 400620 <_IO_stdin_used+0x30>
  400561:       c5 f1 eb 0d c7 00 00 00         vpor   0xc7(%rip),%xmm1,%xmm1        # 400630 <_IO_stdin_used+0x40>
  400569:       c4 e2 79 00 c1                  vpshufb %xmm1,%xmm0,%xmm0
  40056e:       c3                              retq

The AND-masking is optional if you can guarantee that the control indices will always be the 32-bit integers 0, 1, 2 or 3.

answered by Iwillnotexist Idonotexist