
Get an arbitrary float from a simd register at runtime?

Tags: x86, avx, simd, sse, avx2

I want to access an arbitrary float from a simd register. I know that I can do things like:

float get(const __m128& a, const int idx){
    // editor's note: this type-puns the FP bit-pattern to int and converts to float
    return _mm_extract_ps(a,idx);
}

or

float get(const __m128& a, const int idx){
    return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(0,0,0,idx)));
}

or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and extract all take an 8-bit immediate).

I could also do it using _mm_store_ps() and then using the resulting array, but that would require going out to memory. Is there any way of doing it faster than that?

Edit: Ignore the first code snippet, I wanted the float at that position, not as an int, like _mm_extract_ps returns.

asked Jul 18 '18 by BadProgrammer99




1 Answer

First of all, you definitely don't want _mm_extract_ps, unless you want to type-pun FP to int (see footnote 1 at the bottom).

But anyway, for a runtime-variable index, you probably don't want to branch to select an instruction with the right imm8.

source + asm output for gcc/icc/clang/msvc on the Godbolt compiler explorer for all the functions in this answer. Including (at the bottom) some test callers that use a compile-time constant idx so you can see what will happen when inlining + constant-propagation happens in your real program. And/or two indices from the same vector (only gcc CSEs and reloads twice from the same store, other compilers store twice).

Store/reload optimizes with gcc/clang/ICC (but the variable-idx version is higher latency). Other ways only optimize nicely for constant inputs with clang. (clang can even see through the pshufb version and turn it into vshufps imm8 or vpermilps imm8, or a no-op for idx=0). Other compilers do stupid stuff like zeroing a vector with vxorps and using that as a vpermilps control!
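
For instance, a test caller along these lines is enough to see that behaviour (a sketch; the wrapper name is mine, and get128_avx is defined below):

// A sketch of a test caller with a compile-time-constant idx (the wrapper
// name is illustrative; get128_avx is defined later in this answer).
// After inlining + constant-propagation, this should become a single
// vpermilps/vshufps with an immediate, or nothing at all for idx = 0.
float test_constant_idx(__m128 v) {
    return get128_avx(v, 2);
}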


128-bit vectors: use a variable-shuffle if you have SSSE3 pshufb, or AVX

With AVX1 you can do it in only 2 ALU uops for 128-bit vectors using vpermilps, which is a variable-shuffle that uses dword selector elements, unlike pshufb.

This lets you do exactly the same shuffle as your _mm_shuffle_ps (including copying the low element to the upper 3 elements which is fine), but with a runtime index instead of an immediate.

// you can pass vectors by value.  Not that it matters when inlining
static inline
float get128_avx(__m128 a, int idx){
    __m128i vidx = _mm_cvtsi32_si128(idx);          // vmovd
    __m128  shuffled = _mm_permutevar_ps(a, vidx);  // vpermilps
    return _mm_cvtss_f32(shuffled);
}

gcc and clang compile it like this for x86-64 (Godbolt compiler explorer):

    vmovd           xmm1, edi
    vpermilps       xmm0, xmm0, xmm1
    ret

Without AVX but with SSSE3, you can load or create a mask for pshufb. It's fairly common to index an array of 4 __m128i vectors, especially using a _mm_movemask_ps result as an index (a sketch of that table-of-masks approach appears at the end of this section). But here we only care about the low 32-bit element, so we can do better than that.

In fact, the regular nature of the pattern means we can create it with a multiply and add, using two 32-bit immediate operands.

static inline
float get128_ssse3(__m128 a, int idx) {
    const uint32_t low4 = 0x03020100, step4=0x04040404;
    uint32_t selector = low4 + idx*step4;
    __m128i vidx = _mm_cvtsi32_si128(selector);

    // alternative: load a 4-byte window into 0..15 from memory.  worse latency
    // static constexpr uint32_t shuffles[4] = { low4, low4+step4*1, low4+step4*2, low4+step4*3 };
    //__m128i vidx = _mm_cvtsi32_si128(shuffles[idx]);
    __m128i shuffled = _mm_shuffle_epi8(_mm_castps_si128(a), vidx);
    return _mm_cvtss_f32(_mm_castsi128_ps(shuffled));
}

gcc output for -O3 -march=nehalem (other compilers do the same, modulo maybe a wasted movaps):

get128_ssse3(float __vector(4), int):
    imul    edi, edi, 67372036        # 0x04040404
    add     edi, 50462976             # 0x03020100
    movd    xmm1, edi
    pshufb  xmm0, xmm1
    ret                     # with the float we want at the bottom of XMM0

So without AVX, store/reload saves instructions (and uops), especially if the compiler can avoid sign-extending or zero-extending the index.

Latency from idx to result = imul(3) + add(1) + movd(2) + pshufb(1) on Intel CPUs from Core2(Penryn) and newer. Latency from input vector to result is only pshufb, though. (Plus bypass-delay latency on Nehalem.) http://agner.org/optimize/
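
For reference, here's a minimal sketch of the "index an array of shuffle-control vectors" approach mentioned above (the function and table are mine, not code from the original answer). It's more interesting when you want a whole shuffled vector rather than just the low element:

// A sketch of indexing a table of pshufb control vectors at runtime
// (illustrative names; not code from the original answer).
static inline __m128 broadcast_elem_ssse3(__m128 a, unsigned idx) {
    // one 16-byte pshufb control per element to broadcast
    alignas(16) static const char masks[4][16] = {
        {  0, 1, 2, 3,   0, 1, 2, 3,   0, 1, 2, 3,   0, 1, 2, 3 },   // element 0
        {  4, 5, 6, 7,   4, 5, 6, 7,   4, 5, 6, 7,   4, 5, 6, 7 },   // element 1
        {  8, 9,10,11,   8, 9,10,11,   8, 9,10,11,   8, 9,10,11 },   // element 2
        { 12,13,14,15,  12,13,14,15,  12,13,14,15,  12,13,14,15 },   // element 3
    };
    __m128i ctrl = _mm_load_si128((const __m128i*)masks[idx & 3]);   // runtime table lookup
    return _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(a), ctrl));
}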


__m256 256-bit vectors: shuffle with AVX2, otherwise probably store/reload

Unlike AVX1, AVX2 has lane-crossing variable shuffles like vpermps. (AVX1's lane-crossing shuffles, such as vperm2f128, only move whole 128-bit lanes, selected by an immediate.) We can use vpermps as a drop-in replacement for AVX1 vpermilps to grab an element from a 256-bit vector.

There are two intrinsics for vpermps (See Intel's intrinsics finder).

  • _mm256_permutevar8x32_ps(__m256 a, __m256i idx): the old name, with operands in opposite order to the asm instruction.
  • _mm256_permutexvar_ps(__m256i idx, __m256 a): the new name, introduced with AVX512, with operands in the correct order (matching the asm operand order, opposite from _mm_shuffle_epi8 or _mm_permutevar_ps). The asm instruction-set reference manual entry only lists this version, and lists it with the wrong type (__m256 i for the control operand).

    gcc and ICC accept this newer intrinsic name with only AVX2 enabled, not AVX512. But unfortunately clang only accepts it with -mavx512vl (or -march=skylake-avx512), so you can't use it portably. So just use the clunkier 8x32 name, which works everywhere.

#ifdef __AVX2__
float get256_avx2(__m256 a, int idx) {
    __m128i vidx = _mm_cvtsi32_si128(idx);          // vmovd
    __m256i vidx256 = _mm256_castsi128_si256(vidx);  // no instructions
    __m256  shuffled = _mm256_permutevar8x32_ps(a, vidx256);  // vpermps

    // operand order matches asm for the new name: index first, unlike pshufb and vpermilps
    //__m256  shuffled = _mm256_permutexvar_ps(vidx256, a);  // vpermps
    return _mm256_cvtss_f32(shuffled);
}
#endif

_mm256_castsi128_si256 technically leaves the upper lane undefined (so the compiler never needs to spend an instruction zero-extending it), but we don't care about the upper lane anyway.

This compiles to just

    vmovd   xmm1, edi
    vpermps ymm0, ymm1, ymm0
     # vzeroupper        # these go away when inlining
     # ret

So it's fantastic on Intel CPUs, just 3c latency from input vector to result, and 2 uops throughput cost (but both uops need port 5).

Lane-crossing shuffles on AMD are significantly more expensive.
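
To tie the two options in this section's heading together, here's a minimal dispatch sketch (the wrapper name is mine; get256_avx2 is above, and get256_reload is defined in the store/reload section below):

// A sketch (illustrative wrapper name): use the AVX2 lane-crossing shuffle
// when it's available at compile time, otherwise fall back to the portable
// store/reload version shown later in this answer.
static inline float get256(__m256 a, int idx) {
#ifdef __AVX2__
    return get256_avx2(a, idx);         // vmovd + vpermps
#else
    return get256_reload(a, idx);       // portable store/reload, defined below
#endif
}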


Store/reload

Cases where store/reload is actually good:

  • 256-bit vectors without AVX2, or 128-bit vectors without SSSE3.
  • if you need 2 or more elements from the same vector (but note that compilers other than gcc store multiple times if you actually call get128_reload. So if you do this, manually inline the vector store and index it multiple times, as in the sketch below.)
  • When ALU port pressure (especially the shuffle port) is a problem, and throughput is more important than latency. On Intel CPUs, movd xmm, eax also runs on port 5, so it competes with shuffles. But hopefully you're only using scalar extraction outside an inner loop, with lots of surrounding code that does other stuff.

  • When idx is often a compile-time constant and you want to let the compiler pick the shuffles for you.

A bad idx can crash your program instead of just giving you the wrong element, though. The methods that turn the index directly into a shuffle control ignore the high bits.
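
For reference, a minimal sketch of the get128_reload referred to above, plus the manually-reused variant from the second bullet (my reconstruction, not code copied from the original answer):

// A minimal 128-bit store/reload.  Note: an out-of-range idx indexes past the
// array, matching the caveat above; mask with  idx & 3  if you'd rather get a
// wrong element than a crash.
static inline float get128_reload(__m128 a, int idx) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, a);
    return tmp[idx];
}

// Manually reusing one store for two elements, as suggested in the second
// bullet above (illustrative name): one store, two reloads, instead of two
// stores from calling get128_reload twice with most compilers.
static inline float sum_two_elems(__m128 a, int i, int j) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, a);
    return tmp[i] + tmp[j];
}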

Beware that ICC sometimes misses optimizing a constant index into a shuffle after inlining. ICC does ok with test_reload2 in the Godbolt example.

Store/reload to a local array is totally fine for throughput (maybe not latency), and only has ~6 cycle latency on typical CPUs, thanks to store-forwarding. Most CPUs have more front-end throughput than vector ALUs, so including some store/reload in the mix is not bad at all, if you're anywhere near bottlenecking on ALU throughput instead of store/load throughput.

A wide store can forward to a narrow reload, subject to a few alignment constraints. I think a naturally-aligned dword reload of any of the 4 or 8 elements of a vector is fine on mainstream Intel CPUs, but you can check Intel's optimization manual. See performance links in the x86 tag wiki.

In GNU C, you can index a vector like an array. It compiles to a store/reload if the index isn't a compile-time constant after inlining.

#ifdef __GNUC__                      // everything except MSVC
float get128_gnuc(__m128 a, int idx) {
    return a[idx]; 
    // clang turns it into idx&3
    // gcc compiles it exactly like get_reload
}
#endif

 # gcc8.1 -O3 -march=haswell
    movsx   rdi, edi                            # sign-extend int to pointer width
    vmovaps XMMWORD PTR [rsp-24], xmm0          # store into the red-zone
    vmovss  xmm0, DWORD PTR [rsp-24+rdi*4]      # reload

The fully portable way to write this (shown here as a 256-bit version) is:

float get256_reload(__m256 a, int idx) {
    // with lower alignment and storeu, compilers still choose to align by 32 because they see the store
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, a);
    return tmp[idx];
}

Compilers need multiple instructions to align the stack in a stand-alone version of the function, but of course after inlining this would happen only in the outer containing function, hopefully outside any small loops.

You could consider storing high/low halves of the vector separately with vextractf128 and 128-bit vmovups, like GCC does for _mm256_storeu_ps when it doesn't know the destination is aligned, for tune=generic (helps Sandybridge and AMD). That would avoid the need for a 32-byte aligned array, and have basically no downside on AMD CPUs. But it's worse on Intel vs. an aligned store because it costs extra uops, assuming the cost of aligning the stack can be amortized over many get() operations. (Functions using __m256 sometimes end up aligning the stack anyway, so you might already be paying the cost.) You should probably just use an aligned array unless you're tuning only for Bulldozer, Ryzen, and Sandybridge or something.
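
A sketch of that split-store idea (the function name is mine, not from the original answer); the compiler should turn the second store into vextractf128 with a memory destination:

static inline float get256_splitstore(__m256 a, int idx) {
    float tmp[8];                                          // no 32-byte alignment required
    _mm_storeu_ps(tmp,     _mm256_castps256_ps128(a));     // vmovups of the low half
    _mm_storeu_ps(tmp + 4, _mm256_extractf128_ps(a, 1));   // high half: vextractf128 to memory
    return tmp[idx];
}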


Footnote 1: _mm_extract_ps returns the FP bit-pattern as an int. The underlying asm instruction (extractps r/m32, xmm, imm8) could be useful to store a float to memory, but not to shuffle an element to the bottom of an XMM register. It's the FP version of pextrd r/m32, xmm, imm8.

So your first function is actually converting the integer bit-pattern to FP, with a compiler-generated cvtsi2ss, because C allows implicit conversion from int to float.
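
To make the distinction concrete, a small sketch (variable names are mine; suppose element 2 of v holds 1.0f, whose bit-pattern 0x3f800000 equals 1065353216):

#include <string.h>   // memcpy

float extract_demo(__m128 v) {
    float converted = _mm_extract_ps(v, 2);        // implicit int->float conversion (cvtsi2ss): 1065353216.0f
    int   bits      = _mm_extract_ps(v, 2);        // the raw bit-pattern as an integer
    float recovered;
    memcpy(&recovered, &bits, sizeof(recovered));  // type-pun the bits back: 1.0f
    return recovered - converted;                  // just to use both values
}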

answered Oct 18 '22 by Peter Cordes