
Getting Fewest Instructions for `rsqrtss` Wrapper

I figured it was about time to use a fast reciprocal square root. So, I tried writing a function (which would be marked inline in production):

float sqrt_recip(float x) {
  return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ps1(x) ) ); //same as _mm_set1_ps
}

TL;DR: My question is "how can I get GCC and ICC to output minimal assembly (two instructions) for the above function, preferably without resorting to raw assembly (sticking with intrinsics)?"

As written, on ICC 13.0.1, GCC 5.2.0, and Clang 3.7 the output is:

shufps  xmm0, xmm0, 0
rsqrtss xmm0, xmm0
ret

This makes sense, since I used _mm_set_ps1 to scatter x into all components of the register. But I don't really need to do that; I'd prefer to get only the last two instructions. Sure, shufps has only one cycle of latency, but rsqrtss has only three to five, so the shuffle adds 20% to 33% overhead that's completely worthless.


Some things I tried:

  • I tried just not setting it:
    union { __m128 v; float f[4]; } u;
    u.f[0] = x;
    return _mm_cvtss_f32(_mm_rsqrt_ss(u.v));
    This actually works for Clang, but the output from ICC and, in particular, GCC is appalling.

  • Instead of scattering, you can fill the upper components with zeroes (that is, use _mm_set_ss). Again, neither GCC's nor ICC's output is optimal. GCC, hilariously, adds a store and reload through the stack:
    movss DWORD PTR [rsp-12], xmm0
    movss xmm0, DWORD PTR [rsp-12]
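
For reference, the zero-filling variant from the second bullet might look like the sketch below. The name sqrt_recip_set_ss is made up for illustration; the intrinsics are the standard ones from <xmmintrin.h>. It computes the same low-lane result as the broadcast version, and whether extra moves show up depends on the compiler and version, as described above.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Hypothetical name. Puts x in the low lane and zeroes the upper three
   lanes, instead of broadcasting x to all four lanes with _mm_set_ps1. */
static inline float sqrt_recip_set_ss(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```

Only the low lane is read back by _mm_cvtss_f32, so the contents of the upper lanes do not affect the returned value.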


imallett asked Sep 21 '15 03:09



1 Answer

It is three-and-a-half years later and, while compilers have advanced and the situation has gotten better, they still do not output optimal code.

However, without dropping to raw assembly, we can still do better than intrinsics by using inline assembly. We have to be a bit careful; there is a significant penalty for switching between non-VEX-encoded and VEX-encoded instructions, so we need two codepaths.

This produces optimal results on GCC (9.0.1), Clang (9.0.0), and ICC (19.0.1.144). On MSVC (19.16), it only produces optimal results when inlined and not VEX-encoded (and this is probably as good as we can do, since MSVC doesn't support inline assembly on x86-64):

#include <xmmintrin.h>


inline float rsqrt_fast(float x) {
    #ifndef _MSC_VER //Optimal
        float result;
        asm( //Note AT&T order
            #ifdef __AVX__
            "vrsqrtss %1, %1, %0"
            #else
            "rsqrtss %1, %0"
            #endif
            : "=x"(result)
            : "x"(x)
        );
        return result;
    #else //TODO: not optimal when in AVX mode or when not inlined
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
    #endif
}
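
To sanity-check a wrapper like this, one option is to compare it against an exact reciprocal square root computed with sqrtss. The following is only a sketch for GCC or Clang on x86 (it does not cover the MSVC path); the names rsqrt_fast_sketch and rsqrt_max_rel_error and the sample range are assumptions chosen for illustration. Intel documents the maximum relative error of rsqrtss as 1.5 * 2^-12, so the worst error reported here should stay well below 1e-3.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_sqrt_ss, _mm_cvtss_f32 */

/* Hypothetical copy of the inline-assembly wrapper (GCC/Clang only). */
static inline float rsqrt_fast_sketch(float x) {
    float result;
    __asm__(  /* AT&T operand order: source(s) first, destination last */
#ifdef __AVX__
        "vrsqrtss %1, %1, %0"
#else
        "rsqrtss %1, %0"
#endif
        : "=x"(result)
        : "x"(x));
    return result;
}

/* Hypothetical helper: worst relative error of the approximation against
   1/sqrt(x) over an arbitrary geometric sweep of sample points. */
static float rsqrt_max_rel_error(void) {
    float worst = 0.0f;
    for (float x = 0.25f; x <= 1024.0f; x *= 1.0625f) {
        /* Reference value from the correctly rounded sqrtss instruction. */
        float exact = 1.0f / _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
        float diff = rsqrt_fast_sketch(x) - exact;
        if (diff < 0.0f) diff = -diff;
        float err = diff / exact;
        if (err > worst) worst = err;
    }
    return worst;
}
```

If that precision is not enough for a given use, the approximation is not a drop-in replacement for 1.0f / sqrtf(x).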
imallett answered Sep 29 '22 07:09
