I figured it was about time to use a fast reciprocal square root. So, I tried writing a function (which would be marked inline in production):
float sqrt_recip(float x) {
    return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ps1(x) ) ); //same as _mm_set1_ps
}
TL;DR: My question is "how can I get GCC and ICC to output minimal assembly (two instructions) for the above function, preferably without resorting to raw assembly (sticking with intrinsics)?"
As written, on ICC 13.0.1, GCC 5.2.0, and Clang 3.7 the output is:
shufps xmm0, xmm0, 0
rsqrtss xmm0, xmm0
ret
This makes sense, since I used _mm_set_ps1 to scatter x into all components of the register. But, I don't really need to do that. I'd prefer only doing the last two lines. Sure, shufps is only one cycle. But rsqrtss is only three to five. It's 20% to 33% overhead that's completely worthless.
Some things I tried:
I tried just not setting it:
union { __m128 v; float f[4]; } u;
u.f[0] = x;
return _mm_cvtss_f32(_mm_rsqrt_ss(u.v));
This actually works for Clang, but the output for ICC and GCC in particular is appalling.
Instead of scattering, you can fill with zeroes (that is, use _mm_set_ss). Again, neither GCC nor ICC's output is optimal. In GCC's case, GCC hilariously adds this:
movss DWORD PTR [rsp-12], xmm0
movss xmm0, DWORD PTR [rsp-12]
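For reference, here is the zero-filling variant written out as a complete function. This is only a sketch: the name sqrt_recip_zeroed is made up for illustration, and the intrinsics are the standard ones from <xmmintrin.h>.

```c
#include <xmmintrin.h>

/* Hypothetical name. Same idea as sqrt_recip, but _mm_set_ss puts x in the
   low lane and zeroes the upper three lanes instead of broadcasting x. */
float sqrt_recip_zeroed(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```

The result is the same approximation as before; the only difference is how the input register gets populated, which is exactly the part the compilers handle poorly.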
It is three-and-a-half years later and, while compilers have advanced and the situation has gotten better, they still do not output optimal code.
However, without dropping to raw assembly, we can still do better than intrinsics by using inline assembly. We have to be a bit careful; there is a significant penalty for switching between non-VEX-encoded and VEX-encoded instructions, so we need two codepaths.
This produces optimal results on GCC (9.0.1), Clang (9.0.0), and ICC (19.0.1.144). On MSVC (19.16) it produces optimal results only when inlined and not VEX-encoded (and this is probably as good as we can do, since MSVC doesn't support inline assembly on x86-64):
#include <xmmintrin.h>
inline float rsqrt_fast(float x) {
#ifndef _MSC_VER //Optimal
    float result;
    asm( //Note AT&T order
        #ifdef __AVX__
        "vrsqrtss %1, %1, %0"
        #else
        "rsqrtss %1, %0"
        #endif
        : "=x"(result)
        : "x"(x)
    );
    return result;
#else //TODO: not optimal when in AVX mode or when not inlined
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
#endif
}
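Since rsqrtss only gives an approximation (Intel documents a maximum relative error of 1.5 * 2^-12), it can be worth sanity-checking the result. The following is a rough sketch, not part of the solution itself: rsqrt_rel_error is a hypothetical helper, and rsqrt_fast is repeated (as static inline, with __asm__ instead of asm) only so the snippet also builds on its own as plain C with GCC or Clang.

```c
#include <xmmintrin.h>

/* Copy of rsqrt_fast from above, so this snippet compiles by itself. */
static inline float rsqrt_fast(float x) {
#ifndef _MSC_VER
    float result;
    __asm__( /* AT&T operand order: source first, destination last */
#ifdef __AVX__
        "vrsqrtss %1, %1, %0"
#else
        "rsqrtss %1, %0"
#endif
        : "=x"(result)
        : "x"(x)
    );
    return result;
#else
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
#endif
}

/* Hypothetical helper: relative error of the approximation for x > 0,
   measured against 1/sqrt(x) computed with the full-precision sqrtss. */
static float rsqrt_rel_error(float x) {
    float exact = 1.0f / _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
    float err = (rsqrt_fast(x) - exact) / exact;
    return err < 0.0f ? -err : err;
}
```

On hardware that meets the documented bound, rsqrt_rel_error(2.0f) should come out below roughly 3.7e-4. If that is not accurate enough for your use, the approximation alone is not a drop-in replacement for 1.0f / sqrtf(x).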