
Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!

I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}

I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that really have me scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }

The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}

This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
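(For context, the approximation referred to here is the well-known "fast inverse square root" bit trick from Quake III. A sketch from memory, using memcpy instead of pointer type-punning so it stays well-defined; multiplying the result by the input recovers an approximate sqrt. The error is roughly 1 part in 2^10, as noted above.)

#include <string.h>  // memcpy

inline float FastRsqrt( float x )
{
   float halfx = 0.5f * x;
   unsigned int i;
   memcpy( &i, &x, sizeof(i) );        // reinterpret the float's bits as an integer
   i = 0x5f3759df - ( i >> 1 );        // magic constant yields a rough 1/sqrt(x) guess
   float y;
   memcpy( &y, &i, sizeof(y) );
   y = y * ( 1.5f - halfx * y * y );   // one Newton-Raphson refinement
   return y;                           // approx 1/sqrt(x); multiply by x for sqrt(x)
}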

The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}

My question is basically: what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?

I'm sure that this is really the cost of the op itself, because I've verified:

  • All data fits in cache, and accesses are sequential
  • The functions are inlined
  • Unrolling the loop makes no difference
  • Compiler flags are set to full optimization (and the assembly is good, I checked)

(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

asked Oct 06 '09 by Crashworks

2 Answers

sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal square root, accurate to about 11 bits.

sqrtss generates a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
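(To make that sequence concrete, here is one sketch of rsqrtss plus a single Newton-Raphson refinement using intrinsics; the function name and exact arrangement are illustrative rather than taken from Intel's docs, and note it produces NaN for an input of exactly 0.)

#include <xmmintrin.h>

// Approximate sqrt(x): start from the ~11-bit rsqrtss estimate, refine with one
// Newton-Raphson step ( est' = 0.5 * est * (3 - x*est*est) ), then multiply by x.
inline void SSESqrt_RecipNR( float * pOut, float * pIn )
{
   __m128 x     = _mm_load_ss( pIn );
   __m128 est   = _mm_rsqrt_ss( x );                    // rsqrtss: ~11-bit 1/sqrt(x)
   __m128 half  = _mm_set_ss( 0.5f );
   __m128 three = _mm_set_ss( 3.0f );
   __m128 refined = _mm_mul_ss( _mm_mul_ss( half, est ),
                                _mm_sub_ss( three,
                                    _mm_mul_ss( x, _mm_mul_ss( est, est ) ) ) );
   _mm_store_ss( pOut, _mm_mul_ss( x, refined ) );      // sqrt(x) = x * 1/sqrt(x)
}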

edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
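(For illustration, a packed version of the question's test could look something like the following; the function name, the multiple-of-4 count, and the 16-byte alignment assumption are mine.)

#include <xmmintrin.h>

// Process four floats per iteration with sqrtps. Swap in _mm_rsqrt_ps (plus a
// multiply by v) if the approximation is acceptable.
void SqrtArray( float * pOut, const float * pIn, int count )
{
   for ( int i = 0; i < count; i += 4 )          // count assumed divisible by 4
   {
      __m128 v = _mm_load_ps( pIn + i );         // 16-byte aligned loads
      _mm_store_ps( pOut + i, _mm_sqrt_ps( v ) );
   }
}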

answered by Stephen Canon

There are a number of other answers to this already from a few years ago. Here's what the consensus got right:

  • The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
  • It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
  • The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
  • There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.

Here's what the consensus got wrong:

  • SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.

The N-R algorithm to compute reciprocal square root has this update step, as others have noted:

x' = 0.5 * x * (3 - n*x*x);

That's a lot of data-dependent multiplications and one subtraction.
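(Purely as a software illustration, with a made-up name and a fixed iteration count, that update step looks like this.)

// Refine an initial guess x ≈ 1/sqrt(n); each iteration roughly doubles
// the number of correct bits.
float rsqrt_refine( float n, float x, int iters )
{
   for ( int i = 0; i < iters; ++i )
      x = 0.5f * x * ( 3.0f - n * x * x );
   return x;
}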

What follows is the algorithm that modern FPUs actually use.

Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:

x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]

Clearly x[n] approaches sqrt(n) and y[n] approaches 1/sqrt(n).

We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:

b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])

Then:

x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]

and:

y[0] = Y[0]
y[i] = y[i-1] * Y[i]

The next key observation is that b[i] = x[i-1] * y[i-1] (multiplying the two running products gives b[0] * Y[0]^2 * ... * Y[i-1]^2, which is exactly b[i]). So:

Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
     = 1 + 0.5 * (1 - x[i-1] * y[i-1])

Then:

x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])

That is, given initial x and y, we can use the following update step:

r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r
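(In scalar code — purely illustrative, with names and a fixed iteration count of my own choosing — the coupled iteration looks like this, with x converging to sqrt(n) and y to 1/sqrt(n).)

// Start from an initial estimate y0 ≈ 1/sqrt(n), e.g. from rsqrtss.
void sqrt_coupled( float n, float y0, int iters, float * sqrtOut, float * rsqrtOut )
{
   float x = n * y0;                    // x[0] = n * Y[0]
   float y = y0;                        // y[0] = Y[0]
   for ( int i = 0; i < iters; ++i )
   {
      float r = 0.5f * ( 1.0f - x * y );
      x = x + x * r;
      y = y + y * r;
   }
   *sqrtOut  = x;                       // ≈ sqrt(n)
   *rsqrtOut = y;                       // ≈ 1/sqrt(n)
}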

Or, even fancier, we can set h = 0.5 * y. This is the initialisation:

Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5

And this is the update step:

r = 0.5 - x * h
x' = x + x * r
h' = h + h * r

This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.

In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit, otherwise SSE would not be very "streaming". Only one of each circuit was needed to implement this inner loop in a fully-pipelined way, without wasting a lot of hardware on square root alone.

Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
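(A sketch of that loop written with fmaf; the function name, taking the initial estimate Y as a parameter, and the explicit iteration count are my own illustrative choices — a real implementation would seed Y from something like rsqrtss.)

#include <math.h>   // fmaf

// Goldschmidt iteration with the h = 0.5*y substitution, written as three FMAs.
float goldschmidt_sqrt( float n, float Y, int iters )
{
   float x = Y * n;                     // converges to sqrt(n)
   float h = Y * 0.5f;                  // converges to 0.5 / sqrt(n)
   for ( int i = 0; i < iters; ++i )
   {
      float r = fmaf( -x, h, 0.5f );    // r  = 0.5 - x*h
      x = fmaf( x, r, x );              // x' = x + x*r   (independent of h')
      h = fmaf( h, r, h );              // h' = h + h*r   (independent of x')
   }
   return x;
}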

answered by Pseudonym