Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to roll a significantly faster version of sqrt

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.

I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.

MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead though.

See also here for similar discussion on modf function.

EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?

like image 679
Mr. Boy Avatar asked Apr 14 '10 13:04

Mr. Boy


People also ask

Is fast inverse square root still faster?

AVX_InvSqrt runs ~123% faster than our benchmark. Even though, _mm256_rsqrt_ps intrinsic function can approximately compute reciprocal square root for 8 float values at the same time, for a single float value, it is 4% slower than Fast_InvSqrt.

Is square root still slow?

The interviewer is right. sqrt() is "slow," not slow. Fundamentally, the operation is more complex than an addition or division.

Why is square root slow?

The square root function uses Newton's method to iteratively calculate the square root. It converges quadratically. Nothing will speed that up. The other functions, exp() and ln(x), have implementations that have their own convergence/complexity issues.

How do you use sqrt in C++?

C++ sqrt() The sqrt() function in C++ returns the square root of a number. This function is defined in the cmath header file. Mathematically, sqrt(x) = √x .


1 Answers

Yes, it is possible even without trickery:

  1. sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.

  2. lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.

  3. caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.


Hello from 11 years in the future.

Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.

  1. On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.

  2. Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.

like image 136
James Avatar answered Sep 21 '22 21:09

James