In the past few weeks I have been wondering what the point is of people reinventing the wheel and spending hours writing their own sqrt function. The built-in version is well optimized, precise, and stable enough.
I am speaking about the Carmack-style square root, for example. What is the point? It loses precision during the approximation, and it relies on type punning through casts.
The Intel-style SSE square root gave precise results, but in my measurements it was slower than the standard sqrt.
On average, all the above tricks were beaten by far by the standard sqrt. So my question is: what is the point?
My PC has the following CPU:
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz.
I got the following results for each method (I've fixed the performance test according to the suggestion in the helpful comment below; thanks for that, n.m.):
(Please keep in mind that if you use an approximation like Newton's method, you will lose precision, so you must align your calculations accordingly.)
You can find the source code below for reference.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <deque>
#include <immintrin.h>
#include <iomanip>
#include <iostream>
#include <random>
using f64 = double;
using s64 = int64_t;
using u64 = uint64_t;
static constexpr u64 cycles = 24;
static constexpr u64 sample_max = 1000000;
f64 sse_sqrt(const f64 x) {
    // _mm_load_pd would read two doubles from the address of a single one
    // (an out-of-bounds read that also requires 16-byte alignment); load the
    // one value and use the scalar square-root instruction instead.
    const __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}
f64 carmack_sqrt(const f64 x) {
    // Type punning through a union; widely supported by compilers, but
    // formally undefined behavior in C++ (std::bit_cast or memcpy is the
    // well-defined route). Also not valid in a constexpr context.
    union {
        f64 x;
        s64 i;
    } u = {};
    u.x = x;
    u.i = 0x5fe6eb50c7b537a9 - (u.i >> 1);  // magic constant for doubles
    f64 xhalf = 0.5 * x;
    u.x = u.x * (1.5 - xhalf * u.x * u.x);  // one Newton-Raphson step
    // u.x = u.x * (1.5 - xhalf * u.x * u.x);
    // ... repeat for a more precise result ...
    return u.x * x;  // u.x ~ 1/sqrt(x), so u.x * x ~ sqrt(x)
}
int main(int /* argc */, char** /* argv */) {
    std::random_device r;
    std::default_random_engine e(r());
    std::uniform_real_distribution<f64> dist(1, sample_max);
    std::deque<f64> samples(sample_max);
    for (auto& sample : samples) {
        sample = dist(e);
    }
    // std sqrt
    {
        std::cout << "> Measuring std sqrt.\r\n> Please wait . . .\r\n";
        f64 result = 0;
        auto t1 = std::chrono::high_resolution_clock::now();
        for (u64 cycle = 0; cycle < cycles; ++cycle) {
            for (auto& sample : samples) {
                result += std::sqrt(sample);
            }
        }
        auto t2 = std::chrono::high_resolution_clock::now();
        auto dt = t2 - t1;
        std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
        std::cout << "> Total execution time: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(dt).count()
                  << " ms.\r\n\r\n";
    }
    // sse sqrt
    {
        std::cout << "> Measuring sse sqrt.\r\n> Please wait . . .\r\n";
        f64 result = 0;
        auto t1 = std::chrono::high_resolution_clock::now();
        for (u64 cycle = 0; cycle < cycles; ++cycle) {
            for (auto& sample : samples) {
                result += sse_sqrt(sample);
            }
        }
        auto t2 = std::chrono::high_resolution_clock::now();
        auto dt = t2 - t1;
        std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
        std::cout << "> Total execution time: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(dt).count()
                  << " ms.\r\n\r\n";
    }
    // carmack sqrt
    {
        std::cout << "> Measuring carmack sqrt.\r\n> Please wait . . .\r\n";
        f64 result = 0;
        auto t1 = std::chrono::high_resolution_clock::now();
        for (u64 cycle = 0; cycle < cycles; ++cycle) {
            for (auto& sample : samples) {
                result += carmack_sqrt(sample);
            }
        }
        auto t2 = std::chrono::high_resolution_clock::now();
        auto dt = t2 - t1;
        std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
        std::cout << "> Total execution time: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(dt).count()
                  << " ms.\r\n\r\n";
    }
    std::cout << "> Press any key to exit . . .\r\n";
    std::getchar();
    return 0;
}
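As a side note, the precision loss mentioned above can be quantified with a small check like the following sketch; check_precision is just an illustrative helper, not part of the benchmark, and compares carmack_sqrt against std::sqrt:

// Illustrative helper (not part of the benchmark above): prints the
// relative error of carmack_sqrt versus std::sqrt for a few inputs,
// which shows how many Newton steps a given use case needs.
void check_precision() {
    for (f64 x : {2.0, 10.0, 12345.0, 1000000.0}) {
        const f64 exact = std::sqrt(x);
        const f64 approx = carmack_sqrt(x);
        std::cout << "x = " << x << "  relative error = "
                  << std::fabs(approx - exact) / exact << "\n";
    }
}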
Please note that I am not here to criticize anybody; I am just here to learn, experiment, and figure out my own methods and the best toolset to choose from.
I am writing my own game engine for my portfolio. I appreciate your kind answers and am open to any suggestions.
Have a nice day.
That fast reciprocal square root trick is mostly obsolete. SSE's built-in approximate reciprocal square root, which has existed since the Pentium III, has completely replaced it on the PC platform. Other platforms usually have their own reciprocal square root estimate; for example, ARM has VRSQRTE and a handy companion instruction (VRSQRTS) that does the Newton step too.
By the way, turning the result into a non-reciprocal square root usually makes it less useful: the primary use case is normalizing a vector, where a "straight" square root is annoying (you would have to divide by it) while a reciprocal square root fits exactly (it becomes a multiply).
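For illustration, here is a minimal scalar sketch of that modern replacement (my own example, not taken from any particular library): the hardware estimate via _mm_rsqrt_ss, optionally refined with one Newton step:

#include <immintrin.h>

// Approximate 1/sqrt(x) with the SSE hardware estimate (~12 bits of
// accuracy), optionally refined by one Newton step to roughly 22 bits.
// x must be positive.
inline float fast_rsqrt(float x, bool refine = true) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    if (refine) {
        r = r * (1.5f - 0.5f * x * r * r);  // one Newton-Raphson step
    }
    return r;
}

Normalizing a vector is then just a multiply of each component by fast_rsqrt of the squared length, which is exactly the use case described above.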
As is often the case, your benchmark isn't quite accurate. I happen to have done some relevant tests a while ago, where the relevant parts look like this:
std::sqrt based:
HMM_INLINE float HMM_LengthVec4(hmm_vec4 A)
{
    float Result = std::sqrt(HMM_LengthSquaredVec4(A));
    return (Result);
}

HMM_INLINE hmm_vec4 HMM_NormalizeVec4(hmm_vec4 A)
{
    hmm_vec4 Result = {0};
    float VectorLength = HMM_LengthVec4(A);
    /* NOTE(kiljacken): We need a zero check to not divide-by-zero */
    if (VectorLength != 0.0f)
    {
        float Multiplier = 1.0f / VectorLength;
#ifdef HANDMADE_MATH__USE_SSE
        __m128 SSEMultiplier = _mm_set1_ps(Multiplier);
        Result.InternalElementsSSE = _mm_mul_ps(A.InternalElementsSSE, SSEMultiplier);
#else
        Result.X = A.X * Multiplier;
        Result.Y = A.Y * Multiplier;
        Result.Z = A.Z * Multiplier;
        Result.W = A.W * Multiplier;
#endif
    }
    return (Result);
}
SSE reciprocal square root plus Newton step:
HMM_INLINE hmm_vec4 HMM_NormalizeVec4_new(hmm_vec4 A)
{
    hmm_vec4 Result;
    // square elements and add them together, result is in every lane
    __m128 t0 = _mm_mul_ps(A.InternalElementsSSE, A.InternalElementsSSE);
    __m128 t1 = _mm_add_ps(t0, _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 sq = _mm_add_ps(t1, _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 1, 2, 3)));
    // compute reciprocal square root with Newton step for ~22bit accuracy
    __m128 rLen = _mm_rsqrt_ps(sq);
    __m128 half = _mm_set1_ps(0.5f);
    __m128 threehalf = _mm_set1_ps(1.5f);
    __m128 t = _mm_mul_ps(_mm_mul_ps(sq, half), _mm_mul_ps(rLen, rLen));
    rLen = _mm_mul_ps(rLen, _mm_sub_ps(threehalf, t));
    // multiply elements by the reciprocal of the vector length
    __m128 normed = _mm_mul_ps(A.InternalElementsSSE, rLen);
    // normalize zero-vector to zero, not to NaN
    __m128 zero = _mm_setzero_ps();
    Result.InternalElementsSSE = _mm_andnot_ps(_mm_cmpeq_ps(A.InternalElementsSSE, zero), normed);
    return (Result);
}
SSE reciprocal square root without Newton step:
HMM_INLINE hmm_vec4 HMM_NormalizeVec4_lowacc(hmm_vec4 A)
{
    hmm_vec4 Result;
    // square elements and add them together, result is in every lane
    __m128 t0 = _mm_mul_ps(A.InternalElementsSSE, A.InternalElementsSSE);
    __m128 t1 = _mm_add_ps(t0, _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 sq = _mm_add_ps(t1, _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 1, 2, 3)));
    // compute reciprocal square root without Newton step for ~12bit accuracy
    __m128 rLen = _mm_rsqrt_ps(sq);
    // multiply elements by the reciprocal of the vector length
    __m128 normed = _mm_mul_ps(A.InternalElementsSSE, rLen);
    // normalize zero-vector to zero, not to NaN
    __m128 zero = _mm_setzero_ps();
    Result.InternalElementsSSE = _mm_andnot_ps(_mm_cmpeq_ps(A.InternalElementsSSE, zero), normed);
    return (Result);
}
(quick-bench)
As you can see, I measured throughput and latency separately, and the distinction mattered a lot. Reciprocal square root with a Newton step takes a long time from start to finish, about as long as using a normal square root, but can be processed at a higher throughput. Without the Newton step, a single vector-normalize operation also takes less time from start to finish, and the throughput becomes even better. In any case, this should demonstrate that there is some point to doing something about your square roots.
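If you want to reproduce the throughput/latency distinction yourself, the usual approach (a sketch of the idea, not my exact harness) is to time a dependent chain for latency and independent operations for throughput, wrapping each function in the same chrono timing as in the question:

#include <cmath>

// Latency: each sqrt consumes the previous result, so the CPU cannot
// overlap them; elapsed time / n approximates the latency of one operation.
double sqrt_latency_chain(double x, int n) {
    for (int i = 0; i < n; ++i) {
        x = std::sqrt(x + 1.0);  // feed the result back so each step waits for the last
    }
    return x;
}

// Throughput: the iterations are independent, so the CPU can keep several
// square roots in flight; elapsed time / n approximates reciprocal throughput.
double sqrt_throughput(const double* xs, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += std::sqrt(xs[i]);
    }
    return sum;
}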
By the way, the HMM_NormalizeVec4 code above is not meant to be Good Practice; that would be normalizing 4 vectors simultaneously, so as to not waste a 4-wide SIMD operation on calculating a single (reciprocal) square root. That's not really the issue here, though; see the sketch below.
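For the curious, here is a sketch of that batch version (my own illustration with a hypothetical name, assuming a structure-of-arrays layout and omitting the zero-vector guard for brevity):

#include <immintrin.h>

// Normalize four 4D vectors at once, stored structure-of-arrays:
// x holds the X components of all four vectors, and so on. Every lane
// of the 4-wide rsqrt now does useful work.
inline void NormalizeVec4_x4(__m128& x, __m128& y, __m128& z, __m128& w)
{
    __m128 sq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                           _mm_add_ps(_mm_mul_ps(z, z), _mm_mul_ps(w, w)));
    __m128 rLen = _mm_rsqrt_ps(sq);
    // one Newton step, the same refinement as above
    __m128 t = _mm_mul_ps(_mm_mul_ps(sq, _mm_set1_ps(0.5f)),
                          _mm_mul_ps(rLen, rLen));
    rLen = _mm_mul_ps(rLen, _mm_sub_ps(_mm_set1_ps(1.5f), t));
    x = _mm_mul_ps(x, rLen);
    y = _mm_mul_ps(y, rLen);
    z = _mm_mul_ps(z, rLen);
    w = _mm_mul_ps(w, rLen);
}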
For fun and profit?
Based on your question, there is no practical reason to do so. But if you want to learn a language, it is a good exercise to solve mathematical problems, because they rely on integers/floats (which are primitives in nearly any language) and the algorithms are well documented.
In "real" code, you should use the methods provided by the libc as long as you have one. Embedded platforms usually lack a libc or roll their own, so you have to implement your own.