Beyond3D published two articles by Rys Sommefeldt on exactly this topic in 2006 and 2007: part 1 and part 2. According to these, the fast inverse square root algorithm was invented in the late eighties by Greg Walsh, inspired by Cleve Moler.
Many people have also asked if it's still useful on modern processors. The most common answer is no, that this hack was useful 20+ years ago and that modern-day processors have better and faster ways to approximate the inverse square root.
FYI. Carmack didn't write it. Terje Mathisen and Gary Tarolli both take partial (and very modest) credit for it, as well as crediting some other sources.
How the mythical constant was derived is something of a mystery.
To quote Gary Tarolli:
Which actually is doing a floating point computation in integer - it took a long time to figure out how and why this works, and I can't remember the details anymore.
A slightly better constant, developed by an expert mathematician (Chris Lomont) while trying to work out how the original algorithm worked, is:
float InvSqrt(float x)
{
float xhalf = 0.5f * x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f375a86 - (i >> 1); // gives initial guess y0
x = *(float*)&i; // convert bits back to float
x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
return x;
}
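As a quick sanity check (my own snippet, not from the original sources), you can compare it against the standard library; with one Newton step the two typically agree to three or four significant figures:

#include <math.h>
#include <stdio.h>

// assumes the InvSqrt() above is defined in the same file
int main(void)
{
    float x = 2.0f;
    printf("InvSqrt(x)    = %.7f\n", InvSqrt(x));
    printf("1.0f/sqrtf(x) = %.7f\n", 1.0f / sqrtf(x));
    return 0;
}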
In spite of this, his initial attempt at a mathematically 'superior' version of id's sqrt (which came to almost the same constant) proved inferior to the one originally developed by Gary, despite being mathematically much 'purer'. He couldn't explain why id's was so excellent, IIRC.
Of course these days, it turns out to be much slower than just using an FPU's sqrt (especially on 360/PS3), because swapping between float and int registers induces a load-hit-store, while the floating point unit can do reciprocal square root in hardware.
It just shows how optimizations have to evolve as the nature of underlying hardware changes.
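For example, on x86 the same kind of estimate is exposed directly as a hardware instruction through the SSE intrinsics (a minimal sketch assuming an SSE-capable compiler, not part of the original code):

#include <xmmintrin.h>

float InvSqrtHW(float x)
{
    // RSQRTSS: hardware reciprocal square root estimate (roughly 12 bits of
    // precision); follow it with a Newton step if more accuracy is needed
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}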
Greg Hewgill and IllidanS4 gave a link with an excellent mathematical explanation. I'll try to sum it up here for those who don't want to go too deep into the details.
Any mathematical function, with some exceptions, can be represented by a polynomial sum:
y = f(x)
can be exactly transformed into:
y = a0 + a1*x + a2*(x^2) + a3*(x^3) + a4*(x^4) + ...
Where a0, a1, a2, ... are constants. The problem is that for many functions, like the square root, this sum has an infinite number of terms for the exact value; it does not end at some x^n. But if we stop at some x^n, we still get a result up to some precision.
So, if we have:
y = 1/sqrt(x)
In this particular case they decided to discard all polynomial terms above the second, probably for the sake of calculation speed:
y = a0 + a1*x + [...discarded...]
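(As an illustration of how far such a short truncation can go, and not part of the original explanation: expanding 1/sqrt(x) around x = 1 gives 1 - 0.5*(x - 1) + 0.375*(x - 1)^2 - ..., and stopping after the linear term at x = 1.21 gives 0.895, while the exact value is 1/1.1 ≈ 0.909.)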
And the task now comes down to calculating a0 and a1 so that y has the least difference from the exact value. They calculated that the most appropriate values are:
a0 = 0x5f375a86
a1 = -0.5
So when you put this into equation you get:
y = 0x5f375a86 - 0.5*x
Which is the same as the line you see in the code:
i = 0x5f375a86 - (i >> 1);
Edit: actually, here y = 0x5f375a86 - 0.5*x
is not the same as i = 0x5f375a86 - (i >> 1);
since shifting a float's bits as an integer not only divides the value by two but also halves the exponent and causes some other artifacts. Still, it comes down to calculating some coefficients a0, a1, a2, ... .
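To get a feel for how good that raw guess already is, here is a small sketch (mine, not the original code; it uses memcpy instead of pointer casts to avoid aliasing issues) that prints the guess before any Newton step next to the exact value:

#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 2.0f, guess;
    int i;
    memcpy(&i, &x, sizeof i);   // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);  // magic-constant guess, no Newton step yet
    memcpy(&guess, &i, sizeof guess);
    printf("guess = %f, exact = %f\n", guess, 1.0f / sqrtf(x));
    return 0;
}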
At this point they found that this result's precision was not enough for the purpose. So they additionally did just one step of Newton's iteration to improve the accuracy of the result:
x = x * (1.5f - xhalf * x * x)
They could have done some more iterations in a loop, each one improving the result, until the required accuracy was met. This is essentially how a CPU/FPU works! But it seems that only one iteration was enough, which was also a blessing for speed. The CPU/FPU does as many iterations as needed to reach the accuracy of the floating-point format the result is stored in, and it has a more general algorithm that works for all cases.
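As a sketch of that idea (my own illustration, not the original routine), the Newton step can simply be placed in a loop; each extra pass roughly doubles the number of correct digits until float precision is exhausted:

#include <string.h>

float InvSqrtIter(float x, int steps)
{
    float xhalf = 0.5f * x, y;
    int i;
    memcpy(&i, &x, sizeof i);        // bits of x as an integer
    i = 0x5f3759df - (i >> 1);       // magic-constant initial guess
    memcpy(&y, &i, sizeof y);
    for (int n = 0; n < steps; n++)  // repeated Newton-Raphson refinement
        y = y * (1.5f - xhalf * y * y);
    return y;
}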
So in short, what they did is:
Use (almost) the same algorithm as the CPU/FPU, exploit a better initial guess for the special case of 1/sqrt(x), and don't calculate all the way to the precision the CPU/FPU would reach, but stop earlier, thus gaining calculation speed.
I was curious to see what the constant was as a float so I simply wrote this bit of code and googled the integer that popped out.
int i = 0x5F3759DF;      // 32-bit integer, so the bit pattern lines up with a 32-bit float
float* fp = (float*)&i;
printf("(2^127)^(1/2) = %f\n", *fp);
//Output
//(2^127)^(1/2) = 13211836172961054720.000000
It looks like the constant is "An integer approximation to the square root of 2^127 better known by the hexadecimal form of its floating-point representation, 0x5f3759df" https://mrob.com/pub/math/numbers-18.html
On the same site it explains the whole thing. https://mrob.com/pub/math/numbers-16.html#le009_16
According to this nice article written a while back...
The magic of the code, even if you can't follow it, stands out as the i = 0x5f3759df - (i>>1); line. Simplified, Newton-Raphson is an approximation that starts off with a guess and refines it with iteration. Taking advantage of the nature of 32-bit x86 processors, i, an integer, is initially set to the value of the floating point number you want to take the inverse square of, using an integer cast. i is then set to 0x5f3759df, minus itself shifted one bit to the right. The right shift drops the least significant bit of i, essentially halving it.
It's a really good read. This is only a tiny piece of it.