Is there some way to speed up computing the reciprocal (1 over X), if precision is not crucial?
So, I need to calculate 1/X. Is there some workaround that trades precision for speed?
I believe the asker was looking for a more efficient way to approximate 1.0/x, rather than some technical definition of approximation under which you could return 1 as a very imprecise answer. I also believe that the following satisfies that.
#ifdef __cplusplus
#include <cstdint>
#else
#include <stdint.h>
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    union {
        double dbl;
#ifdef __cplusplus
        std::uint_least64_t ull;
#else
        uint_least64_t ull;
#endif
    } u;
    u.dbl = x;
    u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> 1;
                            // pow( x, -0.5 )
    u.dbl *= u.dbl;         // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.dbl;
}

__inline__ float __attribute__((const)) reciprocal( float x ) {
    union {
        float single;
#ifdef __cplusplus
        std::uint_least32_t uint;
#else
        uint_least32_t uint;
#endif
    } u;
    u.single = x;
    u.uint = ( 0xbe6eb3beU - u.uint ) >> 1;
                                // pow( x, -0.5 )
    u.single *= u.single;       // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.single;
}
Hmm....... I wonder if the CPU manufacturers knew you could approximate the reciprocal with only a single multiply, subtraction, and bit-shift when they designed the CPU.... hmm.........
As for benchmarking: the hardware multiply instruction combined with the hardware subtraction instruction is about as fast as the hardware 1.0/x division instruction on modern-day computers (my benchmarks were on an Intel i7, but I would assume similar results for other processors). However, if this algorithm were implemented in hardware as a new assembly instruction, the increase in speed would probably be good enough to make the instruction quite practical.
For more information about this method: this implementation is based on the wonderful "fast" inverse square root algorithm.
As Pharap brought to my attention, reading an inactive member of a union is undefined behavior in C++, so there are two possible solutions that I have devised from his helpful comments to avoid it. The first solution seems more like an unsavory trick to get around a language semantic, and is practically no better than the original solution.
#ifdef __cplusplus
#include <cstdint>
#else
#include <stdint.h>
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    union {
        double dbl[2];
#ifdef __cplusplus
        std::uint_least64_t ull[2];
#else
        uint_least64_t ull[2];
#endif
    } u;
    u.dbl[0] = x;  // dbl is now the active member, so only dbl can be read
    u.ull[1] = 0;  // trick to make ull the active member so that ull can be read
    u.ull[0] = ( 0xbfcdd6a18f6a6f52ULL - u.ull[0] ) >> 1;
    u.dbl[1] = 0;  // now make dbl the active member again so that it can be read
    u.dbl[0] *= u.dbl[0];
    return u.dbl[0];
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
    union {
        float flt[2];
#ifdef __cplusplus
        std::uint_least32_t uint[2];
#else
        uint_least32_t uint[2];
#endif
    } u;
    u.flt[0] = x;   // now flt is the active member
    u.uint[1] = 0;  // make uint the active member for reading and writing
    u.uint[0] = ( 0xbe6eb3beU - u.uint[0] ) >> 1;
    u.flt[1] = 0;   // make flt the active member for reading and writing
    u.flt[0] *= u.flt[0];
    return u.flt[0];
}
The second possible solution is much more palatable because it gets rid of unions altogether. However, it will be much slower if it is not properly optimized by the compiler. On the upside, the solution below is completely byte-order agnostic:
#ifdef __cplusplus
#include <cstdint>
#include <cstring>
#define stdIntWithEightBits std::uint8_t
#define stdIntSizeOfFloat std::uint32_t
#define stdIntSizeOfDouble std::uint64_t
#else
#include <stdint.h>
#include <string.h>
#define stdIntWithEightBits uint8_t
#define stdIntSizeOfFloat uint32_t
#define stdIntSizeOfDouble uint64_t
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    // the bytes of this double are the shift amounts 00 08 10 18 20 28 30 38 (hex)
    double byteIndexFloat = 1.1212798184631136e-308;
    stdIntWithEightBits* byteIndexs = reinterpret_cast<stdIntWithEightBits*>(&byteIndexFloat);
    stdIntWithEightBits* inputBytes = reinterpret_cast<stdIntWithEightBits*>(&x);
    // widen each byte to 64 bits before shifting: after integer promotion,
    // shifting a uint8_t by more than 31 bits would be undefined behavior
    stdIntSizeOfDouble inputAsUll = (
        ((stdIntSizeOfDouble)inputBytes[0] << byteIndexs[0]) |
        ((stdIntSizeOfDouble)inputBytes[1] << byteIndexs[1]) |
        ((stdIntSizeOfDouble)inputBytes[2] << byteIndexs[2]) |
        ((stdIntSizeOfDouble)inputBytes[3] << byteIndexs[3]) |
        ((stdIntSizeOfDouble)inputBytes[4] << byteIndexs[4]) |
        ((stdIntSizeOfDouble)inputBytes[5] << byteIndexs[5]) |
        ((stdIntSizeOfDouble)inputBytes[6] << byteIndexs[6]) |
        ((stdIntSizeOfDouble)inputBytes[7] << byteIndexs[7])
    );
    inputAsUll = ( 0xbfcdd6a18f6a6f52ULL - inputAsUll ) >> 1;
    double outputDouble;
    // explicit casts avoid narrowing errors in C++ brace initialization
    const stdIntWithEightBits outputBytes[] = {
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[0] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[1] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[2] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[3] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[4] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[5] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[6] ),
        (stdIntWithEightBits)( inputAsUll >> byteIndexs[7] )
    };
    memcpy( &outputDouble, &outputBytes, 8 );
    return outputDouble * outputDouble;
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
    // the bytes of this float are the shift amounts 00 08 10 18 (hex)
    float byteIndexFloat = 7.40457e-40;
    stdIntWithEightBits* byteIndexs = reinterpret_cast<stdIntWithEightBits*>(&byteIndexFloat);
    stdIntWithEightBits* inputBytes = reinterpret_cast<stdIntWithEightBits*>(&x);
    // widen each byte before shifting, as in the double version above
    stdIntSizeOfFloat inputAsInt = (
        ((stdIntSizeOfFloat)inputBytes[0] << byteIndexs[0]) |
        ((stdIntSizeOfFloat)inputBytes[1] << byteIndexs[1]) |
        ((stdIntSizeOfFloat)inputBytes[2] << byteIndexs[2]) |
        ((stdIntSizeOfFloat)inputBytes[3] << byteIndexs[3])
    );
    inputAsInt = ( 0xbe6eb3beU - inputAsInt ) >> 1;
    float outputFloat;
    const stdIntWithEightBits outputBytes[] = {
        (stdIntWithEightBits)( inputAsInt >> byteIndexs[0] ),
        (stdIntWithEightBits)( inputAsInt >> byteIndexs[1] ),
        (stdIntWithEightBits)( inputAsInt >> byteIndexs[2] ),
        (stdIntWithEightBits)( inputAsInt >> byteIndexs[3] )
    };
    memcpy( &outputFloat, &outputBytes, 4 );
    return outputFloat * outputFloat;
}
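For completeness, the now-standard way to do this kind of type punning without unions or manual byte shuffling is memcpy (or std::bit_cast in C++20), which is well-defined in both C and C++ and which compilers optimize down to a plain register move. A minimal sketch of the float version (the function name `reciprocal_memcpy` is mine):

```cpp
#include <cstdint>
#include <cstring>

// Well-defined type punning via memcpy; the copies compile away entirely.
static float reciprocal_memcpy(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);      // float bits -> integer
    i = (0xbe6eb3beU - i) >> 1;         // approximates pow(x, -0.5f)
    float y;
    std::memcpy(&y, &i, sizeof y);      // integer -> float
    return y * y;                       // squared -> approximates 1.0f / x
}
```

This keeps the exact same magic-constant arithmetic as the union version, so the accuracy is identical; only the (previously undefined) bit-reinterpretation is replaced.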
Disclaimer: Lastly, please note that I am more of a novice in C++. As such, I welcome with wide open arms any best-practices, proper-formatting, or implication-clarity edits to the end of both improving the quality of this answer for all who read it and expanding my knowledge of C++ for all my years to come.
First, make sure this isn't a case of premature optimization. Do you know that this is your bottleneck?
As Mystical says, 1/x can be calculated very quickly. Make sure you're not using the double datatype for either the 1 or the divisor. Floats are much faster.
That said, benchmark, benchmark, benchmark. Don't waste your time spending hours on numerical theory just to discover the source of the poor performance is IO access.
First of all, if you turn on compiler optimizations, the compiler is likely to optimize the calculation if possible (to pull it out of a loop, for example). To see this optimization, you need to build and run in Release mode.
Division may be heavier than multiplication (though a commenter pointed out that reciprocals are just as fast as multiplication on modern CPUs, in which case this isn't correct for your case), so if you do have 1/X appearing somewhere inside a loop (and more than once), you can assist by hoisting the division out of the loop (float Y = 1.0f/X;) and then using Y inside it. (The compiler optimization might do this in any case.)
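That hoisting can be sketched as follows (the function and variable names are mine): one division up front replaces a division per element.

```cpp
#include <cstddef>

// Hoist the division out of the loop: one divide, then only multiplies.
void scale_by_inverse(float* data, std::size_t n, float X) {
    const float Y = 1.0f / X;        // computed once, outside the loop
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= Y;                // multiply instead of divide per element
}
```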
Also, certain formulas can be redesigned to remove division or other inefficient computations; for that, you could post the larger computation being performed. Even there, the program or algorithm itself can sometimes be restructured so that time-consuming loops are not hit as frequently.
How much accuracy can be sacrificed? If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.
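For the bitwise route, a hedged sketch (the helper name is mine): the binary order of magnitude of a normal, positive float is just its IEEE-754 exponent field, readable with a shift and a mask.

```cpp
#include <cstdint>
#include <cstring>

// Binary order of magnitude of a positive, normal float: the unbiased
// exponent, extracted directly from the IEEE-754 bit layout.
int order_of_magnitude(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (int)((bits >> 23) & 0xFFu) - 127;  // 8-bit exponent field, bias 127
}
```

For example, order_of_magnitude(1024.0f) yields 10 and order_of_magnitude(0.5f) yields -1, so a rough reciprocal magnitude is just the negated exponent.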
However, in general, there's no way to speed up division. If there were, compilers would already be doing it.
I’ve bench-tested these methods on an Arduino NANO for speed and 'accuracy'.
The basic calculation was to set up the variables Y = 15,000,000 and Z = 65,535
(in my real case, Y is a constant and Z can vary between 65,535 and 3,000, so a useful test).
Calculation time on the Arduino was established by setting a pin low, driving it high while the calculation was made, then setting it low again, and comparing the on-times with a logic analyser, over 100 cycles.
With variables as unsigned integers:-
Y * Z takes 0.231 msec
Y / Z takes 3.867 msec.
With variables as floats:-
Y * Z takes 1.066 msec
Y / Z takes 4.113 msec.
Basic benchmark: 15,000,000 / 65,535 = 228.885 (via calculator).
Using {Jack G’s} float reciprocal algorithm:
Y * reciprocal(Z) takes 1.937 msec, which is a good improvement, but the accuracy less so: 213.68.
Using {nimig18’s} float inv_fast:
Y * inv_fast(Z) takes 5.501 msec, accuracy 228.116 with a single iteration
Y * inv_fast(Z) takes 7.895 msec, accuracy 228.883 with a second iteration
Using Wikipedia's Q_rsqrt (pointed to by {Jack G})
Y * Q_rsqrt(Z) takes 6.104 msec, accuracy 228.116 with a single iteration
All entertaining but ultimately disappointing!