In an algorithm I am using, most of the time is spent in the evaluation of square roots using the sqrt
function in cmath
. In my application I only need the integer part of the square root of integers, for which sqrt
seems overly complex, and I tried to do a fast implementation following the manual algorithm for calculating square roots, which becomes very easy in binary, to get
const int BITLEN = sizeof(unsigned int) * 8;
bool DUMMYBOOL;
// digit in base 4 (two bits) starting at bit position i in n
inline int digit(unsigned int n, int i)
{
return (n >> (BITLEN - i - 2)) & 3;
}
inline int isqrt(unsigned int n, bool* isSquare = &DUMMYBOOL)
{
int answer = 0;
int remaining = 0;
for (int i = 0; i < BITLEN; i += 2)
{
remaining = (remaining << 2) + digit(n,i);
answer <<= 1;
int correction = (answer << 1) + 1;
if (correction <= remaining)
{
remaining -= correction;
answer++;
}
}
*isSquare = remaining == 0;
return answer;
}
(Note that it also tells me whether the number was a perfect square). This is fast, but still not as fast as the builtin sqrt
. From what I understand, the sqrt
is likely directly implemented in hardware (x86_64 with gcc on Linux), and will be hard to beat, and I'm OK with that.
I am still curious though if this function could be optimized further. One thing I tried was to adapt the BITLEN
to the size of n
using __builtin_clz
, but that actually seems to slow it down, even if on average the numbers only have 24 bits or so.
A few words about lookup tables with pre-computed values of interest. In this use case, one defines the following
template<int NBITS>
struct ISqrt_LUT
{
static_assert (NBITS<=32);
static const auto Nbits = NBITS;
static const auto Nmax = 1UL<<NBITS;
ISqrt_LUT() : lut(Nmax) {
for (std::size_t i=0; i<lut.size(); i++) { lut[i] = sqrt(i); }
}
auto operator() (std::uint32_t n) const {
return lut[n];
}
std::vector<std::uint16_t> lut;
};
where all values up to 1<<NBITS
are computed only once and then can be retrieved as many times as required through the operator()
method.
Demo
One could think that one can't do better than a lookup table since the values are already computed. For instance, if we try all successive values from 1
to 1<<NBITS
, one gets the following exec times (on my system)
[native / contiguous values] duration (ms): 162
[lookup table / contiguous values] duration (ms): 40
where native
refers to a call to the sqrt
function. In such a case, one gets a real speedup with a LUT.
OTOH, when the values are random, one gets
[native / shuffle values ] duration (ms): 163
[lookup table / shuffle values ] duration (ms): 476
so the LUT takes more time than the native approach: indeed, one gets many costly cache misses which makes the LUT not interesting in that case.
So, LUT are sometimes an interesting tool but can be deceptive when they are too big to fit L1 (or L2, L3) cache.
consteval
instead of const
Declaring a variable as const
or constexpr
does not guarantie that it will be evaluated at compile time. Use consteval
for that.
consteval int BITLEN() { return sizeof(unsigned int) * 8; }
consteval int BITLEN_M2() { return BITLEN - 2; }
And then remember to change the places you use BITLEN
to BITLEN()
.
digit()
and compute it directlyDeclaring a function as inline
does not guarantee that it will be inlined. And avoid subtracting twice, by using BITLEN_M2
instead of BITLEN
. To achive this, change remaining = (remaining << 2) + digit(n,i)
to remaining = (remaining << 2) | ((n >> (BITLEN_M2() - i)) & 3)
.
Change (answer << 1) + 1
to (answer << 1) | 1
.
You assign to answer
twice, once before a conditional, and then inside a conditional. By introducing a local variable and only assigning to answer
inside the conditional, you could get a possible speed-up.
DUMMYBOOL
Make the 2nd parameter not-optional, to prevent the default assignment of a parameter (to a global variable that isn't constant, which may prevent compiler optimization) and remember to never call the function with the 2nd parameter equal to null
.
consteval int BITLEN() { return sizeof(unsigned int) * 8; }
consteval int BITLEN_M2() { return BITLEN() - 2; }
int isqrt(unsigned int n, bool* isSquare)
{
int answer = 0;
int remaining = 0;
for (int i = 0; i < BITLEN(); i += 2) {
int bits = (n >> (BITLEN_M2() - i)) & 3;
remaining = (remaining << 2) + bits;
int trial = (answer << 1) | 1;
if (trial <= remaining) {
remaining -= trial;
answer = trial;
} else {
answer <<= 1;
}
}
*isSquare = (remaining == 0);
return answer;
}
Should probably change all the int
to unsigned int
too, but that's not performance related.
I was thinking that the algorithm could be created without the conditional. And I see that some of the other answers that arrived after mine are very good. When reading them I think, why didn't I think of that :D
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With