I am in a need of fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add
, mul
, sub
, shift
in one cycle (well - the operation's result is written in third cycle, really - but there's interleaving), so any Integer algorithm that uses these ops and is fast would be very appreciated.
This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:
unsigned short int int_sqrt32(unsigned int x) { unsigned short int res=0; unsigned short int add= 0x8000; int i; for(i=0;i<16;i++) { unsigned short int temp=res | add; unsigned int g2=temp*temp; if (x>=g2) { res=temp; } add>>=1; } return res; }
Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). Probably no space in cache to unroll fully (but this will be the prime candidate for a partial unroll [e.g. A loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled). Which, if fully interleaved, would cost only 80 (+2 for last instruction) cycles.
Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?
FAQ:
There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.
sqrt
is actually a bottleneck?No, there is no need for that. The target RISC chip is about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this sqrt
is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 per second, and roughly 1,000,000 times for whole demo.
sqrt
?Yes, I did that already. In fact, I got rid of 2 sqrt
s already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.
It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. Up 60-times per second, though the SW rasterizer won't probably be that fast [but at least it can run on the other chip in parallel with the RISC - so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data, in parallel]).
Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only printing debugging".
Because it's 16-times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple times already, unfortunately, when shifting could not be used).
The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for sqrt
.
sqrt
?I managed to reduce it to mere unsigned 32-bit int (4,294,967,295)
EDIT1: I have ported two versions into the target RISC ASM, so I now have an exact count of ASM instructions during the execution (for the test scene).
Number of sqrt calls: 2,800.
Method1: The same method in this post (loop executing 16 times)
Method2: fred_sqrt (3c from http://www.azillionmonkeys.com/qed/sqroot.html)
Method1: 152.98 instructions per sqrt
Method2: 39.48 instructions per sqrt (with Final Rounding and 2 Newton iterations)
Method2: 21.01 instructions per sqrt (without Final Rounding and 2 Newton iterations)
Method2 uses LUT with 256 values, but since the target RISC can only use 32-bit access within its 4 KB cache, it actually takes 256*4 = 1 KB. But given its performance, I guess I will have to spare that 1 KB (out of 4).
Also, I have found out that there is NO visible visual artifact when I disable the Final rounding and two Newton iterations at the end (of Method2). Meaning, the precision of that LUT is apparently good enough. Who knew...
The final cost is then 21.01 instructions per sqrt, which is almost ~order of magnitude faster than the very first solution. There's also possibility of reducing it further by sacrificing few of the 32 available registers for the constants used for the conditions and jump labels (each condition must fill 2 registers - one with the actual constant (only values less than 16 are allowed within CMPQ instruction, larger ones must be put into register) we are comparing against and second for the jump to the else label (the then jump is fall-through), as the direct relative jump is only possible within ~10 instructions (impossible with such large if-then-else chain, other than innermost 2 conditions).
EDIT2: ASM micro-optimizations
While benchmarking, I added counters for each of the 26 If.Then.Else codeblocks, to see if there aren't any blocks executed most often. Turns out, that Blocks 0/10/11 are executed in 99.57%/99.57%/92.57% of cases. This means I can justify sacrificing 3 registers (out of 32) for those comparison constants (in those 3 blocks), e.g. r26 = $1.0000 r25 = $100.0000 r24 = $10.0000
This brought down the total instruction cost from 58,812 (avg:21.01) to 50,448 (avg:18.01)
So, now the average sqrt cost is just 18.01 ASM instructions (and there is no division!), though it will have to be inlined.
EDIT3: ASM micro-optimizations
Since we know that those 3 blocks (0/10/11) are executed most often, we can use local short jumps (16 Bytes in both directions, which is usually just couple of instructions (hence mostly useless), especially when the 6-byte MOVEI #jump_label, register is used during conditions) in those particular conditions. Of course, the Else condition will then incur additional 2 ops (that it would not have otherwise), but that's worth it. The block 10 will have to be swapped (Then block with Else block), which will make it harder to read and debug, but I documented the reasons profusely.
Now the total instruction cost (in test scene) is just 42,500 with an average of 15.18 ASM instructions per sqrt.
EDIT4: ASM micro-optimizations
Block 11 condition splits into innermost Blocks 12&13. It just so happens that those blocks don't need additional +1 math op, hence the local short jump can actually reach the Else block, if I merge bitshift right with the necessary bitshift left #2 (as all offsets within cache must be 32-bit). This saves on filling the jump register though I do need to sacrifice one more register r23 for the comparison value of $40.000.
The final cost is then 34,724 instructions with an average of 12.40 ASM instructions per sqrt.
I am also realizing that I could reshuffle the order of conditions (which will make the other range few ops more expensive, but that's happening for ~7% of cases only), favoring this particular range ($10.000, $40.000) first, saving on at least 1 or maybe even 2 conditions.
In which case, it should fall down to ~8.40 per sqrt.
I am realizing that the range depends directly on intensity of the light and the distance to the wall. Meaning, I have direct control over the RGB value of the light and distance from the wall. And while I would like the solution to be as generic as possible, given this realization (~12 ops per sqrt is mind-blowing), I will gladly sacrifice some flexibility in light colors if I can get sqrt this fast. Besides, there's maybe 10-15 different lights in whole demo, so I can simply find color combinations that eventually result in same sqrt range, but will get insanely fast sqrt. For sure, that's worth it to. And I still have a generic fallback (spanning entire int range) working just fine. Best of both worlds, really.
The square roots of the perfect squares (e.g., 0, 1, 4, 9, 16) are integers. In all other cases, the square roots of positive integers are irrational numbers, and hence have non-repeating decimals in their decimal representations.
The Math. sqrt() method finds the square root of the given number and the floor() method finds the closest integer of the square root value returned by sqrt() method. Later we have calculated the difference between these two to check whether the difference is zero or non-zero.
Have a look here.
For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:
/* by Jim Ulery */ static unsigned julery_isqrt(unsigned long val) { unsigned long temp, g=0, b = 0x8000, bshft = 15; do { if (val >= (temp = (((g << 1) + b)<<bshft--))) { g += b; val -= temp; } } while (b >>= 1); return g; }
No divisions, no multiplications, bit shifts only. However, the time taken will be somewhat unpredictable particularly if you use a branch (on ARM RISC conditional instructions would work).
In general, this page lists ways to calculate square roots. If you happen to want to produce a fast inverse square root (i.e. x**(-0.5)
), or are just interested in amazing ways to optimise code, take a look at this, this and this.
This is the same as yours, but with fewer ops. (I count 9 ops in the loop in your code, including test and increment i
in the for loop and 3 assignments, but perhaps some of those disappear when coded in ASM? There are 6 ops in the code below, if you count g*g>n
as two (no assignment)).
int isqrt(int n) { int g = 0x8000; int c = 0x8000; for (;;) { if (g*g > n) { g ^= c; } c >>= 1; if (c == 0) { return g; } g |= c; } }
I got it here. You can maybe eliminate a comparison if you unroll the loop and jump to the appropriate spot based on the highest non-zero bit in the input.
Update
I've been thinking more about using Newton's method. In theory, the number of bits of accuracy should double for each iteration. That means Newton's method is much worse than any of the other suggestions when there are few correct bits in the answer; however, the situation changes where there are a lot of correct bits in the answer. Considering that most suggestions seem to take 4 cycles per bit, that means that one iteration of Newton's method (16 cycles for division + 1 for addition + 1 for shift = 18 cycles) is not worthwhile unless it gives over 4 bits.
So, my suggestion is to build up 8 bits of the answer by one of the suggested methods (8*4 = 32 cycles) then perform one iteration of Newton's method (18 cycles) to double the number of bits to 16. That's a total of 50 cycles (plus maybe an extra 4 cycles to get 9 bits before applying Newton's method, plus maybe 2 cycles to overcome the overshoot occasionally experienced by Newton's method). That's a maximum of 56 cycles which as far as I can see rivals any of the other suggestions.
Second Update
I coded the hybrid algorithm idea. Newton's method itself has no overhead; you just apply and double the number of significant digits. The issue is to have a predictable number of significant digits before you apply Newton's method. For that, we need to figure out where the most significant bit of the answer will appear. Using a modification of the fast DeBruijn sequence method given by another poster, we can perform that calculation in about 12 cycles in my estimation. On the other hand, knowing the position of the msb of the answer speeds up all methods (average, not worst case), so it seems worthwhile no matter what.
After calculating the msb of the answer, I run a number of rounds of the algorithm suggested above, then finish it off with one or two rounds of Newton's method. We decide when to run Newton's method by the following calculation: one bit of the answer takes about 8 cycles according to calculation in the comments; one round of Newton's method takes about 18 cycles (division, addition, and shift, and maybe assignment), so we should only run Newton's method if we're going to get at least three bits out of it. So for 6 bit answers, we can run the linear method 3 times to get 3 significant bits, then run Newton's method 1 time to get another 3. For 15 bit answers, we run the linear method 4 times to get 4 bits, then Newton's method twice to get another 4 then another 7. And so on.
Those calculations depend on knowing exactly how many cycles are required to get a bit by the linear method vs. how many are required by Newton's method. If the "economics" change, e.g., by discovering a faster way to build up bits in a linear fashion, the decision of when to invoke Newton's method will change.
I unrolled the loops and implemented the decisions as switches, which I hope will translate into fast table lookups in assembly. I'm not absolutely sure that I've got the minimum number of cycles in each case, so maybe further tuning is possible. E.g., for s=10, you can try to get 5 bits then apply Newton's method once instead of twice.
I've tested the algorithm thoroughly for correctness. Some additional minor speedups are possible if you're willing to accept slightly incorrect answers in some cases. At least two cycles are used after applying Newton's method to correct an off-by-one error that occurs with numbers of the form m^2-1
. And a cycle is used testing for input 0 at the beginning, as the algorithm can't handle that input. If you know you're never going to take the square root of zero you can eliminate that test. Finally, if you only need 8 significant bits in the answer, you can drop one of the Newton's method calculations.
#include <inttypes.h> #include <stdint.h> #include <stdbool.h> #include <stdio.h> uint32_t isqrt1(uint32_t n); int main() { uint32_t n; bool it_works = true; for (n = 0; n < UINT32_MAX; ++n) { uint32_t sr = isqrt1(n); if ( sr*sr > n || ( sr < 65535 && (sr+1)*(sr+1) <= n )) { it_works = false; printf("isqrt(%" PRIu32 ") = %" PRIu32 "\n", n, sr); } } if (it_works) { printf("it works\n"); } return 0; } /* table modified to return shift s to move 1 to msb of square root of x */ /* static const uint8_t debruijn32[32] = { 0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19, 1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18 }; */ static const uint8_t debruijn32[32] = { 15, 0, 11, 0, 14, 11, 9, 1, 14, 13, 12, 5, 9, 3, 1, 6, 15, 10, 13, 8, 12, 4, 3, 5, 10, 8, 4, 2, 7, 2, 7, 6 }; /* based on CLZ emulation for non-zero arguments, from * http://stackoverflow.com/questions/23856596/counting-leading-zeros-in-a-32-bit-unsigned-integer-with-best-algorithm-in-c-pro */ uint8_t shift_for_msb_of_sqrt(uint32_t x) { x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x++; return debruijn32 [x * 0x076be629 >> 27]; } uint32_t isqrt1(uint32_t n) { if (n==0) return 0; uint32_t s = shift_for_msb_of_sqrt(n); uint32_t c = 1 << s; uint32_t g = c; switch (s) { case 9: case 5: if (g*g > n) { g ^= c; } c >>= 1; g |= c; case 15: case 14: case 13: case 8: case 7: case 4: if (g*g > n) { g ^= c; } c >>= 1; g |= c; case 12: case 11: case 10: case 6: case 3: if (g*g > n) { g ^= c; } c >>= 1; g |= c; case 2: if (g*g > n) { g ^= c; } c >>= 1; g |= c; case 1: if (g*g > n) { g ^= c; } c >>= 1; g |= c; case 0: if (g*g > n) { g ^= c; } } /* now apply one or two rounds of Newton's method */ switch (s) { case 15: case 14: case 13: case 12: case 11: case 10: g = (g + n/g) >> 1; case 9: case 8: case 7: case 6: g = (g + n/g) >> 1; } /* correct potential error at m^2-1 for Newton's method */ return (g==65536 || g*g>n) ? g-1 : g; }
In light testing on my machine (which admittedly is nothing like yours), the new isqrt1
routine runs about 40% faster on average than the previous isqrt
routine I gave.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With