I noticed a curious thing on my computer.* The handwritten divisibility test is significantly faster than the %
operator. Consider the minimal example:
* AMD Ryzen Threadripper 2990WX, GCC 9.2.0
static int divisible_ui_p(unsigned int m, unsigned int a) { if (m <= a) { if (m == a) { return 1; } return 0; } m += a; m >>= __builtin_ctz(m); return divisible_ui_p(m, a); }
The example is limited by odd a
and m > 0
. However, it can be easily generalized to all a
and m
. The code just converts the division to a series of additions.
Now consider the test program compiled with -std=c99 -march=native -O3
:
for (unsigned int a = 1; a < 100000; a += 2) { for (unsigned int m = 1; m < 100000; m += 1) { #if 1 volatile int r = divisible_ui_p(m, a); #else volatile int r = (m % a == 0); #endif } }
... and the results on my computer:
| implementation | time [secs] | |--------------------|-------------| | divisible_ui_p | 8.52user | | builtin % operator | 17.61user |
Therefore more than 2 times faster.
The question: Can you tell me how the code behaves on your machine? Is it missed optimization opportunity in GCC? Can you do this test even faster?
UPDATE: As requested, here is a minimal reproducible example:
#include <assert.h> static int divisible_ui_p(unsigned int m, unsigned int a) { if (m <= a) { if (m == a) { return 1; } return 0; } m += a; m >>= __builtin_ctz(m); return divisible_ui_p(m, a); } int main() { for (unsigned int a = 1; a < 100000; a += 2) { for (unsigned int m = 1; m < 100000; m += 1) { assert(divisible_ui_p(m, a) == (m % a == 0)); #if 1 volatile int r = divisible_ui_p(m, a); #else volatile int r = (m % a == 0); #endif } } return 0; }
compiled with gcc -std=c99 -march=native -O3 -DNDEBUG
on AMD Ryzen Threadripper 2990WX with
gcc --version gcc (Gentoo 9.2.0-r2 p3) 9.2.0
UPDATE2: As requested, the version that can handle any a
and m
(if you also want to avoid integer overflow, the test has to be implemented with integer type twice as long as the input integers):
int divisible_ui_p(unsigned int m, unsigned int a) { #if 1 /* handles even a */ int alpha = __builtin_ctz(a); if (alpha) { if (__builtin_ctz(m) < alpha) { return 0; } a >>= alpha; } #endif while (m > a) { m += a; m >>= __builtin_ctz(m); } if (m == a) { return 1; } #if 1 /* ensures that 0 is divisible by anything */ if (m == 0) { return 1; } #endif return 0; }
What you’re doing is called strength reduction: replacing an expensive operation with a series of cheap ones.
The mod instruction on many CPUs is slow, because it historically was not tested in several common benchmarks and the designers therefore optimized other instructions instead. This algorithm will perform worse if it has to do many iterations, and %
will perform better on a CPU where it needs only two clock cycles.
Finally, be aware that there are many shortcuts to take the remainder of division by specific constants. (Although compilers will generally take care of this for you.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With