Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster divisibility test than % operator?

Tags:

I noticed a curious thing on my computer.* The handwritten divisibility test is significantly faster than the % operator. Consider the minimal example:

* AMD Ryzen Threadripper 2990WX, GCC 9.2.0

static int divisible_ui_p(unsigned int m, unsigned int a) {     if (m <= a) {         if (m == a) {             return 1;         }          return 0;     }      m += a;      m >>= __builtin_ctz(m);      return divisible_ui_p(m, a); } 

The example is limited by odd a and m > 0. However, it can be easily generalized to all a and m. The code just converts the division to a series of additions.

Now consider the test program compiled with -std=c99 -march=native -O3:

    for (unsigned int a = 1; a < 100000; a += 2) {         for (unsigned int m = 1; m < 100000; m += 1) { #if 1             volatile int r = divisible_ui_p(m, a); #else             volatile int r = (m % a == 0); #endif         }     } 

... and the results on my computer:

| implementation     | time [secs] | |--------------------|-------------| | divisible_ui_p     |    8.52user | | builtin % operator |   17.61user | 

Therefore more than 2 times faster.

The question: Can you tell me how the code behaves on your machine? Is it missed optimization opportunity in GCC? Can you do this test even faster?


UPDATE: As requested, here is a minimal reproducible example:

#include <assert.h>  static int divisible_ui_p(unsigned int m, unsigned int a) {     if (m <= a) {         if (m == a) {             return 1;         }          return 0;     }      m += a;      m >>= __builtin_ctz(m);      return divisible_ui_p(m, a); }  int main() {     for (unsigned int a = 1; a < 100000; a += 2) {         for (unsigned int m = 1; m < 100000; m += 1) {             assert(divisible_ui_p(m, a) == (m % a == 0)); #if 1             volatile int r = divisible_ui_p(m, a); #else             volatile int r = (m % a == 0); #endif         }     }      return 0; } 

compiled with gcc -std=c99 -march=native -O3 -DNDEBUG on AMD Ryzen Threadripper 2990WX with

gcc --version gcc (Gentoo 9.2.0-r2 p3) 9.2.0 

UPDATE2: As requested, the version that can handle any a and m (if you also want to avoid integer overflow, the test has to be implemented with integer type twice as long as the input integers):

int divisible_ui_p(unsigned int m, unsigned int a) { #if 1     /* handles even a */     int alpha = __builtin_ctz(a);      if (alpha) {         if (__builtin_ctz(m) < alpha) {             return 0;         }          a >>= alpha;     } #endif      while (m > a) {         m += a;         m >>= __builtin_ctz(m);     }      if (m == a) {         return 1;     }  #if 1     /* ensures that 0 is divisible by anything */     if (m == 0) {         return 1;     } #endif      return 0; } 
like image 618
DaBler Avatar asked Mar 18 '20 16:03

DaBler


1 Answers

What you’re doing is called strength reduction: replacing an expensive operation with a series of cheap ones.

The mod instruction on many CPUs is slow, because it historically was not tested in several common benchmarks and the designers therefore optimized other instructions instead. This algorithm will perform worse if it has to do many iterations, and % will perform better on a CPU where it needs only two clock cycles.

Finally, be aware that there are many shortcuts to take the remainder of division by specific constants. (Although compilers will generally take care of this for you.)

like image 169
Davislor Avatar answered Nov 13 '22 03:11

Davislor