Floored division always rounds the quotient down (towards −∞) rather than towards 0:

[Figure: division types (https://i.stack.imgur.com/ME0ZR.png)]

Is it possible to efficiently implement floored or Euclidean integer division in C/C++? (The obvious solution is to check the dividend's sign.)
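To make the three rounding conventions concrete, here is a minimal sketch in C. The function names `div_trunc`, `div_floor` and `div_euclid` are illustrative and do not come from the question. The sketch assumes C99 or later (where `/` truncates towards zero) and does not handle the overflow case `INT_MIN / -1`.

```c
#include <assert.h>

/* Truncated division: what the built-in `/` does since C99 (rounds towards 0). */
int div_trunc(int a, int b)
{
    assert(b != 0);
    return a / b;
}

/* Floored division: rounds towards negative infinity. */
int div_floor(int a, int b)
{
    assert(b != 0);
    int q = a / b;
    int r = a % b;
    /* Truncation rounded up exactly when the remainder is non-zero
       and its sign differs from the divisor's sign. */
    if (r != 0 && ((r < 0) != (b < 0)))
        q--;
    return q;
}

/* Euclidean division: the remainder a - q*b always lies in [0, |b|). */
int div_euclid(int a, int b)
{
    assert(b != 0);
    int q = a / b;
    int r = a % b;
    /* A negative remainder is fixed by moving q one step away from zero
       in the direction determined by the divisor's sign. */
    if (r < 0)
        q += (b > 0) ? -1 : 1;
    return q;
}
```

For a positive divisor the floored and Euclidean quotients coincide; they differ only when the divisor is negative, for example `7 / -2`.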


Efficiently implementing floored / euclidean integer division


2 Answers

I've written a test program to benchmark the ideas presented here:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <windows.h>

    #define N 10000000
    #define M 100

    int dividends[N], divisors[N], results[N];

    // All integer variants below assume b > 0.

    __forceinline int floordiv_signcheck(int a, int b)
    {
        return (a<0 ? a-(b-1) : a) / b;
    }

    __forceinline int floordiv_signcheck2(int a, int b)
    {
        return (a - (a<0 ? b-1 : 0)) / b;
    }

    __forceinline int floordiv_signmultiply(int a, int b)
    {
        // a>>31 is -1 for negative a (arithmetic shift, implementation-defined), else 0.
        return (a + (a>>(sizeof(a)*8-1))*(b-1)) / b;
    }

    __forceinline int floordiv_floatingpoint(int a, int b)
    {
        // I imagine that the call to floor can be replaced by a cast
        // if you can get FPU rounding control to work (I couldn't).
        return (int)floor((double)a / b);
    }

    int main(void)
    {
        // Note: rand() never returns a negative value, so this data
        // does not exercise the a<0 path of the functions above.
        for (int i=0; i<N; i++)
        {
            dividends[i] = rand();
            do
                divisors[i] = rand();
            while (divisors[i]==0);
        }

        LARGE_INTEGER t0, t1;

        QueryPerformanceCounter(&t0);
        for (int j=0; j<M; j++)
            for (int i=0; i<N; i++)
                results[i] = floordiv_signcheck(dividends[i], divisors[i]);
        QueryPerformanceCounter(&t1);
        printf("signcheck    : %9lld\n", t1.QuadPart-t0.QuadPart);

        QueryPerformanceCounter(&t0);
        for (int j=0; j<M; j++)
            for (int i=0; i<N; i++)
                results[i] = floordiv_signcheck2(dividends[i], divisors[i]);
        QueryPerformanceCounter(&t1);
        printf("signcheck2   : %9lld\n", t1.QuadPart-t0.QuadPart);

        QueryPerformanceCounter(&t0);
        for (int j=0; j<M; j++)
            for (int i=0; i<N; i++)
                results[i] = floordiv_signmultiply(dividends[i], divisors[i]);
        QueryPerformanceCounter(&t1);
        printf("signmultiply : %9lld\n", t1.QuadPart-t0.QuadPart);

        QueryPerformanceCounter(&t0);
        for (int j=0; j<M; j++)
            for (int i=0; i<N; i++)
                results[i] = floordiv_floatingpoint(dividends[i], divisors[i]);
        QueryPerformanceCounter(&t1);
        printf("floatingpoint: %9lld\n", t1.QuadPart-t0.QuadPart);

        return 0;
    }

Results (elapsed QueryPerformanceCounter ticks, lower is faster):

    signcheck    :  61458768
    signcheck2   :  61284370
    signmultiply :  61625076
    floatingpoint: 287315364

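So, according to my results, checking the sign is the fastest, although the three integer variants are within about 0.6% of each other and only the floating-point version is clearly slower (roughly 4.7 times):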

(a - (a<0 ? b-1 : 0)) / b
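Since the benchmark data above contains no negative dividends, it may be worth checking the expression separately for correctness. The following is a rough sketch of such a check. The helpers `is_floor_quotient` and `count_mismatches` are hypothetical names introduced here, and the check relies on the fact that, for b > 0, q equals floor(a/b) exactly when q*b <= a < q*b + b. The expression is only valid for positive divisors, and `a - (b - 1)` can overflow when `a` is close to `INT_MIN`.

```c
#include <assert.h>

/* The sign-check expression from the answer above; assumes b > 0. */
static int floordiv_signcheck2(int a, int b)
{
    assert(b > 0);
    return (a - (a < 0 ? b - 1 : 0)) / b;
}

/* Hypothetical helper: for b > 0, q is the floored quotient of a by b
   exactly when q*b <= a < q*b + b. Uses long long to avoid overflow. */
static int is_floor_quotient(int q, int a, int b)
{
    long long lo = (long long)q * b;
    return lo <= a && a < lo + b;
}

/* Hypothetical helper: counts the pairs with a in [-limit, limit] and
   b in [1, limit] for which the expression is not the floored quotient. */
int count_mismatches(int limit)
{
    int bad = 0;
    for (int a = -limit; a <= limit; a++)
        for (int b = 1; b <= limit; b++)
            if (!is_floor_quotient(floordiv_signcheck2(a, b), a, b))
                bad++;
    return bad;
}
```

A call such as `count_mismatches(300)` exercises both signs of the dividend over a small range; it is a sanity check, not a proof for the full `int` range.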


answered Sep 22 '22 03:09

Vladimir Panteleev

I'm revisiting this question five years later, as this is relevant for me too. I did some performance measurements on two pure-C versions and two inline-assembly versions for x86-64, and the results may be interesting.

The tested variants of floored division are:

Variant 0: the implementation I've been using for some time now;
Variant 1: the slight variant on that, presented above, which only uses one division;
Variant 2: the previous one, but hand-implemented in inline assembly; and
Variant 3: a CMOV version implemented in assembly.

The following is my benchmark program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #ifndef VARIANT
    #define VARIANT 3
    #endif

    /* All variants assume b > 0. */
    #if VARIANT == 0
    #define floordiv(a, b) (((a) < 0)?((((a) + 1) / (b)) - 1):((a) / (b)))
    #elif VARIANT == 1
    #define floordiv(a, b) ((((a) < 0)?((a) - ((b) - 1)):(a)) / (b))
    #elif VARIANT == 2
    /* Variant 1 by hand: jump over the adjustment when a >= 0. */
    #define floordiv(a, b) ({                                   \
        int result;                                             \
        asm("test %%eax, %%eax; jns 1f; sub %1, %%eax;"         \
            "add $1, %%eax; 1: cltd; idivl %1;"                 \
            : "=a" (result)                                     \
            : "r" (b),                                          \
              "0" (a)                                           \
            : "rdx");                                           \
        result;})
    #elif VARIANT == 3
    /* Branchless: compute a - b + 1 in edx, select it with cmovs when a < 0. */
    #define floordiv(a, b) ({                                           \
        int result;                                                     \
        asm("mov %%eax, %%edx; sub %1, %%edx; add $1, %%edx;"           \
            "test %%eax, %%eax; cmovs %%edx, %%eax; cltd;"              \
            "idivl %1;"                                                 \
            : "=a" (result)                                             \
            : "r" (b),                                                  \
              "0" (a)                                                   \
            : "rdx");                                                   \
        result;})
    #endif

    /* Wall-clock time in seconds. */
    double ntime(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return(tv.tv_sec + (((double)tv.tv_usec) / 1000000.0));
    }

    void timediv(int n, int *p, int *q, int *r)
    {
        int i;

        for(i = 0; i < n; i++)
            r[i] = floordiv(p[i], q[i]);
    }

    int main(int argc, char **argv)
    {
        int n, i, *q, *p, *r;
        double st;

        n = 10000000;
        p = malloc(sizeof(*p) * n);
        q = malloc(sizeof(*q) * n);
        r = malloc(sizeof(*r) * n);
        /* Dividends have a random sign; divisors are always positive. */
        for(i = 0; i < n; i++) {
            p[i] = (rand() % 1000000) - 500000;
            q[i] = (rand() % 1000000) + 1;
        }

        st = ntime();
        for(i = 0; i < 100; i++)
            timediv(n, p, q, r);
        printf("%g\n", ntime() - st);
        return(0);
    }

I compiled this with gcc -march=native -Ofast using GCC 4.9.2, and the results, on my Core i5-2400, were as follows. The results are fairly reproducible from run to run -- they always land in the same order, at least.

Variant 0: 7.21 seconds
Variant 1: 7.26 seconds
Variant 2: 6.73 seconds
Variant 3: 4.32 seconds

So the CMOV implementation blows the others out of the water, at least. What surprises me is that variant 2 out-does its pure-C version (variant 1) by a fairly wide margin. I'd have thought the compiler should be able to emit code at least as efficient as mine.
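For comparison, the same select-without-branching idea can also be written in plain C with a sign mask. This is only a sketch: the name `floordiv_mask` is made up here, it assumes b > 0 and a two's-complement `int`, and whether a given compiler turns it into something as fast as the hand-written CMOV variant would have to be measured.

```c
#include <assert.h>

/* Branchless floored division for b > 0 (illustrative, not from the answer).
   Assumes two's-complement int, so that -1 has all bits set. */
int floordiv_mask(int a, int b)
{
    assert(b > 0);
    int mask = -(a < 0);               /* 0 when a >= 0, -1 when a < 0 */
    /* Subtract (b - 1) only for negative a, without a conditional jump
       in the source; the compiler is still free to choose the instructions. */
    return (a - ((b - 1) & mask)) / b;
}
```

For example, `floordiv_mask(-7, 2)` evaluates `(-7 - 1) / 2`, which truncates to the floored quotient of -7 by 2.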

Here are some other platforms, for comparison:

AMD Athlon 64 X2 4200+, GCC 4.7.2:

Variant 0: 26.33 seconds
Variant 1: 25.38 seconds
Variant 2: 25.19 seconds
Variant 3: 22.39 seconds

Xeon E3-1271 v3, GCC 4.9.2:

Variant 0: 5.95 seconds
Variant 1: 5.62 seconds
Variant 2: 5.40 seconds
Variant 3: 3.44 seconds

As a final note, I should perhaps warn against taking the apparent performance advantage of the CMOV version too seriously. In real-world data, the branch in the other versions will probably not be as random as in this benchmark, and if the branch predictor can do a reasonable job, the branching versions may turn out to be faster. How that plays out depends heavily on the data used in practice, so it is probably pointless to attempt a generic benchmark.
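One way to probe that effect would be to feed the benchmark dividends whose signs follow a predictable pattern instead of a random one. The sketch below is an assumption about how such input could be generated, not part of the original benchmark; `fill_dividends_runs` and its `run_length` parameter are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical input generator: fills p[0..n-1] with non-zero dividends
   whose sign flips every run_length elements. Long runs should be easy
   for a branch predictor; the original benchmark's signs are random. */
void fill_dividends_runs(int *p, int n, int run_length)
{
    assert(run_length > 0);
    for (int i = 0; i < n; i++) {
        int magnitude = rand() % 500000 + 1;   /* at least 1, so never zero */
        int negative = (i / run_length) % 2;   /* 0 for even runs, 1 for odd runs */
        p[i] = negative ? -magnitude : magnitude;
    }
}
```

Replacing the initialization of `p` in the benchmark with a call like `fill_dividends_runs(p, n, 1000)` would then show how the branching variants behave on well-predicted data.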

answered Sep 21 '22 03:09

Dolda2000
