
Why is Java integer division faster than the Hacker's Delight implementation?

I am testing the throughput of the divs10 function from the Hacker's Delight book, coded in Java on 64-bit JDK 1.7 update 21, on an Intel i7 box:

    processor  : 7
    vendor_id  : GenuineIntel
    cpu family : 6
    model      : 26
    model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz

I am wondering why the default Java `/` operator is faster than the divs10 function from the Hacker's Delight book. To my surprise, the results show divs10 is 3 times slower than the `/` operator.

Can anybody tell me whether there is some fancy intrinsic the JVM could be using?

Source code is below.

    public class div10 {

        public static final int divs10(int n) {
            int q, r;

            n = n + (n >> 31 & 9);           // bias negative n so the shifts truncate toward zero
            q = (n >> 1) + (n >> 2);
            q += q >> 4;
            q += q >> 8;
            q += q >> 16;
            q = q >> 3;                      // q is now an approximation of n / 10
            r = n - ((q << 3) + (q << 1));   // r = n - 10 * q
            return q + ((r + 6) >> 4);       // correct q upward when the remainder reaches 10
        }

        public static void main(String[] args) {
            /*
            // Exhaustive correctness check over the full int range (disabled):
            long count = 0;
            for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
                if ((i / 10) != divs10(i)) {
                    System.err.println("error dividing: " + i);
                }
                if ((i & 0xFFFFFFF) == 0) {
                    System.out.println("Finished: " + Long.toHexString(count) + ":" + count + ":" + i);
                }
                count++;
            }
            System.out.println("Success: " + count);
            */

            // Warm up both code paths so the JIT compiles them.
            long count = 0L;
            int iter = 100_000;
            for (int j = 0; j < 10; j++)
                for (int i = -iter; i < iter; i++) {
                    count += i / 10;
                }
            for (int j = 0; j < 10; j++)
                for (int i = -iter; i < iter; i++) {
                    count += divs10(i);
                }
            System.out.println(count + " warm up done");

            // Time the "/" operator over the full int range.
            long start = System.nanoTime();
            count = 0L;
            for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
                count += i / 10;
            }
            long took = System.nanoTime() - start;
            System.out.println(count + ", took: " + took / 1_000_000L + " ms, "
                    + took / ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE) + " ns per op");

            // Time divs10 over the same range.
            start = System.nanoTime();
            count = 0L;
            for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
                count += divs10(i);
            }
            took = System.nanoTime() - start;
            System.out.println(count + ", took: " + took / 1_000_000L + " ms, "
                    + took / ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE) + " ns per op");
        }
    }
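For reference, a standalone spot-check (a sketch, not part of the original post) that divs10 agrees with the `/` operator on boundary and near-boundary values, since the exhaustive loop above is commented out:

```java
public class Divs10Check {
    // Same shift-and-add divide-by-10 as in the question.
    static int divs10(int n) {
        n = n + (n >> 31 & 9);
        int q = (n >> 1) + (n >> 2);
        q += q >> 4;
        q += q >> 8;
        q += q >> 16;
        q = q >> 3;
        int r = n - ((q << 3) + (q << 1));
        return q + ((r + 6) >> 4);
    }

    public static void main(String[] args) {
        int[] samples = { 0, 1, 9, 10, 11, -1, -9, -10, -11,
                          Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int n : samples) {
            if (divs10(n) != n / 10) {
                throw new AssertionError("mismatch at " + n);
            }
        }
        System.out.println("all samples match");
    }
}
```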
asked Mar 10 '14 by porkchop
1 Answer

Update: When looking at the newer Ivy Bridge table (p. 174), I saw that all the latencies were 1. This means that my previous explanation was not correct.

Counting the instructions executed in the divs10 method gives about 27 instructions (ignoring function-call overhead). The operations form a dependency chain: each one needs the previous result before it can start, so the instructions' latencies add up. According to the Ivy Bridge instruction table, every instruction involved has a latency of 1 clock cycle, giving a total of about 27 clock cycles.
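The dependency chain is easier to see with the computation unrolled into single-assignment form (a sketch; variable names are mine). Each line consumes the previous line's result, so even with 1-cycle latencies the steps cannot overlap in the pipeline:

```java
public class Divs10Chain {
    // divs10(n) rewritten so every step depends on the step before it.
    static int divs10(int n) {
        int n2 = n + ((n >> 31) & 9);        // bias negative inputs
        int q0 = (n2 >> 1) + (n2 >> 2);
        int q1 = q0 + (q0 >> 4);
        int q2 = q1 + (q1 >> 8);
        int q3 = q2 + (q2 >> 16);
        int q  = q3 >> 3;                    // approximate quotient
        int r  = n2 - ((q << 3) + (q << 1)); // r = n2 - 10 * q
        return q + ((r + 6) >> 4);           // +1 when the remainder reaches 10
    }

    public static void main(String[] args) {
        System.out.println(divs10(12345));   // prints 1234
    }
}
```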

Compare this with a single IDIV (8-bit) instruction, which the table lists at a latency of about 20 clock cycles.

A raw estimate would give 27 cycles / 20 cycles = 1.35 times slower, which does not agree with the results you have. I'm not an expert at this, but I think the difference is that divisions with the IDIV instruction can overlap, because they are independent of one another. IDIV has a throughput of 8 clock cycles, which lets the CPU run roughly 4 divisions per 52 cycles (this is an estimate).

So, to perform 4 divisions with the bit-shifting algorithm, you would need 4 × 27 = 108 cycles, whereas IDIV would need approximately 52 clock cycles. This gives 108 / 52 ≈ 2.1 times slower.
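The back-of-envelope arithmetic above, in code form (the cycle counts are this answer's own estimates, not measured values):

```java
public class CycleEstimate {
    public static void main(String[] args) {
        int shiftCyclesPerDiv = 27;   // serial chain of 1-cycle ops in divs10
        int idivCyclesFor4    = 52;   // estimated cycles for 4 overlapped IDIVs
        double ratio = (4.0 * shiftCyclesPerDiv) / idivCyclesFor4;
        System.out.printf("%.2f times slower%n", ratio); // prints 2.08
    }
}
```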

This gets close to the ratio you measured. I guess the remaining extra time goes to function-call overhead, or perhaps the CPU optimizes more aggressively than my estimate assumes.

answered Nov 23 '22 by Martijn Courteaux