Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"mod 4" versus "& 3" performance

I have tried to test if using var & 3 is faster than var % 4 in java (it could also be & 2^n - 1 vs. % 2^n). I have made a simple program to calculate the average time it takes to do the calculations, but I get strange results and I can't conclude. For about 1000 calculations, the average is that mod 4 takes much more time, but when I try with about 1000000 calculations, both averages are about the same... I suspect this is due to java optimization of my code, but I am not sure.

Which of those two operations is supposed to be faster, and how is % implemented?

Thanks!

EDIT: Here is my test program.

    long startTime, time, sum;
    int iterations = 1000;
    int v;

    sum = 0;
    for(int i = 0; i < iterations; i++)
    {
        startTime = System.nanoTime();
        v = i % 4;
        time = System.nanoTime();
        sum += time-startTime;
    }
    System.out.println("Mod 4 : "+(sum/iterations));

    sum = 0;
    for(int i = 0; i < iterations; i++)
    {
        startTime = System.nanoTime();
        v = i & 3;
        time = System.nanoTime();
        sum += time-startTime;
    }
    System.out.println("& 3 : "+(sum/iterations));

With 100 iterations, I get 130 nanosec for mod 4 and 25060 nanosec for & 3.

For 1000 iterations, I get 1792 nanosec for mod 4 and 81 nanosec for & 3.

With 1000000 iterations, I get about 50 nanosec for both, while having mod 4 always a few nanosec longer.


1 Answers

Java, or any compiler for that matter, may optimize this statically or at runtime (for ones with JIT-ing capabilities), so it's hard to tell what your code really does, but if you inspect the machine code being performed eventually on any host machine, it's almost guaranteed that doing an AND operation would be significantly faster in terms of latency (and probably also in throughput) than modulo. The first requires very simple ALU unit that usually exists in abundance on most CPU cores, while the modulo would probably have to be performed over a divider unit that is both slower and more scarce (i.e. - exists on less execution ports).

However, there are too many layers between your java code and the actual bare metal CPU to be able to give concrete answer, you should either switch to a lower level of benchmarking (c or assembly), or take into account the other factors and observe the bytecode and on-the-fly changes made by the compiler.

like image 181
Leeor Avatar answered Nov 29 '25 02:11

Leeor



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!