Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java 8u40 Math.round() very slow

Tags:

I have a fairly simple hobby project written in Java 8 that makes extensive use of repeated Math.round() calls in one of its modes of operation. For example, one such mode spawns 4 threads and queues 48 runnable tasks by way of an ExecutorService, each of which runs something similar to the following block of code 2^31 times:

int3 = Math.round(float1 + float2); int3 = Math.round(float1 * float2); int3 = Math.round(float1 / float2); 

That's not exactly how it is (there are arrays involved, and nested loops), but you get the idea. Anyway, prior to Java 8u40, the code that resembles the above could complete the full run of ~103 billion instruction blocks in about 13 seconds on an AMD A10-7700k. With Java 8u40 it takes around 260 seconds to do the same thing. No changes to code, no nothing, just a Java update.

Has anyone else noticed Math.round() getting much slower, especially when it is used repetitiously? It's almost as though the JVM was doing some sort of optimization before that it isn't doing anymore. Maybe it was using SIMD prior to 8u40 and it isn't now?

edit: I have completed my second attempt at an MVCE. You can download the first attempt here:

https://www.dropbox.com/s/rm2ftcv8y6ye1bi/MathRoundMVCE.zip?dl=0

The second attempt is below. My first attempt has been otherwise removed from this post as it has been deemed to be too long, and is prone to dead code removal optimizations by the JVM (which apparently are happening less in 8u40).

import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors;  public class MathRoundMVCE {                static long grandtotal = 0;     static long sumtotal = 0;      static float[] float4 = new float[128];     static float[] float5 = new float[128];     static int[] int6 = new int[128];     static int[] int7 = new int[128];     static int[] int8 = new int[128];     static long[] longarray = new long[480];      final static int mil = 1000000;      public static void main(String[] args)     {                initmainarrays();         OmniCode omni = new OmniCode();         grandtotal = omni.runloops() / mil;         System.out.println("Total sum of operations is " + sumtotal);         System.out.println("Total execution time is " + grandtotal + " milliseconds");     }         public static long siftarray(long[] larray)     {         long topnum = 0;         long tempnum = 0;         for (short i = 0; i < larray.length; i++)         {             tempnum = larray[i];             if (tempnum > 0)             {                 topnum += tempnum;             }         }         topnum = topnum / Runtime.getRuntime().availableProcessors();         return topnum;     }      public static void initmainarrays()     {         int k = 0;          do         {                        float4[k] = (float)(Math.random() * 12) + 1f;             float5[k] = (float)(Math.random() * 12) + 1f;             int6[k] = 0;              k++;         }         while (k < 128);             }        }  class OmniCode extends Thread {                volatile long totaltime = 0;     final int standard = 16777216;     final int warmup = 200000;      byte threads = 0;      public long runloops()     {         this.setPriority(MIN_PRIORITY);          threads = (byte)Runtime.getRuntime().availableProcessors();         ExecutorService executor = Executors.newFixedThreadPool(threads);          for (short j = 0; j < 48; j++)         {                        executor.execute(new RoundFloatToIntAlternate(warmup, (byte)j));         }          executor.shutdown();          while (!executor.isTerminated())         {             try             {                 Thread.sleep(100);             }              catch (InterruptedException e)             {                 //Do nothing                             }         }          executor = Executors.newFixedThreadPool(threads);          for (short j = 0; j < 48; j++)         {                        executor.execute(new RoundFloatToIntAlternate(standard, (byte)j));                   }          executor.shutdown();          while (!executor.isTerminated())         {             try             {                 Thread.sleep(100);             }              catch (InterruptedException e)             {                 //Do nothing                             }         }          totaltime = MathRoundMVCE.siftarray(MathRoundMVCE.longarray);             executor = null;         Runtime.getRuntime().gc();         return totaltime;     } }  class RoundFloatToIntAlternate extends Thread {            int i = 0;     int j = 0;     int int3 = 0;     int iterations = 0;     byte thread = 0;      public RoundFloatToIntAlternate(int cycles, byte threadnumber)     {         iterations = cycles;         thread = threadnumber;     }      public void run()     {         this.setPriority(9);         MathRoundMVCE.longarray[this.thread] = 0;         mainloop();         blankloop();          }      public void blankloop()     {         j = 0;         long timer = 0;         long totaltimer = 0;          do         {                timer = System.nanoTime();             i = 0;              do             {                 i++;             }             while (i < 128);             totaltimer += System.nanoTime() - timer;                          j++;         }         while (j < iterations);                   MathRoundMVCE.longarray[this.thread] -= totaltimer;     }      public void mainloop()     {         j = 0;         long timer = 0;          long totaltimer = 0;         long localsum = 0;          int[] int6 = new int[128];         int[] int7 = new int[128];         int[] int8 = new int[128];          do         {                timer = System.nanoTime();             i = 0;              do             {                 int6[i] = Math.round(MathRoundMVCE.float4[i] + MathRoundMVCE.float5[i]);                 int7[i] = Math.round(MathRoundMVCE.float4[i] * MathRoundMVCE.float5[i]);                 int8[i] = Math.round(MathRoundMVCE.float4[i] / MathRoundMVCE.float5[i]);                  i++;             }             while (i < 128);             totaltimer += System.nanoTime() - timer;              for(short z = 0; z < 128; z++)             {                 localsum += int6[z] + int7[z] + int8[z];             }                     j++;         }         while (j < iterations);                   MathRoundMVCE.longarray[this.thread] += totaltimer;         MathRoundMVCE.sumtotal = localsum;     } } 

Long story short, this code performed about the same in 8u25 as in 8u40. As you can see, I am now recording the results of all calculations into arrays, and then summing those arrays outside of the timed portion of the loop to a local variable which is then written to a static variable at the end of the outer loop.

Under 8u25: Total execution time is 261545 milliseconds

Under 8u40: Total execution time is 266890 milliseconds

Test conditions were the same as before. So, it would appear that 8u25 and 8u31 were doing dead code removal that 8u40 stopped doing, causing the code to "slow down" in 8u40. That doesn't explain every weird little thing that's cropped up but that appears to be the bulk of it. As an added bonus, the suggestions and answers provided here have given me inspiration to improve the other parts of my hobby project, for which I am quite grateful. Thank you all for that!

like image 936
user3765373 Avatar asked Mar 14 '15 16:03

user3765373


1 Answers

Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Modern JVMs are too complex, and do all kinds of optimization. If you try to measure some small piece of code, it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing. The culprit of many benchmarks is the dead-code elimination: compilers are smart enough to deduce some computations are redundant, and eliminate them completely. Please read the following slides http://shipilev.net/talks/jvmls-July2014-benchmarking.pdf. In order to "fix" Adam's microbenchmark(I still can't understand what it is measuring and this "fix" does not take into account warm up, OSR and many others microbenchmarking pitfalls) we have to print the result of the computation to system output:

    int result = 0;     long t0 = System.currentTimeMillis();     for (int i = 0; i < 1e9; i++) {         result += Math.round((float) i / (float) (i + 1));     }     long t1 = System.currentTimeMillis();     System.out.println("result = " + result);     System.out.println(String.format("%s, Math.round(float), %.1f ms", System.getProperty("java.version"), (t1 - t0)/1f)); 

As a result:

result = 999999999 1.8.0_25, Math.round(float), 5251.0 ms  result = 999999999 1.8.0_40, Math.round(float), 3903.0 ms 

The same "fix" for original MVCE example

It took 401772 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_40  It took 410767 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_25 

If you want to measure the real cost of Math#round you should write something like this(based on jmh)

package org.openjdk.jmh.samples;  import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.runner.Runner; import org.openjdk.jmh.runner.RunnerException; import org.openjdk.jmh.runner.options.Options; import org.openjdk.jmh.runner.options.OptionsBuilder; import org.openjdk.jmh.runner.options.VerboseMode;  import java.util.Random; import java.util.concurrent.TimeUnit;  @State(Scope.Benchmark) @BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(TimeUnit.NANOSECONDS) @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS) @Measurement(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS) public class RoundBench {      float[] floats;     int i;      @Setup     public void initI() {         Random random = new Random(0xDEAD_BEEF);         floats = new float[8096];         for (int i = 0; i < floats.length; i++) {             floats[i] = random.nextFloat();         }     }      @Benchmark     public float baseline() {         i++;         i = i & 0xFFFFFF00;         return floats[i];     }      @Benchmark     public int round() {         i++;         i = i & 0xFFFFFF00;         return Math.round(floats[i]);     }      public static void main(String[] args) throws RunnerException {         Options options = new OptionsBuilder()                 .include(RoundBench.class.getName())                 .build();         new Runner(options).run();     } } 

My results are:

1.8.0_25 Benchmark            Mode  Cnt  Score   Error  Units RoundBench.baseline  avgt    6  2.565 ± 0.028  ns/op RoundBench.round     avgt    6  4.459 ± 0.065  ns/op  1.8.0_40  Benchmark            Mode  Cnt  Score   Error  Units RoundBench.baseline  avgt    6  2.589 ± 0.045  ns/op RoundBench.round     avgt    6  4.588 ± 0.182  ns/op 

In order to find the root cause of the problem you can use https://github.com/AdoptOpenJDK/jitwatch/. To save time, I can say that the size of JITted code for Math#round was increased in 8.0_40. It is almost unnoticeable for small methods, but in case of huge methods too long sheet of machine code pollutes instruction cache.

like image 190
Ivan Mamontov Avatar answered Oct 03 '22 08:10

Ivan Mamontov