I have a fairly simple hobby project written in Java 8 that makes extensive use of repeated Math.round() calls in one of its modes of operation. For example, one such mode spawns 4 threads and queues 48 runnable tasks by way of an ExecutorService, each of which runs something similar to the following block of code 2^31 times:
int3 = Math.round(float1 + float2); int3 = Math.round(float1 * float2); int3 = Math.round(float1 / float2);
That's not exactly how it is (there are arrays involved, and nested loops), but you get the idea. Anyway, prior to Java 8u40, the code that resembles the above could complete the full run of ~103 billion instruction blocks in about 13 seconds on an AMD A10-7700k. With Java 8u40 it takes around 260 seconds to do the same thing. No changes to code, no nothing, just a Java update.
Has anyone else noticed Math.round() getting much slower, especially when it is used repetitiously? It's almost as though the JVM was doing some sort of optimization before that it isn't doing anymore. Maybe it was using SIMD prior to 8u40 and it isn't now?
edit: I have completed my second attempt at an MVCE. You can download the first attempt here:
https://www.dropbox.com/s/rm2ftcv8y6ye1bi/MathRoundMVCE.zip?dl=0
The second attempt is below. My first attempt has been otherwise removed from this post as it has been deemed to be too long, and is prone to dead code removal optimizations by the JVM (which apparently are happening less in 8u40).
import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; public class MathRoundMVCE { static long grandtotal = 0; static long sumtotal = 0; static float[] float4 = new float[128]; static float[] float5 = new float[128]; static int[] int6 = new int[128]; static int[] int7 = new int[128]; static int[] int8 = new int[128]; static long[] longarray = new long[480]; final static int mil = 1000000; public static void main(String[] args) { initmainarrays(); OmniCode omni = new OmniCode(); grandtotal = omni.runloops() / mil; System.out.println("Total sum of operations is " + sumtotal); System.out.println("Total execution time is " + grandtotal + " milliseconds"); } public static long siftarray(long[] larray) { long topnum = 0; long tempnum = 0; for (short i = 0; i < larray.length; i++) { tempnum = larray[i]; if (tempnum > 0) { topnum += tempnum; } } topnum = topnum / Runtime.getRuntime().availableProcessors(); return topnum; } public static void initmainarrays() { int k = 0; do { float4[k] = (float)(Math.random() * 12) + 1f; float5[k] = (float)(Math.random() * 12) + 1f; int6[k] = 0; k++; } while (k < 128); } } class OmniCode extends Thread { volatile long totaltime = 0; final int standard = 16777216; final int warmup = 200000; byte threads = 0; public long runloops() { this.setPriority(MIN_PRIORITY); threads = (byte)Runtime.getRuntime().availableProcessors(); ExecutorService executor = Executors.newFixedThreadPool(threads); for (short j = 0; j < 48; j++) { executor.execute(new RoundFloatToIntAlternate(warmup, (byte)j)); } executor.shutdown(); while (!executor.isTerminated()) { try { Thread.sleep(100); } catch (InterruptedException e) { //Do nothing } } executor = Executors.newFixedThreadPool(threads); for (short j = 0; j < 48; j++) { executor.execute(new RoundFloatToIntAlternate(standard, (byte)j)); } executor.shutdown(); while (!executor.isTerminated()) { try { Thread.sleep(100); } catch (InterruptedException e) { //Do nothing } } totaltime = MathRoundMVCE.siftarray(MathRoundMVCE.longarray); executor = null; Runtime.getRuntime().gc(); return totaltime; } } class RoundFloatToIntAlternate extends Thread { int i = 0; int j = 0; int int3 = 0; int iterations = 0; byte thread = 0; public RoundFloatToIntAlternate(int cycles, byte threadnumber) { iterations = cycles; thread = threadnumber; } public void run() { this.setPriority(9); MathRoundMVCE.longarray[this.thread] = 0; mainloop(); blankloop(); } public void blankloop() { j = 0; long timer = 0; long totaltimer = 0; do { timer = System.nanoTime(); i = 0; do { i++; } while (i < 128); totaltimer += System.nanoTime() - timer; j++; } while (j < iterations); MathRoundMVCE.longarray[this.thread] -= totaltimer; } public void mainloop() { j = 0; long timer = 0; long totaltimer = 0; long localsum = 0; int[] int6 = new int[128]; int[] int7 = new int[128]; int[] int8 = new int[128]; do { timer = System.nanoTime(); i = 0; do { int6[i] = Math.round(MathRoundMVCE.float4[i] + MathRoundMVCE.float5[i]); int7[i] = Math.round(MathRoundMVCE.float4[i] * MathRoundMVCE.float5[i]); int8[i] = Math.round(MathRoundMVCE.float4[i] / MathRoundMVCE.float5[i]); i++; } while (i < 128); totaltimer += System.nanoTime() - timer; for(short z = 0; z < 128; z++) { localsum += int6[z] + int7[z] + int8[z]; } j++; } while (j < iterations); MathRoundMVCE.longarray[this.thread] += totaltimer; MathRoundMVCE.sumtotal = localsum; } }
Long story short, this code performed about the same in 8u25 as in 8u40. As you can see, I am now recording the results of all calculations into arrays, and then summing those arrays outside of the timed portion of the loop to a local variable which is then written to a static variable at the end of the outer loop.
Under 8u25: Total execution time is 261545 milliseconds
Under 8u40: Total execution time is 266890 milliseconds
Test conditions were the same as before. So, it would appear that 8u25 and 8u31 were doing dead code removal that 8u40 stopped doing, causing the code to "slow down" in 8u40. That doesn't explain every weird little thing that's cropped up but that appears to be the bulk of it. As an added bonus, the suggestions and answers provided here have given me inspiration to improve the other parts of my hobby project, for which I am quite grateful. Thank you all for that!
Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.
Modern JVMs are too complex, and do all kinds of optimization. If you try to measure some small piece of code, it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing. The culprit of many benchmarks is the dead-code elimination: compilers are smart enough to deduce some computations are redundant, and eliminate them completely. Please read the following slides http://shipilev.net/talks/jvmls-July2014-benchmarking.pdf. In order to "fix" Adam's microbenchmark(I still can't understand what it is measuring and this "fix" does not take into account warm up, OSR and many others microbenchmarking pitfalls) we have to print the result of the computation to system output:
int result = 0; long t0 = System.currentTimeMillis(); for (int i = 0; i < 1e9; i++) { result += Math.round((float) i / (float) (i + 1)); } long t1 = System.currentTimeMillis(); System.out.println("result = " + result); System.out.println(String.format("%s, Math.round(float), %.1f ms", System.getProperty("java.version"), (t1 - t0)/1f));
As a result:
result = 999999999 1.8.0_25, Math.round(float), 5251.0 ms result = 999999999 1.8.0_40, Math.round(float), 3903.0 ms
The same "fix" for original MVCE example
It took 401772 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_40 It took 410767 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_25
If you want to measure the real cost of Math#round you should write something like this(based on jmh)
package org.openjdk.jmh.samples; import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.runner.Runner; import org.openjdk.jmh.runner.RunnerException; import org.openjdk.jmh.runner.options.Options; import org.openjdk.jmh.runner.options.OptionsBuilder; import org.openjdk.jmh.runner.options.VerboseMode; import java.util.Random; import java.util.concurrent.TimeUnit; @State(Scope.Benchmark) @BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(TimeUnit.NANOSECONDS) @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS) @Measurement(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS) public class RoundBench { float[] floats; int i; @Setup public void initI() { Random random = new Random(0xDEAD_BEEF); floats = new float[8096]; for (int i = 0; i < floats.length; i++) { floats[i] = random.nextFloat(); } } @Benchmark public float baseline() { i++; i = i & 0xFFFFFF00; return floats[i]; } @Benchmark public int round() { i++; i = i & 0xFFFFFF00; return Math.round(floats[i]); } public static void main(String[] args) throws RunnerException { Options options = new OptionsBuilder() .include(RoundBench.class.getName()) .build(); new Runner(options).run(); } }
My results are:
1.8.0_25 Benchmark Mode Cnt Score Error Units RoundBench.baseline avgt 6 2.565 ± 0.028 ns/op RoundBench.round avgt 6 4.459 ± 0.065 ns/op 1.8.0_40 Benchmark Mode Cnt Score Error Units RoundBench.baseline avgt 6 2.589 ± 0.045 ns/op RoundBench.round avgt 6 4.588 ± 0.182 ns/op
In order to find the root cause of the problem you can use https://github.com/AdoptOpenJDK/jitwatch/. To save time, I can say that the size of JITted code for Math#round was increased in 8.0_40. It is almost unnoticeable for small methods, but in case of huge methods too long sheet of machine code pollutes instruction cache.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With