 

Java Math.min/max performance


EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)

I've read a bit about how HotSpot has some "intrinsics" that it injects into the code, especially for the standard Java Math libraries (from here).

So I decided to give it a try, to see how much difference HotSpot could make against doing the comparison directly (especially since I've heard min/max can compile to branchless asm).

public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}

That's my implementation. Another SO question I read said that using the ternary operator takes an extra register, but I haven't found significant differences between doing an if block and using a ternary operator (ie, return (a > b) ? a : b).
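For reference, the ternary variant mentioned above, written out as a compilable class (the class name is my own, purely for illustration):

```java
public class OpsMathTernary {
    // Same semantics as the if-based version: returns the larger argument.
    public static int max(final int a, final int b) {
        return (a > b) ? a : b;
    }

    public static void main(final String[] args) {
        System.out.println(max(3, 7));   // prints 7
        System.out.println(max(-1, -5)); // prints -1
    }
}
```

Either source form is a candidate for the same JIT treatment; the measurements below suggest the source-level difference doesn't matter here.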

Allocating an 8 MB int array (ie, 2 million values) and randomizing it, I run the following test:

try ( final Benchmark bench = new Benchmark( "millis to max" ) )
{
    int max = Integer.MIN_VALUE;

    for ( int i = 0; i < array.length; ++i )
    {
        max = OpsMath.max( max, array[i] );
        // max = Math.max( max, array[i] );
    }
}

I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.
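The Benchmark class itself isn't shown in the question; a minimal sketch of such an AutoCloseable timer (my own reconstruction, not the OP's actual class) could look like:

```java
// Hypothetical reconstruction of the question's Benchmark helper:
// starts a nanosecond timer on construction and prints the elapsed
// milliseconds when the try-with-resources block calls close().
public class Benchmark implements AutoCloseable {
    private final String label;
    private final long start;

    public Benchmark(final String label) {
        this.label = label;
        this.start = System.nanoTime();
    }

    @Override
    public void close() {
        final double millis = (System.nanoTime() - start) / 1_000_000.0;
        System.out.println(label + " " + millis);
    }
}
```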

'max' is added to a list outside the benchmark block and printed later, to keep the JVM from optimizing the whole block away.

The array is randomized each time the test runs.

Running the test 6 times, it gives these results:

Java standard Math:

millis to max 9.242167
millis to max 2.1566199999999998
millis to max 2.046396
millis to max 2.048616
millis to max 2.035761
millis to max 2.001044

So fairly stable after the first run, and running the tests again gives similar results.

OpsMath:

millis to max 8.65418
millis to max 1.161559
millis to max 0.955851
millis to max 0.946642
millis to max 0.994543
millis to max 0.9469069999999999

Again, very stable results after the first run.

The question is: why? That's quite a big difference, and I have no idea why. Even if I implement my max() method exactly like Math.max() (ie, return (a >= b) ? a : b) I still get better results! It makes no sense.

Specs:

CPU: Intel i5 2500, 3.3 GHz. Java version: JDK 8 (public March 18 release), x64. Debian Jessie (testing release), x64.

I have yet to try with 32 bit JVM.

EDIT: Self-contained test as requested. I added a line to force the JVM to preload the Math and OpsMath classes. That eliminates the 18 ms cost of the first iteration of the OpsMath test.

// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8 MB alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time lists.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath methods so the JVM initializes the classes.
System.out.println("initialize classes " +
    OpsMath.max( Math.max( 20, array.length ), array.length / 2 ));

final Random r = new Random();

for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;

    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }

    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
        // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf( max ) );
}

// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}

Java Math.max result:

IT0 result: 2147477409 IT0 millis: 9.636998
IT1 result: 2147483098 IT1 millis: 1.901314
IT2 result: 2147482877 IT2 millis: 2.095551
IT3 result: 2147483286 IT3 millis: 1.9232859999999998
IT4 result: 2147482828 IT4 millis: 1.9455179999999999
IT5 result: 2147482475 IT5 millis: 1.882047

OpsMath.max result:

IT0 result: 2147482689 IT0 millis: 9.003616
IT1 result: 2147483480 IT1 millis: 0.882421
IT2 result: 2147483186 IT2 millis: 1.079143
IT3 result: 2147478560 IT3 millis: 0.8861169999999999
IT4 result: 2147477851 IT4 millis: 0.916383
IT5 result: 2147481983 IT5 millis: 0.873984

Still the same overall results. I've also tried randomizing the array only once and repeating the tests over the same array; I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.

Asked by TheStack, Mar 31 '14.




2 Answers

It's hard to tell why Math.max is slower than an Ops.max, but it's easy to tell why this benchmark strongly favors branching over conditional moves: on the n-th iteration, the probability of

Math.max( array[i], max ); 

being not equal to max is the probability that array[n-1] is bigger than all the previous elements. Obviously, this probability gets lower and lower as n grows, and given

final int[] array = new int[(8*1024*1024)/4]; 

it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability: it always takes the same amount of time to execute. The conditional move is faster than a branch if the branch is very hard to predict; on the other hand, the branch is faster if it can be predicted well with high probability. Currently, I'm unsure about the speed of the conditional move compared to the best and worst case of branching.1
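The rarity of those maximum updates is easy to check: for i.i.d. random values, the expected number of times the running maximum changes over n elements is the n-th harmonic number, roughly ln n (about 15 updates in 2 million iterations). A small counting sketch (class name and seed are mine, purely illustrative):

```java
import java.util.Random;

public class MaxUpdateCount {
    public static void main(final String[] args) {
        final int n = 2 * 1024 * 1024;   // same size as the benchmark array
        final Random r = new Random(42); // fixed seed for reproducibility
        int max = Integer.MIN_VALUE;
        int updates = 0;
        for (int i = 0; i < n; ++i) {
            final int v = r.nextInt();
            if (v > max) {               // the "taken" side of the branch
                max = v;
                ++updates;
            }
        }
        // Roughly ln(n) updates expected, i.e. on the order of 15 out of 2M.
        System.out.println("updates: " + updates);
    }
}
```

So after the first handful of iterations the branch is almost always "not taken", which is exactly the case branch prediction handles best.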

In your case all but the first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves, as the branch is rather guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).

This seems to happen for algorithms computing minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).


1 Both a conditional move and a correctly predicted branch take one cycle. The problem with the former is that it needs both of its operands ready, and this takes additional instructions. In the end the critical path may get longer and/or the ALUs may be saturated while the branching unit sits idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.

As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in the comments. My own benchmark shows that the conditional move is significantly faster than branch prediction in its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.

Answered by maaartinus, Sep 28 '22.


When I run your (suitably-modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:

0x00007f4b65425c50: mov    %r11d,%edi         ;*getstatic array
                                              ; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov    0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov    0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov    0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov    0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov    0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov    0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov    0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov    0x1c(%rax,%rdx,4),%r14d  ;*iaload
                                              ; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp    %edi,%r8d
0x00007f4b65425c7c: cmovl  %edi,%r8d
0x00007f4b65425c80: cmp    %r8d,%r10d
0x00007f4b65425c83: cmovl  %r8d,%r10d
0x00007f4b65425c87: cmp    %r10d,%ecx
0x00007f4b65425c8a: cmovl  %r10d,%ecx
0x00007f4b65425c8e: cmp    %ecx,%r14d
0x00007f4b65425c91: cmovl  %ecx,%r14d
0x00007f4b65425c95: cmp    %r14d,%esi
0x00007f4b65425c98: cmovl  %r14d,%esi
0x00007f4b65425c9c: cmp    %esi,%ebx
0x00007f4b65425c9e: cmovl  %esi,%ebx
0x00007f4b65425ca1: cmp    %ebx,%r9d
0x00007f4b65425ca4: cmovl  %ebx,%r9d
0x00007f4b65425ca8: cmp    %r9d,%r11d
0x00007f4b65425cab: cmovl  %r9d,%r11d         ;*invokestatic max
                                              ; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add    $0x8,%edx          ;*iinc
                                              ; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp    $0x1ffff9,%edx
0x00007f4b65425cb8: jl     0x00007f4b65425c50

Apart from the weirdly placed REX prefix (not sure what that's about), here you have a loop that's been unrolled 8 times and does mostly what you'd expect: loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls, or 8 separate cmovl chains to be merged after the loop is done.

With the explicit OpsMath.max, it turns into a rat's nest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare, and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, it outputs an 8-deep cmovle chain instead. EDIT: As @maaartinus points out, that rat's nest of branches is actually faster on some machines, because the branch predictor works its magic on them and they are well-predicted branches.

I would hesitate to draw conclusions from this benchmark. It has construction issues: you have to run it many more times than you are, and you have to factor your code differently if you want to time HotSpot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well HotSpot understands what you're trying to do, or anything else of value here. Both implementations of max result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.
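As a sketch of what "run it a lot more times" could look like in plain Java (harness details are my own, not from the answer): warm the loop up so the JIT compiles it, repeat the timed section many times, report the best observed time, and keep a sink value alive so the loop can't be eliminated.

```java
import java.util.Random;

public class MaxBench {
    // The loop under test, factored into its own method so the JIT
    // compiles it as a unit.
    static int maxOf(final int[] a) {
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; ++i) {
            max = Math.max(max, a[i]);
        }
        return max;
    }

    public static void main(final String[] args) {
        final int[] array = new int[2 * 1024 * 1024];
        final Random r = new Random(1);
        for (int i = 0; i < array.length; ++i) {
            array[i] = r.nextInt();
        }

        // Warm-up: let HotSpot compile maxOf before timing it.
        int sink = 0;
        for (int w = 0; w < 50; ++w) {
            sink += maxOf(array);
        }

        // Many timed repetitions; keep the minimum as the least-noisy sample.
        long best = Long.MAX_VALUE;
        for (int it = 0; it < 100; ++it) {
            final long start = System.nanoTime();
            sink += maxOf(array);
            best = Math.min(best, System.nanoTime() - start);
        }

        // Printing sink prevents dead-code elimination of the loop.
        System.out.println("sink=" + sink + " best millis=" + best / 1e6);
    }
}
```

A dedicated harness such as JMH automates the warm-up and statistics and is a better choice for real measurements.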

Answered by tmyklebu, Sep 28 '22.