I have chanced upon an instance of exceedingly erratic performance profile of a very simple map/reduce operation on primitive arrays. Here is my jmh benchmark code:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.ARRAY_SIZE)
@Warmup(iterations = 300, time = 200, timeUnit=MILLISECONDS)
@Measurement(iterations = 1, time = 1000, timeUnit=MILLISECONDS)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class Measure
{
static final int ARRAY_SIZE = 1<<20;
final int[] ds = new int[ARRAY_SIZE];
private IntUnaryOperator mapper;
@Setup public void setup() {
setAll(ds, i->(int)(Math.random()*(1<<7)));
final int multiplier = (int)(Math.random()*10);
mapper = d -> multiplier*d;
}
@Benchmark public double multiply() {
return Arrays.stream(ds).map(mapper).sum();
}
}
And here are the snippets of the typical output:
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 300 iterations, 200 ms each
# Measurement: 1 iterations, 1000 ms each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.sample.Measure.multiply
# Run progress: 0,00% complete, ETA 00:01:01
# Fork: 1 of 1
# Warmup Iteration 1: 0,779 ns/op
# Warmup Iteration 2: 0,684 ns/op
# Warmup Iteration 3: 0,608 ns/op
# Warmup Iteration 4: 0,619 ns/op
# Warmup Iteration 5: 0,642 ns/op
# Warmup Iteration 6: 0,638 ns/op
# Warmup Iteration 7: 0,660 ns/op
# Warmup Iteration 8: 0,611 ns/op
# Warmup Iteration 9: 0,636 ns/op
# Warmup Iteration 10: 0,692 ns/op
# Warmup Iteration 11: 0,632 ns/op
# Warmup Iteration 12: 0,612 ns/op
# Warmup Iteration 13: 1,280 ns/op
# Warmup Iteration 14: 7,261 ns/op
# Warmup Iteration 15: 7,379 ns/op
# Warmup Iteration 16: 7,376 ns/op
# Warmup Iteration 17: 7,379 ns/op
# Warmup Iteration 18: 7,195 ns/op
# Warmup Iteration 19: 7,351 ns/op
# Warmup Iteration 20: 7,761 ns/op
....
....
....
# Warmup Iteration 100: 7,300 ns/op
# Warmup Iteration 101: 7,384 ns/op
# Warmup Iteration 102: 7,132 ns/op
# Warmup Iteration 103: 7,278 ns/op
# Warmup Iteration 104: 7,331 ns/op
# Warmup Iteration 105: 7,335 ns/op
# Warmup Iteration 106: 7,450 ns/op
# Warmup Iteration 107: 7,346 ns/op
# Warmup Iteration 108: 7,826 ns/op
# Warmup Iteration 109: 7,221 ns/op
# Warmup Iteration 110: 8,017 ns/op
# Warmup Iteration 111: 7,611 ns/op
# Warmup Iteration 112: 7,376 ns/op
# Warmup Iteration 113: 0,707 ns/op
# Warmup Iteration 114: 0,828 ns/op
# Warmup Iteration 115: 0,608 ns/op
# Warmup Iteration 116: 0,634 ns/op
# Warmup Iteration 117: 0,633 ns/op
# Warmup Iteration 118: 0,660 ns/op
# Warmup Iteration 119: 0,635 ns/op
# Warmup Iteration 120: 0,566 ns/op
The key moments happen at iterations 13 and 113: first the performance is degraded by a factor of ten, then it is restored. The corresponding times are 2.5 and 22.5 seconds into the test run. The timing of these events is very sensitive to the array size, BTW.
What can possibly explain this sort of behavior? The JIT compiler had probably done its work within the first iteration; there is no GC action to speak of (confirmed by VisualVM)... I am at a complete loss as to any kind of explanation.
My version of Java (OS X):
$ java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
Take these steps for your warmup: Do about five to 10 minutes of light aerobic exercise to loosen up your muscles and warm up for your run. Some good pre-run warmup exercises include walking briskly, marching, jogging slowly, or cycling on a stationary bike. Make sure you don't rush your warmup.
Instead of holding still while stretching (also called static stretching), during a dynamic warm-up you move multiple muscles and joints. A warm-up should last approximately 5 to 10 minutes. Choose a warm-up that mimics the exercises you will be doing.
A good warm-up should last five to 10 minutes and work all major muscle groups. For best results, start slowly, then pick up the pace. Many warm-up routines focus on cardio and range-of-motion exercises, such as jumping jacks and lunges.
The JIT will first compile the hot loop that is iterating over and operating (map/reduce) on the array elements. This happens quite early on since the array contains 220 elements.
Later on the JIT compiles the pipeline, most likely inlined within the compiled benchmark method, and due to inlining limits fails to compile it all into one method. It just so happens those inlining limits are reached in the hot loop, and calls to map or sum are not inlined, so the hot loop is unintentionally "de-optimized".
Use the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining
when running the benchmark and early on you should see output like the following:
1202 487 % 4 java.util.Spliterators$IntArraySpliterator::forEachRemaining @ 49 (68 bytes)
@ 53 java.util.stream.IntPipeline$3$1::accept (23 bytes) inline (hot)
\-> TypeProfile (1186714/1186714 counts) = java/util/stream/IntPipeline$3$1
@ 12 test.Measure$$Lambda$2/1745776415::applyAsInt (9 bytes) inline (hot)
\-> TypeProfile (1048107/1048107 counts) = test/Measure$$Lambda$2
@ 5 test.Measure::lambda$setup$1 (4 bytes) inline (hot)
@ 17 java.util.stream.ReduceOps$5ReducingSink::accept (19 bytes) inline (hot)
\-> TypeProfile (1048107/1048107 counts) = java/util/stream/ReduceOps$5ReducingSink
@ 10 java.util.stream.IntPipeline$$Lambda$3/1779653790::applyAsInt (6 bytes) inline (hot)
\-> TypeProfile (1048064/1048064 counts) = java/util/stream/IntPipeline$$Lambda$3
@ 2 java.lang.Integer::sum (4 bytes) inline (hot)
That is the hot loop getting compiled. (The %
means it is On Stack Replaced, or OSR'ed)
Later on further compilation of the stream pipeline occurs (at i suspect ~10,000 iterations of the benchmark method, but I have not verified):
@ 16 java.util.stream.IntPipeline::sum (11 bytes) inline (hot)
\-> TypeProfile (5120/5120 counts) = java/util/stream/IntPipeline$3
@ 2 java.lang.invoke.LambdaForm$MH/1279902262::linkToTargetMethod (8 bytes) force inline by annotation
@ 4 java.lang.invoke.LambdaForm$MH/1847865997::identity (18 bytes) force inline by annotation
@ 14 java.lang.invoke.LambdaForm$DMH/2024969684::invokeStatic_L_L (14 bytes) force inline by annotation
@ 1 java.lang.invoke.DirectMethodHandle::internalMemberName (8 bytes) force inline by annotation
@ 10 sun.invoke.util.ValueConversions::identity (2 bytes) inline (hot)
@ 7 java.util.stream.IntPipeline::reduce (16 bytes) inline (hot)
@ 3 java.util.stream.ReduceOps::makeInt (18 bytes) inline (hot)
@ 1 java.util.Objects::requireNonNull (14 bytes) inline (hot)
@ 14 java.util.stream.ReduceOps$5::<init> (16 bytes) inline (hot)
@ 12 java.util.stream.ReduceOps$ReduceOp::<init> (10 bytes) inline (hot)
@ 1 java.lang.Object::<init> (1 bytes) inline (hot)
@ 6 java.util.stream.AbstractPipeline::evaluate (94 bytes) inline (hot)
@ 50 java.util.stream.AbstractPipeline::isParallel (8 bytes) inline (hot)
@ 80 java.util.stream.TerminalOp::getOpFlags (2 bytes) inline (hot)
\-> TypeProfile (5122/5122 counts) = java/util/stream/ReduceOps$5
@ 85 java.util.stream.AbstractPipeline::sourceSpliterator (163 bytes) inline (hot)
@ 79 java.util.stream.AbstractPipeline::isParallel (8 bytes) inline (hot)
@ 88 java.util.stream.ReduceOps$ReduceOp::evaluateSequential (18 bytes) inline (hot)
@ 2 java.util.stream.ReduceOps$5::makeSink (5 bytes) inline (hot)
@ 1 java.util.stream.ReduceOps$5::makeSink (16 bytes) inline (hot)
@ 12 java.util.stream.ReduceOps$5ReducingSink::<init> (15 bytes) inline (hot)
@ 11 java.lang.Object::<init> (1 bytes) inline (hot)
@ 6 java.util.stream.AbstractPipeline::wrapAndCopyInto (18 bytes) inline (hot)
@ 3 java.util.Objects::requireNonNull (14 bytes) inline (hot)
@ 9 java.util.stream.AbstractPipeline::wrapSink (37 bytes) inline (hot)
@ 1 java.util.Objects::requireNonNull (14 bytes) inline (hot)
@ 23 java.util.stream.IntPipeline$3::opWrapSink (10 bytes) inline (hot)
\-> TypeProfile (4868/4868 counts) = java/util/stream/IntPipeline$3
@ 6 java.util.stream.IntPipeline$3$1::<init> (11 bytes) inline (hot)
@ 7 java.util.stream.Sink$ChainedInt::<init> (16 bytes) inline (hot)
@ 1 java.lang.Object::<init> (1 bytes) inline (hot)
@ 6 java.util.Objects::requireNonNull (14 bytes) inline (hot)
@ 13 java.util.stream.AbstractPipeline::copyInto (53 bytes) inline (hot)
@ 1 java.util.Objects::requireNonNull (14 bytes) inline (hot)
@ 9 java.util.stream.AbstractPipeline::getStreamAndOpFlags (5 bytes) accessor
@ 12 java.util.stream.StreamOpFlag::isKnown (19 bytes) inline (hot)
@ 20 java.util.Spliterator::getExactSizeIfKnown (25 bytes) inline (hot)
\-> TypeProfile (4870/4870 counts) = java/util/Spliterators$IntArraySpliterator
@ 1 java.util.Spliterators$IntArraySpliterator::characteristics (5 bytes) accessor
@ 19 java.util.Spliterators$IntArraySpliterator::estimateSize (11 bytes) inline (hot)
@ 25 java.util.stream.Sink$ChainedInt::begin (11 bytes) inline (hot)
\-> TypeProfile (4870/4870 counts) = java/util/stream/IntPipeline$3$1
@ 5 java.util.stream.ReduceOps$5ReducingSink::begin (9 bytes) inline (hot)
\-> TypeProfile (4871/4871 counts) = java/util/stream/ReduceOps$5ReducingSink
@ 32 java.util.Spliterator$OfInt::forEachRemaining (53 bytes) inline (hot)
@ 12 java.util.Spliterators$IntArraySpliterator::forEachRemaining (68 bytes) inline (hot)
@ 53 java.util.stream.IntPipeline$3$1::accept (23 bytes) inline (hot)
@ 12 test.Measure$$Lambda$2/1745776415::applyAsInt (9 bytes) inline (hot)
\-> TypeProfile (1048107/1048107 counts) = test/Measure$$Lambda$2
@ 5 test.Measure::lambda$setup$1 (4 bytes) inlining too deep
@ 17 java.util.stream.ReduceOps$5ReducingSink::accept (19 bytes) inline (hot)
\-> TypeProfile (1048107/1048107 counts) = java/util/stream/ReduceOps$5ReducingSink
@ 10 java.util.stream.IntPipeline$$Lambda$3/1779653790::applyAsInt (6 bytes) inlining too deep
\-> TypeProfile (1048064/1048064 counts) = java/util/stream/IntPipeline$$Lambda$3
@ 53 java.util.stream.IntPipeline$3$1::accept (23 bytes) inline (hot)
@ 12 test.Measure$$Lambda$2/1745776415::applyAsInt (9 bytes) inline (hot)
\-> TypeProfile (1048107/1048107 counts) = test/Measure$$Lambda$2
@ 5 test.Measure::lambda$setup$1 (4 bytes) inlining too deep
@ 17 java.util.stream.ReduceOps$5ReducingSink::accept (19 bytes) inline (hot)
\-> TypeProfile (1048107/1048107 counts) = java/util/stream/ReduceOps$5ReducingSink
@ 10 java.util.stream.IntPipeline$$Lambda$3/1779653790::applyAsInt (6 bytes) inlining too deep
\-> TypeProfile (1048064/1048064 counts) = java/util/stream/IntPipeline$$Lambda$3
@ 38 java.util.stream.Sink$ChainedInt::end (10 bytes) inline (hot)
@ 4 java.util.stream.Sink::end (1 bytes) inline (hot)
\-> TypeProfile (5120/5120 counts) = java/util/stream/ReduceOps$5ReducingSink
@ 12 java.util.stream.ReduceOps$5ReducingSink::get (5 bytes) inline (hot)
@ 1 java.util.stream.ReduceOps$5ReducingSink::get (8 bytes) inline (hot)
@ 4 java.lang.Integer::valueOf (32 bytes) inline (hot)
@ 28 java.lang.Integer::<init> (10 bytes) inline (hot)
@ 1 java.lang.Number::<init> (5 bytes) inline (hot)
@ 1 java.lang.Object::<init> (1 bytes) inline (hot)
@ 12 java.lang.Integer::intValue (5 bytes) accessor
Note the "inlining too deep" that occurs for methods in the hot loop.
Even later on the generated JMH measurement loop is compiled:
26857 685 3 test.generated.Measure_multiply::multiply_avgt_jmhLoop (55 bytes)
@ 7 java.lang.System::nanoTime (0 bytes) intrinsic
@ 16 test.Measure::multiply (23 bytes)
@ 4 java.util.Arrays::stream (8 bytes)
@ 4 java.util.Arrays::stream (11 bytes)
@ 3 java.util.Arrays::spliterator (10 bytes)
@ 6 java.util.Spliterators::spliterator (25 bytes) callee is too large
@ 7 java.util.stream.StreamSupport::intStream (14 bytes)
@ 6 java.util.stream.StreamOpFlag::fromCharacteristics (37 bytes) callee is too large
@ 10 java.util.stream.IntPipeline$Head::<init> (8 bytes)
@ 4 java.util.stream.IntPipeline::<init> (8 bytes)
@ 4 java.util.stream.AbstractPipeline::<init> (55 bytes) callee is too large
@ 11 java.util.stream.IntPipeline::map (26 bytes)
@ 1 java.util.Objects::requireNonNull (14 bytes)
@ 8 java.lang.NullPointerException::<init> (5 bytes) don't inline Throwable constructors
@ 22 java.util.stream.IntPipeline$3::<init> (20 bytes)
@ 16 java.util.stream.IntPipeline$StatelessOp::<init> (29 bytes) callee is too large
@ 16 java.util.stream.IntPipeline::sum (11 bytes)
@ 2 java.lang.invoke.LambdaForm$MH/1279902262::linkToTargetMethod (8 bytes) force inline by annotation
@ 4 java.lang.invoke.LambdaForm$MH/1847865997::identity (18 bytes) force inline by annotation
@ 14 java.lang.invoke.LambdaForm$DMH/2024969684::invokeStatic_L_L (14 bytes) force inline by annotation
@ 1 java.lang.invoke.DirectMethodHandle::internalMemberName (8 bytes) force inline by annotation
@ 10 sun.invoke.util.ValueConversions::identity (2 bytes)
@ 7 java.util.stream.IntPipeline::reduce (16 bytes)
@ 3 java.util.stream.ReduceOps::makeInt (18 bytes)
@ 1 java.util.Objects::requireNonNull (14 bytes)
@ 14 java.util.stream.ReduceOps$5::<init> (16 bytes)
@ 12 java.util.stream.ReduceOps$ReduceOp::<init> (10 bytes)
@ 1 java.lang.Object::<init> (1 bytes)
@ 6 java.util.stream.AbstractPipeline::evaluate (94 bytes) callee is too large
@ 12 java.lang.Integer::intValue (5 bytes)
Note that there is no attempt to inline the whole stream pipeline, it stops well before it reached the hot loop, see "callee is too large", thereby re-optimizing the hot loop.
The inline limit can be increased to avoid such behaviour, for example -XX:MaxInlineLevel=12
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With