I am using Java 8 streams to iterate over a list with sublists. The outer list size varies between 100 to 1000 (different test runs) and the inner list size is always 5.
There are 2 benchmark runs which show unexpected performance deviations.
package benchmark; import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.infra.Blackhole; import java.io.IOException; import java.util.concurrent.ThreadLocalRandom; import java.util.*; import java.util.function.*; import java.util.stream.*; @Threads(32) @Warmup(iterations = 25) @Measurement(iterations = 5) @State(Scope.Benchmark) @Fork(1) @BenchmarkMode(Mode.Throughput) public class StreamBenchmark { @Param({"700", "600", "500", "400", "300", "200", "100"}) int outerListSizeParam; final static int INNER_LIST_SIZE = 5; List<List<Integer>> list; Random rand() { return ThreadLocalRandom.current(); } final BinaryOperator<Integer> reducer = (val1, val2) -> val1 + val2; final Supplier<List<Integer>> supplier = () -> IntStream .range(0, INNER_LIST_SIZE) .mapToObj(ptr -> rand().nextInt(100)) .collect(Collectors.toList()); @Setup public void init() throws IOException { list = IntStream .range(0, outerListSizeParam) .mapToObj(i -> supplier.get()) .collect(Collectors.toList()); } @Benchmark public void loop(Blackhole bh) throws Exception { List<List<Integer>> res = new ArrayList<>(); for (List<Integer> innerList : list) { if (innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) { res.add(innerList); } } bh.consume(res); } @Benchmark public void stream(Blackhole bh) throws Exception { List<List<Integer>> res = list .stream() .filter(innerList -> innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) .collect(Collectors.toList()); bh.consume(res); } }
Run 1
Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 700 thrpt 5 22488.601 ? 1128.543 ops/s StreamBenchmark.loop 600 thrpt 5 26010.430 ? 1161.854 ops/s StreamBenchmark.loop 500 thrpt 5 361837.395 ? 12777.016 ops/s StreamBenchmark.loop 400 thrpt 5 451774.457 ? 22517.801 ops/s StreamBenchmark.loop 300 thrpt 5 744677.723 ? 23456.913 ops/s StreamBenchmark.loop 200 thrpt 5 1102075.707 ? 38678.994 ops/s StreamBenchmark.loop 100 thrpt 5 2334981.090 ? 100973.551 ops/s StreamBenchmark.stream 700 thrpt 5 22320.346 ? 496.432 ops/s StreamBenchmark.stream 600 thrpt 5 26091.609 ? 1044.868 ops/s StreamBenchmark.stream 500 thrpt 5 31961.096 ? 497.854 ops/s StreamBenchmark.stream 400 thrpt 5 377701.859 ? 11115.990 ops/s StreamBenchmark.stream 300 thrpt 5 53887.652 ? 1228.245 ops/s StreamBenchmark.stream 200 thrpt 5 78754.294 ? 2173.316 ops/s StreamBenchmark.stream 100 thrpt 5 1564899.788 ? 47369.698 ops/s
Run 2
Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 1000 thrpt 10 16179.702 ? 260.134 ops/s StreamBenchmark.loop 700 thrpt 10 22924.319 ? 329.134 ops/s StreamBenchmark.loop 600 thrpt 10 26871.267 ? 416.464 ops/s StreamBenchmark.loop 500 thrpt 10 353043.221 ? 6628.980 ops/s StreamBenchmark.loop 300 thrpt 10 772234.261 ? 10075.536 ops/s StreamBenchmark.loop 100 thrpt 10 2357125.442 ? 30824.834 ops/s StreamBenchmark.stream 1000 thrpt 10 15526.423 ? 147.454 ops/s StreamBenchmark.stream 700 thrpt 10 22347.898 ? 117.360 ops/s StreamBenchmark.stream 600 thrpt 10 26172.790 ? 229.745 ops/s StreamBenchmark.stream 500 thrpt 10 31643.518 ? 428.680 ops/s StreamBenchmark.stream 300 thrpt 10 536037.041 ? 6176.192 ops/s StreamBenchmark.stream 100 thrpt 10 153619.054 ? 1450.839 ops/s
I have two questions:
It looks like the JIT sometimes makes suboptimal optimization decisions, causing a huge performance drop.
The test machine has 128GB RAM and 32 CPU cores:
java version "1.8.0_45" Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 62 Model name: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz Stepping: 4 CPU MHz: 1201.078 CPU max MHz: 3400.0000 CPU min MHz: 1200.0000 BogoMIPS: 5201.67 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 20480K NUMA node0 CPU(s): 0-7,16-23 NUMA node1 CPU(s): 8-15,24-31
P.S. Added benchmark with no stream. These tests (loop + stream + pureLoop) make me think that using streams and lambdas would require a lot of micro optimisation efforts and does not guarantee consistent performance anyway.
@Benchmark public void pureLoop(Blackhole bh) throws Exception { List<List<Integer>> res = new ArrayList<>(); for (List<Integer> innerList : list) { int sum = 0; for (Integer i : innerList) { sum += i; } if (sum == rand().nextInt(2000)) res.add(innerList); } bh.consume(res); }
Run 3 (pure loops)
Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 1000 thrpt 5 15848.277 ? 445.624 ops/s StreamBenchmark.loop 700 thrpt 5 22330.289 ? 484.554 ops/s StreamBenchmark.loop 600 thrpt 5 26353.565 ? 631.421 ops/s StreamBenchmark.loop 500 thrpt 5 358144.956 ? 8273.981 ops/s StreamBenchmark.loop 400 thrpt 5 591471.382 ? 17725.212 ops/s StreamBenchmark.loop 300 thrpt 5 785458.022 ? 23775.650 ops/s StreamBenchmark.loop 200 thrpt 5 1192328.880 ? 40006.056 ops/s StreamBenchmark.loop 100 thrpt 5 2330555.766 ? 73143.081 ops/s StreamBenchmark.pureLoop 1000 thrpt 5 1024629.128 ? 4387.106 ops/s StreamBenchmark.pureLoop 700 thrpt 5 1495365.029 ? 31659.941 ops/s StreamBenchmark.pureLoop 600 thrpt 5 1787432.825 ? 16611.868 ops/s StreamBenchmark.pureLoop 500 thrpt 5 2087093.023 ? 20143.165 ops/s StreamBenchmark.pureLoop 400 thrpt 5 2662946.999 ? 33326.079 ops/s StreamBenchmark.pureLoop 300 thrpt 5 3657830.227 ? 55020.775 ops/s StreamBenchmark.pureLoop 200 thrpt 5 5365706.786 ? 64404.783 ops/s StreamBenchmark.pureLoop 100 thrpt 5 10477430.730 ? 187641.413 ops/s StreamBenchmark.stream 1000 thrpt 5 15576.304 ? 250.620 ops/s StreamBenchmark.stream 700 thrpt 5 22286.965 ? 1153.734 ops/s StreamBenchmark.stream 600 thrpt 5 26109.258 ? 296.382 ops/s StreamBenchmark.stream 500 thrpt 5 31343.986 ? 1270.210 ops/s StreamBenchmark.stream 400 thrpt 5 39696.775 ? 1812.355 ops/s StreamBenchmark.stream 300 thrpt 5 536932.353 ? 41249.909 ops/s StreamBenchmark.stream 200 thrpt 5 77797.301 ? 976.641 ops/s StreamBenchmark.stream 100 thrpt 5 155387.348 ? 3182.841 ops/s
Solution: as recommended by apangin disabling tiered compilation made JIT results stable.
java -XX:-TieredCompilation -jar test-jmh.jar Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 1000 thrpt 5 160410.288 ? 4426.320 ops/s StreamBenchmark.loop 700 thrpt 5 230524.018 ? 4426.740 ops/s StreamBenchmark.loop 600 thrpt 5 266266.663 ? 9078.827 ops/s StreamBenchmark.loop 500 thrpt 5 324182.307 ? 8452.368 ops/s StreamBenchmark.loop 400 thrpt 5 400793.677 ? 12526.475 ops/s StreamBenchmark.loop 300 thrpt 5 534618.231 ? 25616.352 ops/s StreamBenchmark.loop 200 thrpt 5 803314.614 ? 33108.005 ops/s StreamBenchmark.loop 100 thrpt 5 1827400.764 ? 13868.253 ops/s StreamBenchmark.pureLoop 1000 thrpt 5 1126873.129 ? 33307.600 ops/s StreamBenchmark.pureLoop 700 thrpt 5 1560200.150 ? 150146.319 ops/s StreamBenchmark.pureLoop 600 thrpt 5 1848113.823 ? 16195.103 ops/s StreamBenchmark.pureLoop 500 thrpt 5 2250201.116 ? 130995.240 ops/s StreamBenchmark.pureLoop 400 thrpt 5 2839212.063 ? 142008.523 ops/s StreamBenchmark.pureLoop 300 thrpt 5 3807436.825 ? 140612.798 ops/s StreamBenchmark.pureLoop 200 thrpt 5 5724311.256 ? 77031.417 ops/s StreamBenchmark.pureLoop 100 thrpt 5 11718427.224 ? 101424.952 ops/s StreamBenchmark.stream 1000 thrpt 5 16186.121 ? 249.806 ops/s StreamBenchmark.stream 700 thrpt 5 22071.884 ? 703.729 ops/s StreamBenchmark.stream 600 thrpt 5 25546.378 ? 472.804 ops/s StreamBenchmark.stream 500 thrpt 5 32271.659 ? 437.048 ops/s StreamBenchmark.stream 400 thrpt 5 39755.841 ? 506.207 ops/s StreamBenchmark.stream 300 thrpt 5 52309.706 ? 1271.206 ops/s StreamBenchmark.stream 200 thrpt 5 79277.532 ? 2040.740 ops/s StreamBenchmark.stream 100 thrpt 5 161244.347 ? 3882.619 ops/s
Streams are lazy because intermediate operations are not evaluated unless terminal operation is invoked. Each intermediate operation creates a new stream, stores the provided operation/function and return the new stream.
In Java8 Streams, performance is achieved by parallelism, laziness, and using short-circuit operations, but there is a downside as well, and we need to be very cautious while choosing Streams, as it may degrade the performance of your application.
Similarly, don't use parallel if the stream is ordered and has much more elements than you want to process, e.g. This may run much longer because the parallel threads may work on plenty of number ranges instead of the crucial one 0-100, causing this to take very long time.
Yes, streams are sometimes slower than loops, but they can also be equally fast; it depends on the circumstances. The point to take home is that sequential streams are no faster than loops.
This effect is caused by Type Profile Pollution. Let me explain on a simplified benchmark:
@State(Scope.Benchmark) public class Streams { @Param({"500", "520"}) int iterations; @Setup public void init() { for (int i = 0; i < iterations; i++) { Stream.empty().reduce((x, y) -> x); } } @Benchmark public long loop() { return Stream.empty().count(); } }
Though iteration
parameter here changes very slightly and it does not affect the main benchmark loop, the results expose very surprising 2.5x performance degradation:
Benchmark (iterations) Mode Cnt Score Error Units Streams.loop 500 thrpt 5 29491,039 ± 240,953 ops/ms Streams.loop 520 thrpt 5 11867,860 ± 344,779 ops/ms
Now let's run JMH with -prof perfasm
option to see the hottest code regions:
Fast case (iterations = 500):
....[Hottest Methods (after inlining)].................................. 48,66% bench.generated.Streams_loop::loop_thrpt_jmhStub 23,14% <unknown> 2,99% java.util.stream.Sink$ChainedReference::<init> 1,98% org.openjdk.jmh.infra.Blackhole::consume 1,68% java.util.Objects::requireNonNull 0,65% java.util.stream.AbstractPipeline::evaluate
Slow case (iterations = 520):
....[Hottest Methods (after inlining)].................................. 40,09% java.util.stream.ReduceOps$ReduceOp::evaluateSequential 22,02% <unknown> 17,61% bench.generated.Streams_loop::loop_thrpt_jmhStub 1,25% org.openjdk.jmh.infra.Blackhole::consume 0,74% java.util.stream.AbstractPipeline::evaluate
Looks like the slow case spends the most time in ReduceOp.evaluateSequential
method that is not inlined. Furthermore, if we study the assembly code for this method we'll find that the longest operation is checkcast
.
You know how HotSpot compiler works: before the JIT starts, a method is executed in interpreter for some time to collect the profile data, e.g. what methods are called, what classes are seen, what branches are taken etc. With Tiered compilation the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimizied code. However if the application changes execution pattern in the middle, the generated code may be not optimal for the modified behavior.
Let's use -XX:+PrintMethodData
(available in debug JVM) to compare the execution profiles:
----- Fast case ----- java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object; interpreter_invocation_count: 13382 invocation_counter: 13382 backedge_counter: 0 mdo size: 552 bytes 0 aload_1 1 fast_aload_0 2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 0 bci: 2 VirtualCallData count(0) entries(1) 'java/util/stream/ReduceOps$8'(12870 1.00) 5 aload_2 6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 48 bci: 6 VirtualCallData count(0) entries(1) 'java/util/stream/ReferencePipeline$5'(12870 1.00) 9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink> 96 bci: 9 ReceiverTypeData count(0) entries(1) 'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00) 12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 144 bci: 12 VirtualCallData count(0) entries(1) 'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00) 17 areturn ----- Slow case ----- java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object; interpreter_invocation_count: 54751 invocation_counter: 54751 backedge_counter: 0 mdo size: 552 bytes 0 aload_1 1 fast_aload_0 2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 0 bci: 2 VirtualCallData count(0) entries(2) 'java/util/stream/ReduceOps$2'(16 0.00) 'java/util/stream/ReduceOps$8'(54223 1.00) 5 aload_2 6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 48 bci: 6 VirtualCallData count(0) entries(2) 'java/util/stream/ReferencePipeline$Head'(16 0.00) 'java/util/stream/ReferencePipeline$5'(54223 1.00) 9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink> 96 bci: 9 ReceiverTypeData count(0) entries(2) 'java/util/stream/ReduceOps$2ReducingSink'(16 0.00) 'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00) 12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 144 bci: 12 VirtualCallData count(0) entries(2) 'java/util/stream/ReduceOps$2ReducingSink'(16 0.00) 'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00) 17 areturn
You see, the initialization loop ran too long that its statistics appeared in the execution profile: all virtual methods have two implementations and checkcast has also two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and JIT can easily inline and optimize them.
The same is true for your original benchmark: longer stream operations in init()
method polluted the profile. If you play with profile and tiered compilation options, the results can be quite different. For example, try
-XX:-ProfileInterpreter
-XX:Tier3InvocationThreshold=1000
-XX:-TieredCompilation
Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hope this will be improved in Java 9.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With