Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java 8 stream unpredictable performance drop with no obvious reason

I am using Java 8 streams to iterate over a list with sublists. The outer list size varies between 100 to 1000 (different test runs) and the inner list size is always 5.

There are 2 benchmark runs which show unexpected performance deviations.

package benchmark;  import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.infra.Blackhole;  import java.io.IOException; import java.util.concurrent.ThreadLocalRandom; import java.util.*; import java.util.function.*; import java.util.stream.*;  @Threads(32) @Warmup(iterations = 25) @Measurement(iterations = 5) @State(Scope.Benchmark) @Fork(1) @BenchmarkMode(Mode.Throughput) public class StreamBenchmark {     @Param({"700", "600", "500", "400", "300", "200", "100"})     int outerListSizeParam;     final static int INNER_LIST_SIZE = 5;     List<List<Integer>> list;      Random rand() {         return ThreadLocalRandom.current();     }      final BinaryOperator<Integer> reducer = (val1, val2) -> val1 + val2;      final Supplier<List<Integer>> supplier = () -> IntStream             .range(0, INNER_LIST_SIZE)             .mapToObj(ptr -> rand().nextInt(100))             .collect(Collectors.toList());      @Setup     public void init() throws IOException {         list = IntStream                 .range(0, outerListSizeParam)                 .mapToObj(i -> supplier.get())                 .collect(Collectors.toList());     }      @Benchmark     public void loop(Blackhole bh) throws Exception {         List<List<Integer>> res = new ArrayList<>();         for (List<Integer> innerList : list) {             if (innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) {                 res.add(innerList);             }         }          bh.consume(res);     }      @Benchmark     public void stream(Blackhole bh) throws Exception {         List<List<Integer>> res = list                 .stream()                 .filter(innerList -> innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000))                 .collect(Collectors.toList());          bh.consume(res);     } } 

Run 1

    Benchmark               (outerListSizeParam)   Mode  Cnt        Score        Error  Units     StreamBenchmark.loop                     700  thrpt    5    22488.601 ?   1128.543  ops/s     StreamBenchmark.loop                     600  thrpt    5    26010.430 ?   1161.854  ops/s     StreamBenchmark.loop                     500  thrpt    5   361837.395 ?  12777.016  ops/s     StreamBenchmark.loop                     400  thrpt    5   451774.457 ?  22517.801  ops/s     StreamBenchmark.loop                     300  thrpt    5   744677.723 ?  23456.913  ops/s     StreamBenchmark.loop                     200  thrpt    5  1102075.707 ?  38678.994  ops/s     StreamBenchmark.loop                     100  thrpt    5  2334981.090 ? 100973.551  ops/s     StreamBenchmark.stream                   700  thrpt    5    22320.346 ?    496.432  ops/s     StreamBenchmark.stream                   600  thrpt    5    26091.609 ?   1044.868  ops/s     StreamBenchmark.stream                   500  thrpt    5    31961.096 ?    497.854  ops/s     StreamBenchmark.stream                   400  thrpt    5   377701.859 ?  11115.990  ops/s     StreamBenchmark.stream                   300  thrpt    5    53887.652 ?   1228.245  ops/s     StreamBenchmark.stream                   200  thrpt    5    78754.294 ?   2173.316  ops/s     StreamBenchmark.stream                   100  thrpt    5  1564899.788 ?  47369.698  ops/s 

Run 2

    Benchmark               (outerListSizeParam)   Mode  Cnt        Score       Error  Units     StreamBenchmark.loop                    1000  thrpt   10    16179.702 ?   260.134  ops/s     StreamBenchmark.loop                     700  thrpt   10    22924.319 ?   329.134  ops/s     StreamBenchmark.loop                     600  thrpt   10    26871.267 ?   416.464  ops/s     StreamBenchmark.loop                     500  thrpt   10   353043.221 ?  6628.980  ops/s     StreamBenchmark.loop                     300  thrpt   10   772234.261 ? 10075.536  ops/s     StreamBenchmark.loop                     100  thrpt   10  2357125.442 ? 30824.834  ops/s     StreamBenchmark.stream                  1000  thrpt   10    15526.423 ?   147.454  ops/s     StreamBenchmark.stream                   700  thrpt   10    22347.898 ?   117.360  ops/s     StreamBenchmark.stream                   600  thrpt   10    26172.790 ?   229.745  ops/s     StreamBenchmark.stream                   500  thrpt   10    31643.518 ?   428.680  ops/s     StreamBenchmark.stream                   300  thrpt   10   536037.041 ?  6176.192  ops/s     StreamBenchmark.stream                   100  thrpt   10   153619.054 ?  1450.839  ops/s 

I have two questions:

  1. Why is there a consistent, significant performance difference between loop+500 and loop+600 for both test runs?
  2. Why in Run1 stream+400 and Run2 stream+300 is there a significant but inconsistent performance deviation?

It looks like the JIT sometimes makes suboptimal optimization decisions, causing a huge performance drop.

The test machine has 128GB RAM and 32 CPU cores:

    java version "1.8.0_45"     Java(TM) SE Runtime Environment (build 1.8.0_45-b14)     Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)      Architecture:          x86_64     CPU op-mode(s):        32-bit, 64-bit     Byte Order:            Little Endian     CPU(s):                32     On-line CPU(s) list:   0-31     Thread(s) per core:    2     Core(s) per socket:    8     Socket(s):             2     NUMA node(s):          2     Vendor ID:             GenuineIntel     CPU family:            6     Model:                 62     Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz     Stepping:              4     CPU MHz:               1201.078     CPU max MHz:           3400.0000     CPU min MHz:           1200.0000     BogoMIPS:              5201.67     Virtualization:        VT-x     L1d cache:             32K     L1i cache:             32K     L2 cache:              256K     L3 cache:              20480K     NUMA node0 CPU(s):     0-7,16-23     NUMA node1 CPU(s):     8-15,24-31 

P.S. Added benchmark with no stream. These tests (loop + stream + pureLoop) make me think that using streams and lambdas would require a lot of micro optimisation efforts and does not guarantee consistent performance anyway.

@Benchmark public void pureLoop(Blackhole bh) throws Exception {     List<List<Integer>> res = new ArrayList<>();     for (List<Integer> innerList : list) {         int sum = 0;         for (Integer i : innerList) {             sum += i;         }         if (sum == rand().nextInt(2000))             res.add(innerList);     }      bh.consume(res); } 

Run 3 (pure loops)

Benchmark                 (outerListSizeParam)   Mode  Cnt         Score        Error  Units StreamBenchmark.loop                      1000  thrpt    5     15848.277 ?    445.624  ops/s StreamBenchmark.loop                       700  thrpt    5     22330.289 ?    484.554  ops/s StreamBenchmark.loop                       600  thrpt    5     26353.565 ?    631.421  ops/s StreamBenchmark.loop                       500  thrpt    5    358144.956 ?   8273.981  ops/s StreamBenchmark.loop                       400  thrpt    5    591471.382 ?  17725.212  ops/s StreamBenchmark.loop                       300  thrpt    5    785458.022 ?  23775.650  ops/s StreamBenchmark.loop                       200  thrpt    5   1192328.880 ?  40006.056  ops/s StreamBenchmark.loop                       100  thrpt    5   2330555.766 ?  73143.081  ops/s StreamBenchmark.pureLoop                  1000  thrpt    5   1024629.128 ?   4387.106  ops/s StreamBenchmark.pureLoop                   700  thrpt    5   1495365.029 ?  31659.941  ops/s StreamBenchmark.pureLoop                   600  thrpt    5   1787432.825 ?  16611.868  ops/s StreamBenchmark.pureLoop                   500  thrpt    5   2087093.023 ?  20143.165  ops/s StreamBenchmark.pureLoop                   400  thrpt    5   2662946.999 ?  33326.079  ops/s StreamBenchmark.pureLoop                   300  thrpt    5   3657830.227 ?  55020.775  ops/s StreamBenchmark.pureLoop                   200  thrpt    5   5365706.786 ?  64404.783  ops/s StreamBenchmark.pureLoop                   100  thrpt    5  10477430.730 ? 187641.413  ops/s StreamBenchmark.stream                    1000  thrpt    5     15576.304 ?    250.620  ops/s StreamBenchmark.stream                     700  thrpt    5     22286.965 ?   1153.734  ops/s StreamBenchmark.stream                     600  thrpt    5     26109.258 ?    296.382  ops/s StreamBenchmark.stream                     500  thrpt    5     31343.986 ?   1270.210  ops/s StreamBenchmark.stream                     400  thrpt    5     39696.775 ?   1812.355  ops/s StreamBenchmark.stream                     300  thrpt    5    536932.353 ?  41249.909  ops/s StreamBenchmark.stream                     200  thrpt    5     77797.301 ?    976.641  ops/s StreamBenchmark.stream                     100  thrpt    5    155387.348 ?   3182.841  ops/s 

Solution: as recommended by apangin disabling tiered compilation made JIT results stable.

java -XX:-TieredCompilation -jar test-jmh.jar  Benchmark                 (outerListSizeParam)   Mode  Cnt         Score        Error  Units StreamBenchmark.loop                      1000  thrpt    5    160410.288 ?   4426.320  ops/s StreamBenchmark.loop                       700  thrpt    5    230524.018 ?   4426.740  ops/s StreamBenchmark.loop                       600  thrpt    5    266266.663 ?   9078.827  ops/s StreamBenchmark.loop                       500  thrpt    5    324182.307 ?   8452.368  ops/s StreamBenchmark.loop                       400  thrpt    5    400793.677 ?  12526.475  ops/s StreamBenchmark.loop                       300  thrpt    5    534618.231 ?  25616.352  ops/s StreamBenchmark.loop                       200  thrpt    5    803314.614 ?  33108.005  ops/s StreamBenchmark.loop                       100  thrpt    5   1827400.764 ?  13868.253  ops/s StreamBenchmark.pureLoop                  1000  thrpt    5   1126873.129 ?  33307.600  ops/s StreamBenchmark.pureLoop                   700  thrpt    5   1560200.150 ? 150146.319  ops/s StreamBenchmark.pureLoop                   600  thrpt    5   1848113.823 ?  16195.103  ops/s StreamBenchmark.pureLoop                   500  thrpt    5   2250201.116 ? 130995.240  ops/s StreamBenchmark.pureLoop                   400  thrpt    5   2839212.063 ? 142008.523  ops/s StreamBenchmark.pureLoop                   300  thrpt    5   3807436.825 ? 140612.798  ops/s StreamBenchmark.pureLoop                   200  thrpt    5   5724311.256 ?  77031.417  ops/s StreamBenchmark.pureLoop                   100  thrpt    5  11718427.224 ? 101424.952  ops/s StreamBenchmark.stream                    1000  thrpt    5     16186.121 ?    249.806  ops/s StreamBenchmark.stream                     700  thrpt    5     22071.884 ?    703.729  ops/s StreamBenchmark.stream                     600  thrpt    5     25546.378 ?    472.804  ops/s StreamBenchmark.stream                     500  thrpt    5     32271.659 ?    437.048  ops/s StreamBenchmark.stream                     400  thrpt    5     39755.841 ?    506.207  ops/s StreamBenchmark.stream                     300  thrpt    5     52309.706 ?   1271.206  ops/s StreamBenchmark.stream                     200  thrpt    5     79277.532 ?   2040.740  ops/s StreamBenchmark.stream                     100  thrpt    5    161244.347 ?   3882.619  ops/s 
like image 416
alexd84 Avatar asked Jun 06 '15 21:06

alexd84


People also ask

Why is Java 8 stream lazy?

Streams are lazy because intermediate operations are not evaluated unless terminal operation is invoked. Each intermediate operation creates a new stream, stores the provided operation/function and return the new stream.

Does Java 8 stream improve performance?

In Java8 Streams, performance is achieved by parallelism, laziness, and using short-circuit operations, but there is a downside as well, and we need to be very cautious while choosing Streams, as it may degrade the performance of your application.

When should you not use Parallelstream?

Similarly, don't use parallel if the stream is ordered and has much more elements than you want to process, e.g. This may run much longer because the parallel threads may work on plenty of number ranges instead of the crucial one 0-100, causing this to take very long time.

Is Java stream slower than for loop?

Yes, streams are sometimes slower than loops, but they can also be equally fast; it depends on the circumstances. The point to take home is that sequential streams are no faster than loops.


1 Answers

This effect is caused by Type Profile Pollution. Let me explain on a simplified benchmark:

@State(Scope.Benchmark) public class Streams {     @Param({"500", "520"})     int iterations;      @Setup     public void init() {         for (int i = 0; i < iterations; i++) {             Stream.empty().reduce((x, y) -> x);         }     }      @Benchmark     public long loop() {         return Stream.empty().count();     } } 

Though iteration parameter here changes very slightly and it does not affect the main benchmark loop, the results expose very surprising 2.5x performance degradation:

Benchmark     (iterations)   Mode  Cnt      Score     Error   Units Streams.loop           500  thrpt    5  29491,039 ± 240,953  ops/ms Streams.loop           520  thrpt    5  11867,860 ± 344,779  ops/ms 

Now let's run JMH with -prof perfasm option to see the hottest code regions:

Fast case (iterations = 500):

....[Hottest Methods (after inlining)]..................................  48,66%  bench.generated.Streams_loop::loop_thrpt_jmhStub  23,14%  <unknown>   2,99%  java.util.stream.Sink$ChainedReference::<init>   1,98%  org.openjdk.jmh.infra.Blackhole::consume   1,68%  java.util.Objects::requireNonNull   0,65%  java.util.stream.AbstractPipeline::evaluate 

Slow case (iterations = 520):

....[Hottest Methods (after inlining)]..................................  40,09%  java.util.stream.ReduceOps$ReduceOp::evaluateSequential  22,02%  <unknown>  17,61%  bench.generated.Streams_loop::loop_thrpt_jmhStub   1,25%  org.openjdk.jmh.infra.Blackhole::consume   0,74%  java.util.stream.AbstractPipeline::evaluate 

Looks like the slow case spends the most time in ReduceOp.evaluateSequential method that is not inlined. Furthermore, if we study the assembly code for this method we'll find that the longest operation is checkcast.

You know how HotSpot compiler works: before the JIT starts, a method is executed in interpreter for some time to collect the profile data, e.g. what methods are called, what classes are seen, what branches are taken etc. With Tiered compilation the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimizied code. However if the application changes execution pattern in the middle, the generated code may be not optimal for the modified behavior.

Let's use -XX:+PrintMethodData (available in debug JVM) to compare the execution profiles:

----- Fast case ----- java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;   interpreter_invocation_count:    13382    invocation_counter:              13382    backedge_counter:                    0    mdo size: 552 bytes  0 aload_1 1 fast_aload_0 2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;>    0   bci: 2    VirtualCallData     count(0) entries(1)                                     'java/util/stream/ReduceOps$8'(12870 1.00) 5 aload_2 6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;>    48  bci: 6    VirtualCallData     count(0) entries(1)                                     'java/util/stream/ReferencePipeline$5'(12870 1.00) 9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>   96  bci: 9    ReceiverTypeData    count(0) entries(1)                                     'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00) 12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;>    144 bci: 12   VirtualCallData     count(0) entries(1)                                     'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00) 17 areturn  ----- Slow case ----- java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;   interpreter_invocation_count:    54751    invocation_counter:              54751    backedge_counter:                    0    mdo size: 552 bytes  0 aload_1 1 fast_aload_0 2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;>    0   bci: 2    VirtualCallData     count(0) entries(2)                                     'java/util/stream/ReduceOps$2'(16 0.00)                                     'java/util/stream/ReduceOps$8'(54223 1.00) 5 aload_2 6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;>    48  bci: 6    VirtualCallData     count(0) entries(2)                                     'java/util/stream/ReferencePipeline$Head'(16 0.00)                                     'java/util/stream/ReferencePipeline$5'(54223 1.00) 9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>   96  bci: 9    ReceiverTypeData    count(0) entries(2)                                     'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)                                     'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00) 12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;>    144 bci: 12   VirtualCallData     count(0) entries(2)                                     'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)                                     'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00) 17 areturn 

You see, the initialization loop ran too long that its statistics appeared in the execution profile: all virtual methods have two implementations and checkcast has also two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and JIT can easily inline and optimize them.

The same is true for your original benchmark: longer stream operations in init() method polluted the profile. If you play with profile and tiered compilation options, the results can be quite different. For example, try

  1. -XX:-ProfileInterpreter
  2. -XX:Tier3InvocationThreshold=1000
  3. -XX:-TieredCompilation

Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hope this will be improved in Java 9.

like image 99
apangin Avatar answered Sep 19 '22 17:09

apangin