
Java: manually-unrolled loop is still faster than the original loop. Why?

Tags: java, performance, optimization, jit

Consider the following two snippets of code on an array of length 2:

boolean isOK(int i) {
    for (int j = 0; j < filters.length; ++j) {
        if (!filters[j].isOK(i)) {
            return false;
        }
    }
    return true;
}

and

boolean isOK(int i) {
     return filters[0].isOK(i) && filters[1].isOK(i);
}

I would assume that the performance of these two pieces should be similar after sufficient warm-up.
I've checked this using the JMH micro-benchmarking framework (as described e.g. here and here) and observed that the second snippet is more than 10% faster.

Question: why hasn't Java optimized my first snippet using the basic loop unrolling technique?
In particular, I'd like to understand the following:

1. I can easily produce code that is optimal for the case of 2 filters and still works for any other number of filters (imagine a simple builder): return (filters.length) == 2 ? new FilterChain2(filters) : new FilterChain1(filters). Can JITC do the same, and if not, why?
2. Can JITC detect that 'filters.length==2' is the most frequent case and produce code that is optimal for this case after some warm-up? This should be almost as optimal as the manually-unrolled version.
3. Can JITC detect that a particular instance is used very frequently and then produce code for this specific instance (for which it knows that the number of filters is always 2)?
   Update: got an answer that JITC works only on a class level. OK, got it.

Ideally, I would like to receive an answer from someone with a deep understanding of how JITC works.

Benchmark run details:

Tried on latest versions of Java 8 OpenJDK and Oracle HotSpot, the results are similar
Used Java flags: -Xmx4g -Xms4g -server -Xbatch -XX:CICompilerCount=2 (got similar results without the fancy flags as well)
By the way, I get similar run time ratio if I simply run it several billion times in a loop (not via JMH), i.e. the second snippet is always clearly faster

Typical benchmark output:

Benchmark (filterIndex) Mode Cnt Score Error Units
LoopUnrollingBenchmark.runBenchmark 0 avgt 400 44.202 ± 0.224 ns/op
LoopUnrollingBenchmark.runBenchmark 1 avgt 400 38.347 ± 0.063 ns/op

(The first line corresponds to the first snippet, the second line to the second.)

Complete benchmark code:

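import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class LoopUnrollingBenchmark {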

    @State(Scope.Benchmark)
    public static class BenchmarkData {
        public Filter[] filters;
        @Param({"0", "1"})
        public int filterIndex;
        public int num;

        @Setup(Level.Invocation) //similar ratio with Level.TRIAL
        public void setUp() {
            filters = new Filter[]{new FilterChain1(), new FilterChain2()};
            num = new Random().nextInt();
        }
    }

    @Benchmark
    @Fork(warmups = 5, value = 20)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public int runBenchmark(BenchmarkData data) {
        Filter filter = data.filters[data.filterIndex];
        int sum = 0;
        int num = data.num;
        if (filter.isOK(num)) {
            ++sum;
        }
        if (filter.isOK(num + 1)) {
            ++sum;
        }
        if (filter.isOK(num - 1)) {
            ++sum;
        }
        if (filter.isOK(num * 2)) {
            ++sum;
        }
        if (filter.isOK(num * 3)) {
            ++sum;
        }
        if (filter.isOK(num * 5)) {
            ++sum;
        }
        return sum;
    }


    interface Filter {
        boolean isOK(int i);
    }

    static class Filter1 implements Filter {
        @Override
        public boolean isOK(int i) {
            return i % 3 == 1;
        }
    }

    static class Filter2 implements Filter {
        @Override
        public boolean isOK(int i) {
            return i % 7 == 3;
        }
    }

    static class FilterChain1 implements Filter {
        final Filter[] filters = createLeafFilters();

        @Override
        public boolean isOK(int i) {
            for (int j = 0; j < filters.length; ++j) {
                if (!filters[j].isOK(i)) {
                    return false;
                }
            }
            return true;
        }
    }

    static class FilterChain2 implements Filter {
        final Filter[] filters = createLeafFilters();

        @Override
        public boolean isOK(int i) {
            return filters[0].isOK(i) && filters[1].isOK(i);
        }
    }

    private static Filter[] createLeafFilters() {
        Filter[] filters = new Filter[2];
        filters[0] = new Filter1();
        filters[1] = new Filter2();
        return filters;
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}

asked Nov 22 '19 13:11

Alexander

2 Answers

TL;DR The main reason for the performance difference here is not loop unrolling. It is type speculation and inline caches.

Unrolling strategies

In fact, in HotSpot terminology, such loops are treated as counted, and in certain cases the JVM can unroll them. Not in your case though.

HotSpot has two loop unrolling strategies: 1) unroll maximally, i.e. remove the loop altogether; or 2) glue several consecutive iterations together.

Maximal unrolling can be done only if the exact number of iterations is known.

  if (!cl->has_exact_trip_count()) {
    // Trip count is not exact.
    return false;
  }

In your case, however, the function may return early after the first iteration.

Partial unrolling could probably be applied, but the following condition prevents it:

  // Don't unroll if the next round of unrolling would push us
  // over the expected trip count of the loop.  One is subtracted
  // from the expected trip count because the pre-loop normally
  // executes 1 iteration.
  if (UnrollLimitForProfileCheck > 0 &&
      cl->profile_trip_cnt() != COUNT_UNKNOWN &&
      future_unroll_ct        > UnrollLimitForProfileCheck &&
      (float)future_unroll_ct > cl->profile_trip_cnt() - 1.0) {
    return false;
  }

Since in your case the expected trip count is less than 2, HotSpot assumes it is not worth unrolling even two iterations. Note that the first iteration is extracted into a pre-loop anyway (the loop peeling optimization), so unrolling would indeed not be very beneficial here.
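To make the arithmetic of that check concrete, here is a rough Java transcription of the condition quoted above. It is only a sketch: the method name profileAllowsUnroll is hypothetical, the COUNT_UNKNOWN sentinel and the limit of 1 passed for UnrollLimitForProfileCheck are assumed values, and the trip count of 7/6 is merely what an 86%/14% receiver split would imply, not a measured profile.

```java
public class UnrollCheckSketch {

    // Sentinel meaning "no profile data"; the concrete value is an assumption.
    static final float COUNT_UNKNOWN = -1.0f;

    /**
     * Mirrors the quoted HotSpot condition: returns false when the profile
     * says the next round of unrolling would exceed the expected trip count.
     */
    static boolean profileAllowsUnroll(int unrollLimitForProfileCheck,
                                       float profileTripCnt,
                                       int futureUnrollCt) {
        if (unrollLimitForProfileCheck > 0
                && profileTripCnt != COUNT_UNKNOWN
                && futureUnrollCt > unrollLimitForProfileCheck
                && (float) futureUnrollCt > profileTripCnt - 1.0f) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed profile: about 7 loop iterations per 6 calls of isOK().
        float shortLoop = 7.0f / 6.0f;
        System.out.println(profileAllowsUnroll(1, shortLoop, 2)); // prints false
        // A loop that typically runs 16 iterations passes the same check.
        System.out.println(profileAllowsUnroll(1, 16.0f, 2));     // prints true
    }
}
```

With a trip count barely above 1, unrolling by 2 already exceeds the "trip count minus one" budget, which matches the statement that HotSpot does not unroll even two iterations here.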

Type speculation

In your unrolled version, there are two different invokeinterface bytecodes. These sites have two distinct type profiles. The first receiver is always Filter1, and the second receiver is always Filter2. So, you basically have two monomorphic call sites, and HotSpot can perfectly inline both calls; this is the so-called "inline cache", which has a 100% hit ratio in this case.

With the loop, there is just one invokeinterface bytecode, and only one type profile is collected. HotSpot JVM sees that filters[j].isOK() is called 86% of the time with a Filter1 receiver and 14% of the time with a Filter2 receiver. This makes it a bimorphic call. Fortunately, HotSpot can speculatively inline bimorphic calls, too. It inlines both targets with a conditional branch. However, in this case the hit ratio will be at most 86%, and performance will suffer from the corresponding mispredicted branches at the architecture level.
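The 86%/14% split can be reproduced without looking at the JVM's profile at all. The following is only a sketch that counts receivers by hand: the class and method names are hypothetical, the two lambdas stand in for Filter1 and Filter2, and the six inputs per sample are the ones from the benchmark in the question.

```java
import java.util.Random;

public class ReceiverProfileSketch {

    interface Filter {
        boolean isOK(int i);
    }

    /** Returns the fraction of loop-body calls whose receiver is the first filter. */
    static double firstFilterShare(long seed, int samples) {
        // Stand-ins for Filter1 and Filter2 from the question.
        Filter[] filters = { i -> i % 3 == 1, i -> i % 7 == 3 };
        long[] receiverCount = new long[filters.length];
        Random random = new Random(seed);

        for (int s = 0; s < samples; s++) {
            int num = random.nextInt();
            // The same six arguments the benchmark method passes to isOK().
            int[] inputs = { num, num + 1, num - 1, num * 2, num * 3, num * 5 };
            for (int input : inputs) {
                // Same control flow as FilterChain1.isOK(), plus counting.
                for (int j = 0; j < filters.length; ++j) {
                    receiverCount[j]++;
                    if (!filters[j].isOK(input)) {
                        break;
                    }
                }
            }
        }
        return (double) receiverCount[0] / (receiverCount[0] + receiverCount[1]);
    }

    public static void main(String[] args) {
        System.out.printf("share of calls with first filter as receiver: %.3f%n",
                firstFilterShare(42L, 1_000_000));
    }
}
```

The first filter is the receiver on every call, while the second one is reached only when the first check passes, which happens for roughly one of the six inputs. That gives about six calls to one, i.e. a share close to 6/7.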

Things will be even worse if you have 3 or more different filters. In this case isOK() will be a megamorphic call, which HotSpot cannot inline at all. So, the compiled code will contain a true interface call, which has a larger performance impact.

More about speculative inlining in the article The Black Magic of (Java) Method Dispatch.

Conclusion

In order to inline virtual/interface calls, HotSpot JVM collects type profiles per invoke bytecode. If there is a virtual call in a loop, there will be just one type profile for the call, no matter if the loop is unrolled or not.

To get the best from the virtual call optimizations, you'd need to manually split the loop, primarily for the purpose of splitting type profiles. HotSpot cannot do this automatically so far.
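A manual split along these lines could look roughly like the builder from the question. This is a sketch, not a recommendation: PairChain, LoopChain and chainOf are hypothetical names, and the benefit depends on each call site in PairChain seeing a single receiver class at run time.

```java
public class FilterChainFactorySketch {

    interface Filter {
        boolean isOK(int i);
    }

    /** Generic fallback: one call site inside the loop, hence one shared type profile. */
    static final class LoopChain implements Filter {
        private final Filter[] filters;

        LoopChain(Filter[] filters) {
            this.filters = filters;
        }

        @Override
        public boolean isOK(int i) {
            for (Filter filter : filters) {
                if (!filter.isOK(i)) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Specialised for two filters: two call sites, each with its own type profile. */
    static final class PairChain implements Filter {
        private final Filter first;
        private final Filter second;

        PairChain(Filter first, Filter second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public boolean isOK(int i) {
            return first.isOK(i) && second.isOK(i);
        }
    }

    /** Picks the specialised chain when exactly two filters are given. */
    static Filter chainOf(Filter... filters) {
        return filters.length == 2
                ? new PairChain(filters[0], filters[1])
                : new LoopChain(filters.clone());
    }

    public static void main(String[] args) {
        Filter chain = chainOf(i -> i % 3 == 1, i -> i % 7 == 3);
        System.out.println(chain.isOK(10)); // prints true: 10 % 3 == 1 and 10 % 7 == 3
        System.out.println(chain.isOK(4));  // prints false: 4 % 7 == 4
    }
}
```

Since type profiles are collected per bytecode rather than per instance, the two call sites in PairChain stay monomorphic only as long as all PairChain instances are built from the same pair of filter classes.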

answered Oct 23 '22 08:10

apangin

The loop presented likely falls under the "non-counted" category of loops, which are loops for which the iteration count can be determined neither at compile time nor at run time. Not only because of @Andreas's argument about the array size, but also because of the randomly conditional break (that used to be in your benchmark when I wrote this post).

State-of-the-art compilers do not aggressively optimize them, since unrolling non-counted loops often also involves duplicating a loop’s exit condition, which thus only improves run-time performance if subsequent compiler optimizations can optimize the unrolled code. See this 2017 paper for details, where they propose how to unroll such loops too.

It follows that your assumption that you performed a kind of "manual unrolling" of the loop does not hold. You consider it a basic loop unrolling technique to transform an iteration over an array with a conditional break into an &&-chained boolean expression. I'd consider this a rather special case and would be surprised to find a HotSpot optimizer doing such a complex refactoring on the fly. Here they're discussing what it actually might do; perhaps this reference is interesting.

This would reflect closer the mechanics of a contemporary unrolling and is perhaps still nowhere near what unrolled machine code would look like:

if (!filters[0].isOK(i)) {
    return false;
}
if (!filters[1].isOK(i)) {
    return false;
}
return true;

You're concluding that, because one piece of code runs faster than another, the loop was not unrolled. Even if it was, you could still see a runtime difference, because you're comparing different implementations.

If you want to gain more certainty, there's the JITWatch analyzer/visualizer of the actual JIT operations, including machine code (github) (presentation slides). If there is eventually something to see, I'd trust my own eyes more than any opinion about what the JIT may or may not do in general, since every case has its specifics. Here they fret about the difficulty of arriving at general statements for specific cases as far as JIT is concerned and provide some interesting links.

Since your goal is minimum runtime, the a && b && c ... form is likely the most efficient one if you don't want to rely on hoping for loop unrolling; at least it is more efficient than anything else presented so far. But you can't have that in a generic way. With functional composition of java.util.function.Function there's huge overhead again (each Function is a class, each call is a virtual method that needs dispatch). Perhaps in such a scenario it might make sense to subvert the language level and generate custom bytecode at runtime. On the other hand, && logic requires branching at the bytecode level as well and may be equivalent to if/return (which also can't be generified without overhead).
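For reference, generic functional composition could be sketched as below. Note that this swaps in java.util.function.IntPredicate and its and() method rather than Function, because the filters here map an int to a boolean; the class name PredicateChainSketch and the helper allOf are hypothetical. It shows the shape of the generic approach only and says nothing about how fast it runs.

```java
import java.util.function.IntPredicate;
import java.util.stream.Stream;

public class PredicateChainSketch {

    /** Folds any number of predicates into one short-circuiting conjunction. */
    static IntPredicate allOf(IntPredicate... filters) {
        // Each and() wraps the previous predicate in a new object, so a call
        // to test() goes through one extra interface dispatch per filter.
        return Stream.of(filters).reduce(i -> true, IntPredicate::and);
    }

    public static void main(String[] args) {
        IntPredicate chain = allOf(i -> i % 3 == 1, i -> i % 7 == 3);
        System.out.println(chain.test(10)); // prints true
        System.out.println(chain.test(4));  // prints false
    }
}
```

Every nested and() wrapper invokes its operands through the same test() call sites inside the JDK's default method, so such a chain shares type profiles across all compositions in the application.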

answered Oct 23 '22 09:10

Curiosa Globunznik
