Why is byte addition performance so unpredictable?

A few hours ago I answered another Stack Overflow question, and it gave a very surprising result. The answer can be found here. The answer was (and is) partially wrong; here, however, I want to focus on byte addition.

Strictly speaking, it is actually byte-to-long addition.
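
To make the widening explicit, here is a minimal illustration (not part of the benchmark itself):

byte b = 1;
long sum = 0L;
sum += b; // b is promoted to long before the addition, per Java's numeric promotion rules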

This is the benchmark code I have been using:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

public class ByteAdditionBenchmark {
    private void start() {
        int[] sizes = {
            700_000, // this first, large run is meant to warm the JVM up
            1_000,
            10_000,
            25_000,
            50_000,
            100_000,
            200_000,
            300_000,
            400_000,
            500_000,
            600_000,
            700_000,
        };

        for (int size : sizes) {
            List<byte[]> arrays = createByteArrays(size);
            // Warmup: run the check once over every array before timing
            arrays.forEach(this::byteArrayCheck);
            benchmark(arrays, this::byteArrayCheck, "byteArrayCheck");
        }
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + " ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck(final byte[] array) {
        long sum = 0L;
        for (byte b : array) {
            sum += b;
        }
        return (sum == 0);
    }

    public static void main(String[] args) {
        new ByteAdditionBenchmark().start();
    }
}

And these are the results I have been getting:

Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 50.26538857142857 ns
Benchmark: byteArrayCheck / iterations: 1000 / time per iteration: 20.12 ns
Benchmark: byteArrayCheck / iterations: 10000 / time per iteration: 9.1289 ns
Benchmark: byteArrayCheck / iterations: 25000 / time per iteration: 10.02972 ns
Benchmark: byteArrayCheck / iterations: 50000 / time per iteration: 9.04478 ns
Benchmark: byteArrayCheck / iterations: 100000 / time per iteration: 18.44992 ns
Benchmark: byteArrayCheck / iterations: 200000 / time per iteration: 15.48304 ns
Benchmark: byteArrayCheck / iterations: 300000 / time per iteration: 15.806353333333334 ns
Benchmark: byteArrayCheck / iterations: 400000 / time per iteration: 16.923685 ns
Benchmark: byteArrayCheck / iterations: 500000 / time per iteration: 16.131066 ns
Benchmark: byteArrayCheck / iterations: 600000 / time per iteration: 16.435461666666665 ns
Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 17.107615714285714 ns

To my knowledge, the JVM should already be fully warmed up after the first 700,000 iterations, before it starts to spit out benchmarking data.

How can it be, then, that despite the warmup the performance is still unpredictable? Almost directly after the warmup, byte addition gets blazingly fast, but after that it seems to converge back to a nominal 16 ns per iteration.

The tests were run on a PC with a stock-clocked Intel i7 3770 and 16 GB of RAM, hence I can't go much beyond 700,000 iterations. The machine runs Windows 8.1 64-bit, if that matters.

Update: It turns out that the JIT was optimizing everything away, as per raphw's suggestion.

Therefore I replaced the benchmark method with the following:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

This ensures that the computation cannot be optimized away, and the test results show it (I omitted the result print for clarity):

Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 1658.2627914285715 ns
Benchmark: byteArrayCheck / iterations: 1000 / time per iteration: 1241.706 ns
Benchmark: byteArrayCheck / iterations: 10000 / time per iteration: 1215.941 ns
Benchmark: byteArrayCheck / iterations: 25000 / time per iteration: 1332.94656 ns
Benchmark: byteArrayCheck / iterations: 50000 / time per iteration: 1456.0361 ns
Benchmark: byteArrayCheck / iterations: 100000 / time per iteration: 1753.26777 ns
Benchmark: byteArrayCheck / iterations: 200000 / time per iteration: 1756.93283 ns
Benchmark: byteArrayCheck / iterations: 300000 / time per iteration: 1762.9992266666666 ns
Benchmark: byteArrayCheck / iterations: 400000 / time per iteration: 1806.854815 ns
Benchmark: byteArrayCheck / iterations: 500000 / time per iteration: 1784.09091 ns
Benchmark: byteArrayCheck / iterations: 600000 / time per iteration: 1804.6096366666666 ns
Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 1811.0597585714286 ns

I'd say that these results look a lot more convincing with respect to the computation time. However, my question still stands: with repeated tests at random times, the same pattern remains that benchmarks with a low number of iterations are faster than those with more iterations, though they do seem to stabilize somewhere at or below 100,000 iterations.

What is the explanation?

asked May 23 '14 by skiwi

1 Answer

The reason for your result is that you do not actually know what you are measuring. Java's just-in-time compiler is most certainly taking a look at your code, and it might just happen that you are measuring nothing at all.

The compiler is smart enough to figure out that your List<byte[]> is not actually used for anything. Therefore, it will eventually remove all the related code from your running application. Thus, your benchmark is most likely measuring an increasingly empty application.

The answer to all such questions is always: it is not worth having a discussion before we have actually looked at a valid benchmark. Benchmarking harnesses such as JMH (which I can recommend) know a concept called a black hole. A black hole is meant to confuse the just-in-time compiler into thinking that a computed value is actually used for something, even though it is not. With such black holes, code that would otherwise be erased as a no-op remains in place.
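
As a minimal sketch of the idea, here is how a black hole is used in current JMH versions (where the annotation is @Benchmark; the older @GenerateMicroBenchmark appears in the code further below):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class BlackholeSketch {

    @Benchmark
    public void sumBytes(Blackhole blackhole) {
        long sum = 0L;
        for (int i = 0; i < 4_096; i++) {
            sum += (byte) i;
        }
        // Handing the result to the black hole convinces the JIT that the
        // value is used, so the loop above cannot be eliminated as dead code.
        blackhole.consume(sum);
    }
}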

Another typical problem of home-grown benchmarks is optimized loops. Again, the just-in-time compiler will notice that the loop performs the same computation on every iteration and will therefore remove the loop altogether. With a (quality) benchmarking harness, you only suggest a number of loops to run instead of hard-coding them; this way, the harness can take care of tricking the compiler.
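
To illustrate the trap with a hypothetical hand-rolled timing loop (timeNaively is an invented helper, not code from the question):

// Because the result of each call is discarded and byteArrayCheck is free of
// side effects, the JIT may remove the call entirely, so that the timer ends
// up measuring an empty loop.
private long timeNaively(List<byte[]> arrays) {
    long start = System.nanoTime();
    for (byte[] array : arrays) {
        byteArrayCheck(array); // return value ignored: a dead-code candidate
    }
    return System.nanoTime() - start;
}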

Write a benchmark with JMH and you will see that your measured times are quite different.

Concerning your update: I can only repeat myself: never trust an unharnessed benchmark! An easy way to find out what the JVM is doing to your code is to run JITwatch. The main concern with your benchmark is that it ignores the JVM's profiling. A profile is the JVM's attempt to remember properties of your code, on which it then bases its optimizations. For your benchmark, you mix the profiles of many different runs together. The JVM then has to renew its current profile and recompile the byte code on the fly, which costs time.
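
For reference, JITwatch works on HotSpot's compilation log, which can be produced with flags along these lines (a sketch; see the JITwatch documentation for the exact setup):

java -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation ByteAdditionBenchmark

The resulting hotspot log file can then be opened in JITwatch to see which methods were compiled, inlined or eliminated.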

In order to avoid this issue, a harness like JMH lets you fork a new JVM process for each benchmark. Here is what I am measuring with a harnessed benchmark:

Benchmark                    Mode   Samples         Mean   Mean error    Units
o.s.MyBenchmark.test100k     avgt        20     1922.671       29.155    ns/op
o.s.MyBenchmark.test10k      avgt        20     1911.152       13.217    ns/op
o.s.MyBenchmark.test1k       avgt        20     1857.205        3.086    ns/op
o.s.MyBenchmark.test200k     avgt        20     1905.360       18.102    ns/op
o.s.MyBenchmark.test25k      avgt        20     1832.663      102.562    ns/op
o.s.MyBenchmark.test50k      avgt        20     1907.488       18.043    ns/op

And here is the source code of the benchmark, based on the aforementioned JMH:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
public class MyBenchmark {

    private List<byte[]> input1k, input10k, input25k, input50k, input100k, input200k;

    @Setup
    public void setUp() {
        input1k = createByteArray(1_000);
        input10k = createByteArray(10_000);
        input25k = createByteArray(25_000);
        input50k = createByteArray(50_000);
        input100k = createByteArray(100_000);
        input200k = createByteArray(200_000);
    }

    private static List<byte[]> createByteArray(int length) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(1_000)
    public boolean test1k() {
        return runBenchmark(input1k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(10_000)
    public boolean test10k() {
        return runBenchmark(input10k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(25_000)
    public boolean test25k() {
        return runBenchmark(input25k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(50_000)
    public boolean test50k() {
        return runBenchmark(input50k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(100_000)
    public boolean test100k() {
        return runBenchmark(input100k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(200_000)
    public boolean test200k() {
        return runBenchmark(input200k, this::byteArrayCheck);
    }

    private static boolean runBenchmark(List<byte[]> arrays, Predicate<byte[]> method) {
        boolean someUnrelatedResult = false;
        for (byte[] array : arrays) {
            someUnrelatedResult |= method.test(array);
        }
        return someUnrelatedResult;
    }

    private boolean byteArrayCheck(final byte[] array) {
        long sum = 0L;
        for (byte b : array) {
            sum += b;
        }
        return (sum == 0);
    }

    public static void main(String[] args) throws RunnerException {
        new Runner(new OptionsBuilder()
                .include(".*" + MyBenchmark.class.getSimpleName() + ".*")
                .forks(1)
                .build()).run();
    }
}
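
As a usage note (coordinates as of the JMH of that era; check the current documentation): JMH benchmarks are normally built from the official Maven archetype, which wires up the annotation processing that generates the benchmark stubs, and are then run from the resulting JAR:

mvn archetype:generate -DinteractiveMode=false -DarchetypeGroupId=org.openjdk.jmh -DarchetypeArtifactId=jmh-java-benchmark-archetype -DgroupId=org.sample -DartifactId=benchmarks -Dversion=1.0
mvn clean install
java -jar target/benchmarks.jar
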
answered by Rafael Winterhalter