
Is branch prediction not working?

In reference to this question, the answer specifies that the unsorted array takes more time because it fails branch prediction. But if we make a minor change in the program:

import java.util.Arrays;
import java.util.Random;


public class Main{

    public static void main(String[] args) {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c) {
            data[c] = rnd.nextInt() % 256;
        }

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            // Primary loop
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128) {
                    sum = data[c];
                }
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

Here I have replaced (from the original question)

if (data[c] >= 128) 
    sum += data[c];

with

if (data[c] >= 128) 
    sum = data[c];

the unsorted array gives approximately the same result as the sorted one. Why is branch prediction not working in this case?
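For reference, a minimal standalone harness (my own sketch, not from the original post; class and method names are made up) that times both loop bodies on the same unsorted data, so the two variants can be compared directly:

```java
import java.util.Random;

public class BranchCompare {
    // original body: accumulate into sum
    static long sumAccumulate(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum += data[c];
        return sum;
    }

    // modified body: plain assignment
    static long sumAssign(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum = data[c];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[32768];
        Random rnd = new Random(0);
        for (int c = 0; c < data.length; ++c) data[c] = rnd.nextInt() % 256;

        // time each variant over many passes of the unsorted data
        long t0 = System.nanoTime();
        long a = 0;
        for (int i = 0; i < 1000; ++i) a += sumAccumulate(data);
        long t1 = System.nanoTime();
        long b = 0;
        for (int i = 0; i < 1000; ++i) b += sumAssign(data);
        long t2 = System.nanoTime();

        System.out.println("+= : " + (t1 - t0) / 1e6 + " ms");
        System.out.println("=  : " + (t2 - t1) / 1e6 + " ms");
    }
}
```

Note this is only a rough timing sketch; a proper comparison needs warmup and a benchmark harness such as JMH.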

asked Jan 29 '14 by MYK



1 Answer

I have used jmh to analyze this. Here is my code:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Fork(2)
public class Comparison
{
  static final int SIZE = 1<<15;
  final int[] data = new int[SIZE];

  @Setup
  public void setup() {
    int i = 1;
    for (int c = 0; c < SIZE; ++c) data[c] = (i*=611953);
    for (int c = 0; c < SIZE; ++c) data[c] = data[c] >= 128? 128 : 127;
  }

  @GenerateMicroBenchmark
  public long sum() {
    long sum = 0;
    for (int c = 0; c < SIZE; ++c) if (data[c] >= 128) sum += data[c];
    return sum;
  }
}

Notice I don't use either sorting or random number generation; they are an unnecessary complication. With the formula used in the above code:

data[c] = (i*=611953);

I get 132 µs runtime. If I comment out the line involving

data[c] = data[c] >= 128? 128 : 127;

the time doesn't change at all. This eliminates all arithmetic considerations and focuses on branch prediction. If I use

data[c] = 127;

I get 13 µs, and if I use

data[c] = 128;

I get 16 µs. That's the "base case", stressing the difference between constant branching decisions.
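As a quick sanity check (my own sketch, not part of the original answer) that the `(i*=611953)` recurrence really produces an unpredictable, roughly 50/50 branch, one can count how often the condition holds:

```java
public class BranchPattern {
    // Reproduces the benchmark's deterministic data recurrence and
    // measures what fraction of iterations take the branch.
    static double takenFraction(int size) {
        int i = 1, taken = 0;
        for (int c = 0; c < size; ++c) {
            i *= 611953;            // same recurrence as data[c] = (i*=611953)
            if (i >= 128) taken++;  // the branch under test
        }
        return (double) taken / size;
    }

    public static void main(String[] args) {
        // Roughly half the values are negative (fail) and half are
        // large positives (pass), so the branch is hard to predict.
        System.out.println(takenFraction(1 << 15));
    }
}
```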

My conclusion: this is definitely the effect of low-level branch prediction.

Could the JIT reverse the loop?

Let's analyze your intervention now. If I use the formula as presented in my code above, but change

if (data[c] >= 128) sum += data[c];

to

if (data[c] >= 128) sum = data[c];

then the timing indeed drops from 132 µs to 27 µs.

This is my guess at explaining the drop: an optimizing trick the JIT compiler can do is to reverse the direction of the loop. Now your code becomes

for (int c = SIZE-1; c >= 0; --c) if (data[c] >= 128) { sum = data[c]; break; }

The loop has been short-circuited to the minimum number of iterations needed to reach the same outcome as the original loop.
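The reversal is only legal because the two forms are semantically equivalent: the forward loop keeps overwriting `sum`, so its final value is the last qualifying element, which is exactly what the backward loop finds first. A small check of that equivalence (my own sketch, not from the answer):

```java
public class ReversedLoop {
    // forward loop: the last element >= 128 wins
    static long forward(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum = data[c];
        return sum;
    }

    // reversed loop: take the first match from the end, then stop
    static long reversed(int[] data) {
        long sum = 0;
        for (int c = data.length - 1; c >= 0; --c)
            if (data[c] >= 128) { sum = data[c]; break; }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 200, 50, 130, 7};
        System.out.println(forward(data) + " == " + reversed(data));
    }
}
```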

I added this

data[SIZE-1] = 128;

to the end of the setup() method, but it didn't change the timing. That would seem to invalidate the naïve version of the "loop reversal" conjecture.

No, it's probably cmovl

Analyzing the JIT-generated assembly, I find this:

cmp edx, 0x80
cmovl eax, ebx

cmovl is a conditional move instruction: it performs the effect of the assignment in the then-branch, but without any jump, thereby eliminating the penalty associated with a branch misprediction. This is a good explanation of the actual effect.
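At the source level, the transformation the JIT can apply here amounts to rewriting the if-assignment as a select between two values, which maps naturally onto cmov. A sketch of that equivalence (my own illustration; the JIT performs this at the machine-code level, not on Java source):

```java
public class BranchlessSum {
    // branchy form, as written in the question
    static long branchy(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum = data[c];
        return sum;
    }

    // branchless form: same semantics expressed as a select,
    // which compiles down to a conditional move (cmovl)
    static long branchless(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            sum = (data[c] >= 128) ? data[c] : sum;
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {300, 5, 129, 1};
        System.out.println(branchy(data) + " == " + branchless(data));
    }
}
```

The `sum += data[c]` variant cannot be rewritten this way as cheaply, because the accumulation creates a data dependency on the previous `sum` in both outcomes, which is why the original benchmark still pays for mispredictions.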

answered by Marko Topolnik