In reference to this question, the answer specifies that the unsorted array takes more time because it fails branch prediction. But if we make a minor change to the program:
import java.util.Arrays;
import java.util.Random;

public class Main {
    public static void main(String[] args) {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];
        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c) {
            data[c] = rnd.nextInt() % 256;
        }

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i) {
            // Primary loop
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128) {
                    sum = data[c];
                }
            }
        }
        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
Here I have replaced (from the original question)

if (data[c] >= 128)
    sum += data[c];

with

if (data[c] >= 128)
    sum = data[c];

The unsorted array now gives approximately the same result as the sorted one. I want to ask why branch prediction does not seem to matter in this case.
The dominant CPU designs today rely on dynamic branch prediction. This technique mostly avoids the frontend bubble problem by predicting the correct address of the next instruction even for branches that aren't fully decoded and executed yet.
If the CPU guessed wrong, it has to undo anything the speculatively executed instructions might have done, clear the pipeline, and begin fetching the instructions it was supposed to be executing. This results in the same delay as if it hadn't made a prediction at all.
On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter. The predictor table is indexed with the instruction address bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded.
This scheme was implemented in the MIPS R10000 processor and the results showed prediction accuracy of ~90%.
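To make the bimodal scheme concrete, here is a minimal Java sketch of such a predictor, assuming a power-of-two table of two-bit saturating counters indexed by the low bits of the branch address (the class and method names are only illustrative, not any real hardware's layout):

// Minimal sketch of a bimodal predictor: one two-bit saturating counter per
// table slot, indexed by the low bits of the branch address.
class BimodalPredictor {
    private final int[] counters;   // values 0..3; >= 2 means "predict taken"
    private final int mask;

    BimodalPredictor(int tableSize) {   // tableSize is assumed to be a power of two
        counters = new int[tableSize];
        mask = tableSize - 1;
    }

    boolean predict(long branchAddress) {
        return counters[(int) (branchAddress & mask)] >= 2;
    }

    void update(long branchAddress, boolean taken) {
        int idx = (int) (branchAddress & mask);
        if (taken) {
            counters[idx] = Math.min(3, counters[idx] + 1);
        } else {
            counters[idx] = Math.max(0, counters[idx] - 1);
        }
    }
}

A branch that almost always goes the same way keeps its counter saturated and is predicted correctly, while a data-dependent branch that flips irregularly, like data[c] >= 128 over unsorted data, is mispredicted roughly half the time.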
I have used JMH to analyze this. Here is my code:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Fork(2)
public class Comparison {
    static final int SIZE = 1 << 15;
    final int[] data = new int[SIZE];

    @Setup
    public void setup() {
        int i = 1;
        for (int c = 0; c < SIZE; ++c) data[c] = (i *= 611953);
        for (int c = 0; c < SIZE; ++c) data[c] = data[c] >= 128 ? 128 : 127;
    }

    @GenerateMicroBenchmark   // JMH's pre-1.0 name for what is now @Benchmark
    public long sum() {
        long sum = 0;
        for (int c = 0; c < SIZE; ++c) if (data[c] >= 128) sum += data[c];
        return sum;
    }
}
Notice I don't use either sorting or random number generation; they are an unnecessary complication. With the formula used in the above code:
data[c] = (i*=611953);
I get 132 µs runtime. If I comment out the line involving
data[c] = data[c] >= 128? 128 : 127;
the time doesn't change at all. This eliminates all arithmetic considerations and focuses on branch prediction. If I use
data[c] = 127;
I get 13 µs, and if I use
data[c] = 128;
I get 16 µs. That's the "base case", showing the small difference between the two constant branching decisions, both of which the predictor handles perfectly.
My conclusion: this is definitely the effect of low-level branch prediction.
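As a quick sanity check, separate from the benchmark, one can count how often the condition holds for the multiplier formula, to confirm that the branch really has no trivially predictable outcome (the class name is hypothetical, just for illustration):

// Stand-alone check: how often does data[c] >= 128 hold for the generator
// used in the benchmark above?
public class TakenRatio {
    public static void main(String[] args) {
        final int size = 1 << 15;
        int i = 1;
        int taken = 0;
        for (int c = 0; c < size; ++c) {
            int v = (i *= 611953);      // same overflowing-multiply "generator" as above
            if (v >= 128) taken++;
        }
        System.out.println("taken " + taken + " of " + size + " iterations");
    }
}

A branch that is taken about half the time in an irregular pattern is the worst case for a predictor.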
Let's analyze your intervention now. If I use the formula as presented in my code above, but change
if (data[c] >= 128) sum += data[c];
to
if (data[c] >= 128) sum = data[c];
then the timing indeed drops from 132 µs to 27 µs.
This is my guess at explaining the drop: an optimizing trick the JIT compiler could apply is to reverse the direction of the loop. Now your code becomes

for (int c = SIZE - 1; c >= 0; --c) if (data[c] >= 128) { sum = data[c]; break; }

The loop has been short-circuited to the minimum number of iterations needed to reach the same outcome as the original loop.
I added this

data[SIZE - 1] = 128;

to the end of the setup() method, but it didn't change the timing. That would seem to invalidate the naïve version of the "loop reversal" conjecture.
cmovl

In analyzing the generated assembly I find this:

cmp   edx, 0x80
cmovl eax, ebx

cmovl is a conditional move instruction: it performs the effect of the assignment in the then branch, but without involving any jump, therefore eliminating any penalty associated with branch prediction failure. This is a good explanation of the actual effect.
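In Java terms, the conditional move corresponds to rewriting the assignment as a pure selection between two values, so the result depends on data rather than on control flow. The following is only a source-level sketch of that idea, not the JIT's actual transformation (the class and method names are made up for illustration):

class CmovSketch {
    // Source-level sketch of what the cmovl form computes: the branch disappears
    // and sum simply selects one of two values on every iteration.
    static long conditionalMoveStyle(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c) {
            // Same result as: if (data[c] >= 128) sum = data[c];
            sum = (data[c] >= 128) ? data[c] : sum;
        }
        return sum;
    }
}

This fits the measurements above: the plain assignment could be compiled to the branch-free cmovl form, whereas the sum += data[c] variant kept paying the cost of the unpredictable branch.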