Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strange branching performance

Tags:

The results of my benchmark shows that the performance is worst when the branch has 15% (or 85%) percent probability rather than 50%.

Any explanation?

img

The code is too long but the relevant parts are here:

private int diff(char c) {
    return TABLE[(145538857 * c) >>> 27] - c;
}

@Benchmark int timeBranching(int reps) {
    int result = 0;
    while (reps-->0) {
        for (final char c : queries) {
            if (diff(c) == 0) {
                ++result;
            }
        }
    }
    return result;
}

It counts the number of BREAKING_WHITESPACE characters in the given string. The results shows a sudden time drop (performance increase) when the branching probability reaches about 0.20.

More details about the drop. Varying the seed shows more strange things happening. Note that the black line denoting the minimum and maximum values is very short except when close to the cliff.

cliff

like image 342
maaartinus Avatar asked Oct 30 '13 17:10

maaartinus


People also ask

How much faster is branchless code?

So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz. Intel Core i7).

Why are branch instructions slow?

Because CPU adopts pipeline to execute instructions, which means when a previous instruction is being executed at some stage (for example, reading values from registers), the next instruction will get executed at the same time, but at another stage (for example, decoding stage).

How do I stop branching code?

I believe the most common way to avoid branching is to leverage bit parallelism in reducing the total jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.

What is branchless programming?

Branchless programming is a programming technique that eliminates the branches (if, switch, and other conditional statements) from the program. Although this is not much relevant these days with extremely powerful systems and usage of interpreted languages( especially dynamic typed ones).


1 Answers

It looks like a minor JIT bug. For a small branch probability, it generates something like the following, just much more complicated due to unrolling (I'm simplifying a lot):

movzwl 0x16(%r8,%r10,2),%r9d

Get the char: int r9d = queries[r10]

imul   $0x8acbf29,%r9d,%ebx

Multiply: ebx = 145538857 * r9d

shr    $0x1b,%ebx

Shift: ebx >>>= 27

cmp    %edx,%ebx
jae    somewhere

Check bounds: if (ebx > edx || ebx < 0) goto somewhere (and throw there an IndexOutOfBoundsException.

cmp    %r9d,%ebx
jne    back

Jump back to loop beginning if not equal: if (r9d != ebx) goto back

inc    %eax

Increment the result: eax++

jne    back

Simply goto back

We can see one smart and one dumb thing:

  • The bound check gets done optimally with a single unsigned comparison.
  • It's completely redundant as x >>> 27 is always positive and less than the table length (32).

For a branch probability above 20%, these three instructions

cmp    %r9d,%ebx
jne    back
inc    %eax

get replaced by something like

mov    %eax,%ecx   // ecx = result
inc    %ecx        // ecx++
cmp    %r9d,%ebx   // if (r9b == index)
cmoveq %ecx,%ebx   //    result = ecx

using a conditional move instruction. This is one instruction more, but no branching and thus no branch misprediction penalty.

This explains the very constant time in the 20-80% range. The ramping below 20% is clearly due to branch mispredictions.

So it looks like a JIT failure to use the proper threshold of about 0.04 rather than 0.18.

thr

like image 158
maaartinus Avatar answered Oct 19 '22 14:10

maaartinus