Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it more efficient to branch or multiply?

I am trying to optimize a small, highly used function which uses the high bits in an unsigned short int to indicate the values of an array to sum together. At first I was using the obvious approach shown below. Please note that loop unrolling is not explicitly shown as it should be done by the compiler.

int total = 0;
for(unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++){
    if (i & mask){
        total += value[j];
    }
}

However, later I thought it might be better to remove the branching to help CPU pipelining and came up with the following.

int total = 0;
for(unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++){
    total += ((i & mask) != 0) * value[j];
}

Note that since (i & mask) does not result in a boolean answer, the comparison with 0 forces the result to be either 1 or 0. Although this second approach eliminates the if-statement from this section of the code, the second solution needs to run a multiplication of 0 or 1 on every iteration in addition to the rest of the equation.

Which code will run faster?

like image 699
Nixuz Avatar asked Feb 05 '09 05:02

Nixuz


4 Answers

You could make it branchless without a multiply. It looks like for each bit set you are using that bit position as an index into an array.

First, you can easily extract bits set with:

unsigned short set_mask= i & -i;
i&= i - 1;

Then, you can get the bit index by counting the bits set in (set_mask - 1). There's a constant time formula for this.

Some platforms also have an intrinsic to get the bit index of a bit set which is probably faster. x86 has bsr, PPC has cntlz.

So the answer is the branchless multiplyless version is probably fastest :)

like image 100
MSN Avatar answered Sep 21 '22 13:09

MSN


This depends totally on the compiler, machine instruction set and, probably, phase of the moon.

There is no specific correct answer because of this. If you really want to know, check the assembly output from the compiler.

From a simplistic point of view, I would say that the second is slower since it involves all the calculations of the first plus a multiply. But the compiler may well be smart enough to optimize that away.

So the correct answer is: it depends.

like image 30
paxdiablo Avatar answered Sep 23 '22 13:09

paxdiablo


Which code will run faster?

Test it to find out.

Also, look at the assembly-language version of the code which the compiler emits, because there you might see things in it that surprise you, and which hint at further optimizations (for example, using short as you are using can need more instructions that using the machine's natural integer size).

like image 43
ChrisW Avatar answered Sep 23 '22 13:09

ChrisW


Either could be faster. For some processors, the actual input data may change the answer. You will need to profile both approaches with real data. Here are some things that can affect actual performance on x86 hardware.

Let's assume for the moment that you're using a late-model Pentium 4. That processor has two levels of branch predictors baked into the CPU. If the branch predictors can guess correctly the branch direction, I suspect that the first will be fastest. This is most likely to occur if the flags are nearly all the same value or if they alternate in a very simple pattern most of the time. If the flags are truly random, then the branch predictor will be wrong half the time. For our hypothetical 32-stage Pentium 4, this will kill performance. For Pentium 3 chips, Core 2 chips, Core i7, and most AMD chips, the pipelines are shorter, so the cost of bad branch prediction is much lower.

If your value vector is noticeably larger than the processor's cache, then either approach will be limited by memory bandwidth. They'll both have essentially identical performance characteristics. If the value vector fits comfortably in cache, be careful how you do any profiling so that one of the test loops isn't getting penalized for filling the cache and the other one benefits from it.

like image 32
Mr Fooz Avatar answered Sep 23 '22 13:09

Mr Fooz