The following three pieces of code achieve exactly the same effect. Yet, when compiled with -O3 on GCC 4.5.2, the times for a large number of iterations vary quite markedly.
// a, b, c, d are set to random values 0-255 before each iteration.
if (a < 16 or b < 32 or c < 64 or d < 128) result += a+b+c+d;
if (a < 16 | b < 32 | c < 64 | d < 128) result += a+b+c+d;
result += (a+b+c+d) * (a < 16 | b < 32 | c < 64 | d < 128);
The above times are the best for each method when run as the inner loop of a benchmark program I made. The random() is seeded the same way before each run.
Before I made this benchmark I assumed GCC would optimize away the differences. Especially the 2nd example makes me scratch my head. Can anyone explain why GCC doesn't turn code like this into equivalent faster code?
EDIT: Fixed some errors, and also made it clear that the random numbers are created regardless, and used, so as to not be optimized away. They always were in the original benchmark, I just botched the code I put on here.
Here is an example of an actual benchmark function:
boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> ranchar(0, 255);

double quadruple_or(uint64_t runs) {
  uint64_t result = 0;
  rng.seed(0);
  boost::chrono::high_resolution_clock::time_point start =
      boost::chrono::high_resolution_clock::now();
  for (; runs; runs--) {
    int a = ranchar(rng);
    int b = ranchar(rng);
    int c = ranchar(rng);
    int d = ranchar(rng);
    if (a < 16 or b < 32 or c < 64 or d < 128) result += a;
    if (d > 16 or c > 32 or b > 64 or a > 128) result += b;
    if (a < 96 or b < 53 or c < 199 or d < 177) result += c;
    if (d > 66 or c > 35 or b > 99 or a > 77) result += d;
  }
  // Force gcc to not optimize away result.
  std::cout << "Result check " << result << std::endl;
  boost::chrono::duration<double> sec =
      boost::chrono::high_resolution_clock::now() - start;
  return sec.count();
}
The full benchmark can be found here.
The OP has changed a bit since my original answer. Let me try to revisit here.
In case 1, because of short-circuiting with or, I would expect the compiler to generate four compare-then-branch code sections. Branches can obviously be pretty expensive, especially if they don't go the predicted path.
In case 2, the compiler can decide to do all four comparisons, convert them to bool 0/1 results, bitwise-or all four pieces together, and then do a single (additional) branch. This trades possibly more comparisons for possibly fewer branches. It appears as if reducing the number of branches does improve performance.
In case 3, things work pretty much the same as 2, except at the very end one more branch may be eliminated by explicitly telling the compiler "I know the result will be zero or one, so just multiply the thing on the left by that value". The multiply apparently comes out faster than the corresponding branch on your hardware. This is in contrast to the second example, where the compiler doesn't know the range of possible outputs from the bitwise or, so it has to assume it could be any integer and must do a compare-and-jump instead.
Original answer for history:
The first case is functionally different from the second and third if random has side effects (which a normal PRNG does), so it stands to reason that the compiler may optimize them differently. Specifically, the first case will only call random as many times as needed to pass the check, while in the other two cases random will always be called four times. This will (assuming random really is stateful) result in the subsequent random numbers being different.
The difference between the second and third is that the compiler apparently can't prove that the result of the bitwise or will always be 0 or 1. When you give it a hint by multiplying instead of branching, the multiplication probably comes out faster due to pipelining.
With logical operators the code branches and can exit early; bitwise operators always do the full work. Branch prediction will be worse in the first case, although the short-circuit version can outperform the bitwise one for longer chains of conditions.
It cannot optimise away random() because that function is not pure (calling it has side effects: it advances the generator's state).
You can always try optimizing away the branch and multiply. Instead of:
if (test) result += blah;
or
result += blah * (test);
You can do:
result += blah & (-(test));
If test is false, -false == 0 and (blah & 0) == 0. If test is true, -true == ~0 and (blah & ~0) == blah. You may have to fiddle with test as !!test to ensure true == 1.