I conducted a simple experiment to compare an if-else against a lone if with the default values preset beforehand. Example:
void test0(char c, int *x) {
    *x = 0;
    if (c == 99) {
        *x = 15;
    }
}

void test1(char c, int *x) {
    if (c == 99) {
        *x = 15;
    } else {
        *x = 0;
    }
}
For the functions above, I got the exact same assembly code (using cmovne).
However, when adding an extra variable:
void test2(char c, int *x, int *y) {
    *x = 0;
    *y = 0;
    if (c == 99) {
        *x = 15;
        *y = 21;
    }
}

void test3(char c, int *x, int *y) {
    if (c == 99) {
        *x = 15;
        *y = 21;
    } else {
        *x = 0;
        *y = 0;
    }
}
The assembly suddenly becomes different:
test2(char, int*, int*):
        cmp     dil, 99
        mov     DWORD PTR [rsi], 0
        mov     DWORD PTR [rdx], 0
        je      .L10
        rep ret
.L10:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret

test3(char, int*, int*):
        cmp     dil, 99
        je      .L14
        mov     DWORD PTR [rsi], 0
        mov     DWORD PTR [rdx], 0
        ret
.L14:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret
It seems that the only difference is whether the top movs are done before or after the je.
Now (sorry, my assembly knowledge is a bit crude), isn't it always better to place the movs after the jump, in order to save pipeline flushes? And if so, why wouldn't the optimizer (GCC 6.2, -O3) use the better method?
For the functions above, I got the exact same assembly code (using cmovne).
Sure, some compilers may make that optimization, but it is not guaranteed. It is very possible that you will get different object code for those two ways of writing the function.
In fact, no optimization is guaranteed (although modern optimizing compilers do an impressive job most of the time), so you should either write the code to capture the semantic meaning you intend for it to have, or you should verify the generated object code and write the code to ensure that you are getting the expected output.
Here is what older versions of MSVC will generate when targeting x86-32 (primarily because they don't know to use the CMOV instruction):
test0 PROC
        cmp     BYTE PTR [c], 99
        mov     eax, DWORD PTR [x]
        mov     DWORD PTR [eax], 0
        jne     SHORT LN2
        mov     DWORD PTR [eax], 15
LN2:
        ret     0
test0 ENDP

test1 PROC
        mov     eax, DWORD PTR [x]
        xor     ecx, ecx
        cmp     BYTE PTR [c], 99
        setne   cl
        dec     ecx
        and     ecx, 15
        mov     DWORD PTR [eax], ecx
        ret     0
test1 ENDP
Note that test1 gives you branchless code that utilizes the SETNE instruction (a conditional set, which will set its operand to 0 or 1 based on the condition code, in this case NE) in conjunction with some bit-manipulation to produce the correct value. test0 uses a conditional branch to skip over the assignment of 15 to *x.
The reason this is interesting is because it is almost exactly the opposite of what you might expect. Naïvely, one might expect that test0 would be the way you'd hold the optimizer's hand and get it to generate branchless code. At least, that's the first thought that went through my head. But in fact, that is not the case! The optimizer is able to recognize the if/else idiom and optimize accordingly! It is not able to make that same optimization in the case of test0, where you tried to outsmart it.
However, when adding an extra variable ... The assembly suddenly becomes different
Well, no surprise there. A small change in the code can often have a significant effect on the emitted code. Optimizers are not magic; they are just really complex pattern matchers. You changed the pattern!
Granted, an optimizing compiler could have used two conditional moves here to generate branchless code. In fact, that is precisely what Clang 3.9 does for test3 (but not for test2, consistent with the above analysis showing that optimizers are better able to recognize standard patterns than unusual ones). But GCC doesn't do this. Again, there is no guarantee of a particular optimization being performed.
It seems that the only difference is if the top "mov"s are done before or after the "je".
Now (sorry my assembly is a bit crude), isn't it always better to have the movs after the jump, in order to save pipeline flushes?
No, not really. That would not improve the code in this case. If the branch is mispredicted, you're going to have a pipeline flush no matter what. It doesn't much matter whether the speculatively executed code is a ret instruction or a mov instruction.
The only reason it would matter that a ret instruction immediately followed a conditional branch is if you were writing the assembly code by hand and didn't know to use a rep ret instruction. This is a trick necessary for certain AMD processors that avoids a branch-prediction penalty. Unless you were an assembly guru, you probably wouldn't have known this trick. But the compiler does, and also knows it is not necessary when you're specifically targeting an Intel processor or a different generation of AMD processor that doesn't have this quirk.
However, you might be right that it is better to have the movs after the branch, but not for the reason you suggested. Modern processors (I believe this is Nehalem and later, but I'd look it up in Agner Fog's excellent optimization guides if I needed to verify) are capable of macro-op fusion under certain circumstances. Basically, macro-op fusion means that the CPU's decoder will combine two eligible instructions into one micro-op, saving bandwidth at all stages of the pipeline. A cmp or test instruction followed by a conditional branch instruction, as you see in test3, is eligible for macro-op fusion (actually, there are other conditions that must be met, but this code does meet those requirements). Scheduling other instructions in between the cmp and je, as you see in test2, makes macro-op fusion impossible, potentially making the code execute more slowly.
Arguably, though, this is an optimization defect in the compiler. It could have reordered the mov instructions to place the je immediately after the cmp, preserving the ability for macro-op fusion:
test2a(char, int*, int*):
        mov     DWORD PTR [rsi], 0   ; do the default initialization *first*
        mov     DWORD PTR [rdx], 0
        cmp     dil, 99              ; this is now followed immediately by the conditional
        je      .L10                 ; branch, making macro-op fusion possible
        rep ret
.L10:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret
Another difference between the object code for test2 and test3 is code size. Thanks to padding that is emitted by the optimizer to align the branch target, the code for test3 is 4 bytes larger than test2. That is very unlikely to be enough of a difference to matter, though, especially if this code is not being executed within a tight loop where it is guaranteed to be hot in the cache.
So, does that mean you should always write the code as you did in test2?
Well, no, for several reasons:
Even though it may be optimal in certain very simple cases, the "preset" idiom is not generalizable. If your initialization is time-consuming, it may be faster to skip over it when possible. (There is one example discussed here, in the context of VB 6, where string-manipulation is so slow that eliding it when possible actually results in faster execution time than fancy branchless code. More generally, the same rationale would apply if you were able to branch around a function call.)
Even here, where it appears to result in very simple and possibly more optimal code, it may actually be slower, because you are writing to memory twice in the case where c is equal to 99, and saving nothing in the case where c is not equal to 99.
You might save this cost by rewriting the code such that it accumulates the final value in a temporary register, only storing it to memory at the end, e.g.:
test2b(char, int*, int*):
        xor     eax, eax             ; pre-zero the EAX register
        xor     ecx, ecx             ; pre-zero the ECX register
        cmp     dil, 99
        jne     Done                 ; keep the zeros if c != 99
        mov     eax, 15              ; change the value in EAX if necessary
        mov     ecx, 21              ; change the value in ECX if necessary
Done:
        mov     DWORD PTR [rsi], eax ; store our final temp values to memory
        mov     DWORD PTR [rdx], ecx
        ret
but this clobbers two additional registers (eax and ecx) and may not actually be faster. You'd have to benchmark it. Or trust the compiler to emit this code when it is actually optimal, such as when it has inlined a function like test2 within a tight loop.
Even if you could guarantee that writing the code in a certain way would cause the compiler to emit branchless code, this would not necessarily be faster! While branches are slow when they are mispredicted, mispredictions are actually quite rare. Modern processors have extremely good branch prediction engines, achieving prediction accuracies of greater than 99% in most cases.
Conditional moves are great for avoiding branch mispredictions, but they have the important disadvantage of increasing the length of a dependency chain. By contrast, a correctly predicted branch breaks the dependency chain. (This is probably why GCC doesn't emit two CMOV instructions when you add the extra variable.) A conditional move is only a performance win if you expect branch prediction to fail. If you can count on a prediction success rate of ~75% or better, a conditional branch is probably faster, because it breaks the dependency chain and has a lower latency. And I would suspect that would be the case here, unless c alternates rapidly back and forth between 99 and not-99 each time the function is called. (See Agner Fog's "Optimizing subroutines in assembly language", pp. 70–71.)