I have read a lot of threads about CUDA branch divergence which say that using the ternary operator is better than if/else statements, because the ternary operator doesn't result in branch divergence. I wonder about the following code:
foo = (a > b) ? (bar(a)) : (b);
where bar is another function or some more complicated statements: is it still true that there is no branch divergence?
I don't know what sources you consulted, but with the CUDA toolchain there is no noticeable performance difference between the use of the ternary operator and the equivalent if-then-else sequence in most cases. Where such differences are noticed, they are due to second-order effects in the code generation, and in my experience the code based on the if-then-else sequence may well be the faster one. In essence, ternary operators and tightly localized branching are treated in much the same way. There is no guarantee that a ternary operator will not be translated into machine code containing a branch.
The GPU hardware offers multiple mechanisms that help avoid branches, and the CUDA compiler makes good use of these mechanisms to minimize branches. One is predication, which can be applied to pretty much any instruction. The other is support for select-type instructions, which are essentially the hardware equivalent of the ternary operator. The compiler uses if-conversion to translate short branches into branch-less code sequences. Often, it chooses a combination of predicated code and a uniform branch. In cases of non-divergent control flow (all threads in a warp take the same branch), the uniform branch skips over the predicated code section.
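To make the comparison concrete, here is a minimal sketch of the two source forms from the question. The body of bar is a hypothetical stand-in (the question does not show it), and the HD macro is only an assumption that lets the same file build as plain C++ or under nvcc. Both functions compute the same value; whether either one ends up as predicated code, a select instruction, or a real branch is decided by the compiler, not by the syntax.

```cpp
// Builds as plain C++ and, under nvcc, as CUDA host/device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for the question's bar(); not from the original post.
HD inline int bar(int a) { return a * a + 1; }

// if-then-else form.
HD inline int foo_if(int a, int b) {
    int foo;
    if (a > b) {
        foo = bar(a);
    } else {
        foo = b;
    }
    return foo;
}

// Ternary form, as written in the question.
HD inline int foo_ternary(int a, int b) {
    return (a > b) ? bar(a) : b;
}
```

If you want to see what the compiler actually did with each form, one option is to call them from a kernel and disassemble the compiled binary with cuobjdump -sass, then look for predicated instructions (@P0 ...), SEL, or BRA.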
Except in cases of extreme performance optimization, CUDA code can (and should) be written in natural idioms that are clear and appropriate to the task at hand, using either if-then-else sequences or ternary operators as you see fit. The compiler will take care of the rest.
(I would have liked to add this as a comment on @njuffa's answer, but I don't have enough reputation.) I found a performance difference between the two styles in my program. The if-clause style takes 4.78 ms:
// fin is {0-4}, range_limit = 5
if (fin >= range_limit) {
    res_set = res_set ^ 1;
    fx_ref = fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
    fin = 0;
}
// followed by the branch to the next loop iteration.
// nvvp reports the following assembly for it.
@!P1 LOP32I.XOR R48, R48, 0x1;
@!P1 FMUL.FTZ R39, R7, 14;
@!P1 MOV R0, RZ;
MOV R40, R48;
{ @!P1 FFMA.FTZ R6, R39, c[0x2][0x0], R5;
@!P0 BRA `(.L_35); } // the predicate is also used for the loop's branch
And the ternary style takes 4.46 ms:
res_set = (fin < range_limit) ? res_set : (res_set ^ 1);
fx_ref = (fin < range_limit) ? fx_ref : fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
fin = (fin < range_limit) ? fin : 0;
// the comments name the source line that nvvp attributes each instruction to
ISETP.GE.AND P2, PT, R34.reuse, c[0x0][0x160], PT; //res_set
FADD.FTZ R27, -R25, R4;
ISETP.LT.AND P0, PT, R34, c[0x0][0x160], PT; //fx_ref
IADD32I R10, R10, 0x1;
SHL R0, R9, 0x2;
SEL R4, R4, R27, P1;
ISETP.LT.AND P1, PT, R10, 0x5, PT;
IADD R33, R0, R26;
{ SEL R0, RZ, 0x1, !P2;
STS [R33], R58; }
{ FADD.FTZ R3, R3, 74.75;
STS [R33+0x8], R29; }
{ @!P0 FMUL.FTZ R28, R4, 14; //fx_ref
STS [R33+0x10], R30; }
{ IADD32I R24, R24, 0x1;
STS [R33+0x18], R31; }
{ LOP.XOR R9, R0, R9; //res_set
STS [R33+0x20], R32; }
{ SEL R0, R34, RZ, P0; //fin
STS [R33+0x28], R36; }
{ @!P0 FFMA.FTZ R2, R28, c[0x2][0x0], R3; //fx_ref
The other instructions interleaved in this listing belong to the calculation for the next loop iteration. I think that when many instructions share the same predicate value, the ternary style may give the compiler more opportunity for ILP optimization.