
For CUDA, is there a guarantee that Ternary Operator can avoid branch divergence?

Tags:

cuda

I have read many threads about CUDA branch divergence claiming that the ternary operator is preferable to if/else statements because it does not cause branch divergence. Consider the following code:

foo = (a > b) ? (bar(a)) : (b);

where bar is another function or a more complicated expression. Is it still true that there is no branch divergence?

asked Aug 14 '15 02:08 by Wesley Ranger


2 Answers

I don't know what sources you consulted, but with the CUDA toolchain there is, in most cases, no noticeable performance difference between the ternary operator and the equivalent if-then-else sequence. Where such differences are observed, they are due to second-order effects in code generation, and in my experience the code based on the if-then-else sequence may well be the faster one. In essence, ternary operators and tightly localized branching are treated in much the same way. There is no guarantee that a ternary operator will not be translated into machine code containing a branch.

The GPU hardware offers multiple mechanisms that help avoid branches, and the CUDA compiler makes good use of them to minimize branching. One is predication, which can be applied to pretty much any instruction. The other is support for select-type instructions, which are essentially the hardware equivalent of the ternary operator. The compiler uses if-conversion to translate short branches into branch-less code sequences. Often, it chooses a combination of predicated code and a uniform branch: when control flow is non-divergent (all threads in a warp take the same path), the uniform branch skips over the predicated code section.
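To see what the compiler actually emits for a given case, one option is to write both forms of the question's expression and compare the generated SASS. The following is a minimal sketch, not a benchmark: the body of bar(), the names pick_ternary, pick_if_else, kernel_ternary and kernel_if_else, and the file name pick.cu are all hypothetical, chosen only for illustration. The preprocessor guard lets the same file build as plain C++ so the two forms can be checked for equal results on the host.

```cpp
// Builds with nvcc as a .cu file; also builds as plain C++ for host-side checks.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for the question's bar(); any short function will do.
__host__ __device__ inline int bar(int a) { return 3 * a + 1; }

// Ternary form, as written in the question.
__host__ __device__ inline int pick_ternary(int a, int b) {
    return (a > b) ? bar(a) : b;
}

// Equivalent if-then-else form.
__host__ __device__ inline int pick_if_else(int a, int b) {
    int foo;
    if (a > b) {
        foo = bar(a);
    } else {
        foo = b;
    }
    return foo;
}

#ifdef __CUDACC__
// One kernel per form, so the SASS for each can be inspected separately.
__global__ void kernel_ternary(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = pick_ternary(a[i], b[i]);
}

__global__ void kernel_if_else(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = pick_if_else(a[i], b[i]);
}
#endif
```

Assuming the file is saved as pick.cu, something like `nvcc -cubin -arch=sm_70 -o pick.cubin pick.cu` followed by `cuobjdump -sass pick.cubin` should list the machine code for both kernels (the architecture flag is just an example). Whether a branch, a predicated instruction, or a select shows up depends on the compiler version, the target architecture, and how expensive bar() is.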

Except in cases of extreme performance optimization, CUDA can (and should) be written in natural idioms that are clear and appropriate to the task at hand, using either if-then-else sequences or ternary operators as you see fit. The compiler will take care of the rest.

answered Sep 26 '22 01:09 by njuffa


(I would have posted this as a comment on @njuffa's answer, but I don't have enough reputation.) I did find a performance difference between the two styles in my program. The if-clause style takes 4.78 ms:

// fin is {0-4}, range_limit = 5
if(fin >= range_limit){
   res_set = res_set^1;
   fx_ref = fx + (fxw*(float)DEPTH_M_H/(float)DEPTH_BLOCK_X);
   fin = 0;
} 
// followed by the branch for the next loop iteration

// nvvp reports the following SASS for this code:
    @!P1 LOP32I.XOR R48, R48, 0x1;
    @!P1 FMUL.FTZ R39, R7, 14;
    @!P1 MOV R0, RZ;
    MOV R40, R48;
    {    @!P1 FFMA.FTZ R6, R39, c[0x2][0x0], R5;
    @!P0 BRA `(.L_35);        }    // this predicate is also used for the loop's branch

The ternary style takes 4.46 ms:

res_set = (fin < range_limit) ? res_set: (res_set ^1);
fx_ref  = (fin < range_limit) ? fx_ref : fx + (fxw*(float)DEPTH_M_H/(float)DEPTH_BLOCK_X) ;
fin     = (fin < range_limit) ? fin:0;
// the trailing comments show which source line nvvp attributes each instruction to
    ISETP.GE.AND P2, PT, R34.reuse, c[0x0][0x160], PT;   //res_set
    FADD.FTZ R27, -R25, R4;
    ISETP.LT.AND P0, PT, R34, c[0x0][0x160], PT;         //fx_ref
    IADD32I R10, R10, 0x1;
    SHL R0, R9, 0x2;
    SEL R4, R4, R27, P1;
    ISETP.LT.AND P1, PT, R10, 0x5, PT;
    IADD R33, R0, R26;
    {         SEL R0, RZ, 0x1, !P2;
    STS [R33], R58;        }
    {         FADD.FTZ R3, R3, 74.75;
    STS [R33+0x8], R29;        }
    {    @!P0 FMUL.FTZ R28, R4, 14;                     //fx_ref
    STS [R33+0x10], R30;        }
    {         IADD32I R24, R24, 0x1;
    STS [R33+0x18], R31;        }
    {         LOP.XOR R9, R0, R9;                      //res_set
    STS [R33+0x20], R32;        }
    {         SEL R0, R34, RZ, P0;                     //fin
    STS [R33+0x28], R36;        }
    {    @!P0 FFMA.FTZ R2, R28, c[0x2][0x0], R3;       //fx_ref

The unannotated instructions in between belong to the calculation for the next loop iteration. I think that when many instructions share the same predicate value, the ternary style may give the compiler more opportunity for ILP optimization.
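For anyone who wants to repeat the comparison, the two update styles can be isolated as follows. This is only a sketch under assumptions: the LoopState struct, the function names step_if and step_ternary, and the values given to DEPTH_M_H and DEPTH_BLOCK_X are hypothetical, since the surrounding loop and the real constants are not shown above. It checks only that both styles compute the same state; it says nothing about timing, which would have to be measured inside the real kernel.

```cpp
// Builds with nvcc as a .cu file; also builds as plain C++ for host-side checks.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical values; the real constants are not given in the answer.
#define DEPTH_M_H     28
#define DEPTH_BLOCK_X 2

// Hypothetical bundle of the three variables updated in the loop.
struct LoopState {
    int   res_set;
    float fx_ref;
    int   fin;
};

// If-clause style: all three updates sit under one condition.
__host__ __device__ inline LoopState step_if(LoopState s, float fx, float fxw,
                                             int range_limit) {
    if (s.fin >= range_limit) {
        s.res_set = s.res_set ^ 1;
        s.fx_ref  = fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
        s.fin     = 0;
    }
    return s;
}

// Ternary style: each variable is selected independently.
// fin is updated last so the condition is the same for all three selects.
__host__ __device__ inline LoopState step_ternary(LoopState s, float fx, float fxw,
                                                  int range_limit) {
    s.res_set = (s.fin < range_limit) ? s.res_set : (s.res_set ^ 1);
    s.fx_ref  = (s.fin < range_limit) ? s.fx_ref
                                      : fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
    s.fin     = (s.fin < range_limit) ? s.fin : 0;
    return s;
}
```

Calling either function from the loop body of a kernel and profiling both builds would be one way to check whether the difference reported above shows up on a given GPU and compiler version.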

answered Sep 25 '22 01:09 by Yuttakon Yuttakonkit