Could you tell me why
a =(b>0)?1:0
is better than
if (b>0)a=1; else a =0;
version in CUDA? Please give details. Many thanks.
Yik
CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
From the CUDA C Programming Guide: printf prints formatted output from a kernel to a host-side output stream. The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API).
There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than if/then/else constructs. This is the results of a small test to see whether this is still the case:
__global__ void branchTest0(float *a, float *b, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float aval = a[tidx], bval = b[tidx];
float z0 = (aval > bval) ? aval : bval;
d[tidx] = z0;
}
__global__ void branchTest1(float *a, float *b, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float aval = a[tidx], bval = b[tidx];
float z0;
if (aval > bval) {
z0 = aval;
} else {
z0 = bval;
}
d[tidx] = z0;
}
Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this:
branchTest0:
max.f32 %f3, %f1, %f2;
and
branchTest1:
setp.gt.f32 %p1, %f1, %f2;
selp.f32 %f3, %f1, %f2, %p1;
The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX:
branchTest0:
/*0070*/ /*0x00201c00081e0000*/ FMNMX R0, R2, R0, !pt;
and
branchTest1:
/*0070*/ /*0x0021dc00220e0000*/ FSETP.GT.AND P0, pt, R2, R0, pt;
/*0078*/ /*0x00201c0420000000*/ SEL R0, R2, R0, P0;
So it would seem that, at least for Fermi GPUs with CUDA 4.0 with this sort of construct, the ternary operator does produce fewer instructions that an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.
In general, I would recommend to write CUDA code in a natural style, and let the compiler worry about local branching. Besides predication, the GPU hardware also implements "select" type instructions. Using talonmies's framework and sticking in the original poster's code, I find that the same machine code is produced for both versions with the CUDA 4.0 compiler for sm_20. I used -keep to retain intermediate files, and the cuobjdump utility to produce the disassembly. Both the ternary operator and the if-statement are translated into an FCMP instruction, which is a "select" instruction.
The sample case examined by talonmies is actually a special case. The compiler recognizes some common source code idioms, such as the particular ternary expression frequently used to express max() and min() operations, and generates code accordingly. The equivalent if-statement is not recognized as an idiom.
__global__ void branchTest0(float *bp, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float b = bp[tidx];
float a = (b>0)?1:0;
d[tidx] = a;
}
__global__ void branchTest1(float *bp, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float b = bp[tidx];
float a;
if (b>0)a=1; else a =0;
d[tidx] = a;
}
code for sm_20
Function : _Z11branchTest1PfS_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0010*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0018*/ /*0x10019de218000000*/ MOV32I R6, 0x4;
/*0020*/ /*0x20009ca320044000*/ IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/ /*0x1020dc435000c000*/ IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/ /*0x80211c03200d8000*/ IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/ /*0x90315c4348004000*/ IADD.X R5, R3, c [0x0] [0x24];
/*0040*/ /*0xa0209c03200d8000*/ IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/ /*0x00401c8584000000*/ LD.E R0, [R4];
/*0050*/ /*0xb030dc4348004000*/ IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/ /*0x03f01c003d80cfe0*/ FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/ /*0x00201c8594000000*/ ST.E [R2], R0;
/*0068*/ /*0x00001de780000000*/ EXIT;
....................................
Function : _Z11branchTest0PfS_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0010*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0018*/ /*0x10019de218000000*/ MOV32I R6, 0x4;
/*0020*/ /*0x20009ca320044000*/ IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/ /*0x1020dc435000c000*/ IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/ /*0x80211c03200d8000*/ IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/ /*0x90315c4348004000*/ IADD.X R5, R3, c [0x0] [0x24];
/*0040*/ /*0xa0209c03200d8000*/ IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/ /*0x00401c8584000000*/ LD.E R0, [R4];
/*0050*/ /*0xb030dc4348004000*/ IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/ /*0x03f01c003d80cfe0*/ FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/ /*0x00201c8594000000*/ ST.E [R2], R0;
/*0068*/ /*0x00001de780000000*/ EXIT;
....................................
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With