Could you tell me why <pre class="prettyprint"><code>a =(b>0)?1:0 </code></pre> is better than <pre class="prettyprint"><code>if (b>0)a=1; else a =0; </code></pre> version in CUDA? Please give details. Many thanks. Yik

There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than if/then/else constructs. This is the results of a small test to see whether this is still the case: <pre class="prettyprint"><code>__global__ void branchTest0(float *a, float *b, float *d) { unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x; float aval = a[tidx], bval = b[tidx]; float z0 = (aval > bval) ? aval : bval; d[tidx] = z0; } __global__ void branchTest1(float *a, float *b, float *d) { unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x; float aval = a[tidx], bval = b[tidx]; float z0; if (aval > bval) { z0 = aval; } else { z0 = bval; } d[tidx] = z0; } </code></pre> Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this: <pre class="prettyprint"><code>branchTest0: max.f32 %f3, %f1, %f2; </code></pre> and <pre class="prettyprint"><code>branchTest1: setp.gt.f32 %p1, %f1, %f2; selp.f32 %f3, %f1, %f2, %p1; </code></pre> The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX: <pre class="prettyprint"><code>branchTest0: /*0070*/ /*0x00201c00081e0000*/ FMNMX R0, R2, R0, !pt; </code></pre> and <pre class="prettyprint"><code>branchTest1: /*0070*/ /*0x0021dc00220e0000*/ FSETP.GT.AND P0, pt, R2, R0, pt; /*0078*/ /*0x00201c0420000000*/ SEL R0, R2, R0, P0; </code></pre> So it would seem that, at least for Fermi GPUs with CUDA 4.0 with this sort of construct, the ternary operator does produce fewer instructions that an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.

Why is "a =(b>0)?1:0" better than "if-else" version in CUDA?

Tags:

cuda

Could you tell me why

a =(b>0)?1:0

is better than

if (b>0)a=1; else a =0;

version in CUDA? Please give details. Many thanks.

Yik

461

asked Aug 18 '11 08:08

Yik

2 Answers

There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than if/then/else constructs. This is the results of a small test to see whether this is still the case:

__global__ void branchTest0(float *a, float *b, float *d)
{
        unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
        float aval = a[tidx], bval = b[tidx];
        float z0 = (aval > bval) ? aval : bval;

        d[tidx] = z0;
}

__global__ void branchTest1(float *a, float *b, float *d)
{
        unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
        float aval = a[tidx], bval = b[tidx];
        float z0;

        if (aval > bval) {
            z0 = aval;
        } else {
            z0 = bval;
        }
        d[tidx] = z0;
}

Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this:

branchTest0:
max.f32         %f3, %f1, %f2;

and

branchTest1:
setp.gt.f32     %p1, %f1, %f2;
selp.f32        %f3, %f1, %f2, %p1;

The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX:

branchTest0:
    /*0070*/     /*0x00201c00081e0000*/     FMNMX R0, R2, R0, !pt;

and

branchTest1:
    /*0070*/     /*0x0021dc00220e0000*/     FSETP.GT.AND P0, pt, R2, R0, pt;
    /*0078*/     /*0x00201c0420000000*/     SEL R0, R2, R0, P0;

So it would seem that, at least for Fermi GPUs with CUDA 4.0 with this sort of construct, the ternary operator does produce fewer instructions that an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.

answered Sep 29 '22 12:09

talonmies

In general, I would recommend to write CUDA code in a natural style, and let the compiler worry about local branching. Besides predication, the GPU hardware also implements "select" type instructions. Using talonmies's framework and sticking in the original poster's code, I find that the same machine code is produced for both versions with the CUDA 4.0 compiler for sm_20. I used -keep to retain intermediate files, and the cuobjdump utility to produce the disassembly. Both the ternary operator and the if-statement are translated into an FCMP instruction, which is a "select" instruction.

The sample case examined by talonmies is actually a special case. The compiler recognizes some common source code idioms, such as the particular ternary expression frequently used to express max() and min() operations, and generates code accordingly. The equivalent if-statement is not recognized as an idiom.

__global__ void branchTest0(float *bp, float *d) 
{         
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float b = bp[tidx];
    float a = (b>0)?1:0;
    d[tidx] = a;
} 

__global__ void branchTest1(float *bp, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float b = bp[tidx];
    float a;
    if (b>0)a=1; else a =0;
    d[tidx] = a;
}

code for sm_20
        Function : _Z11branchTest1PfS_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0010*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0018*/     /*0x10019de218000000*/     MOV32I R6, 0x4;
/*0020*/     /*0x20009ca320044000*/     IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x1020dc435000c000*/     IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/     /*0x80211c03200d8000*/     IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/     /*0x90315c4348004000*/     IADD.X R5, R3, c [0x0] [0x24];
/*0040*/     /*0xa0209c03200d8000*/     IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/     /*0x00401c8584000000*/     LD.E R0, [R4];
/*0050*/     /*0xb030dc4348004000*/     IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/     /*0x03f01c003d80cfe0*/     FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/     /*0x00201c8594000000*/     ST.E [R2], R0;
/*0068*/     /*0x00001de780000000*/     EXIT;
        ....................................


        Function : _Z11branchTest0PfS_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0010*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0018*/     /*0x10019de218000000*/     MOV32I R6, 0x4;
/*0020*/     /*0x20009ca320044000*/     IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x1020dc435000c000*/     IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/     /*0x80211c03200d8000*/     IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/     /*0x90315c4348004000*/     IADD.X R5, R3, c [0x0] [0x24];
/*0040*/     /*0xa0209c03200d8000*/     IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/     /*0x00401c8584000000*/     LD.E R0, [R4];
/*0050*/     /*0xb030dc4348004000*/     IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/     /*0x03f01c003d80cfe0*/     FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/     /*0x00201c8594000000*/     ST.E [R2], R0;
/*0068*/     /*0x00001de780000000*/     EXIT;
        ....................................

answered Sep 29 '22 12:09

njuffa

Related questions
                            
                                CUDA max threads in a block
                            
                                Tensorflow cannot open libcuda.so.1
                            
                                How can I get the nvcc CUDA compiler to optimize more?
                            
                                Why aren't there bank conflicts in global memory for Cuda/OpenCL?
                            
                                CUDA error message : unspecified launch failure
                            
                                What's wrong with casting like (void**)&device_array?
                            
                                nvidia-smi Failed to initialize NVML: GPU access blocked by the operating system
                            
                                Interpreting the verbose output of ptxas, part I
                            
                                Using openMP in the cuda host code?
                            
                                Is it possible to access hard disk directly from gpu?
                            
                                Numpy, BLAS and CUBLAS
                            
                                Allocate 2D Array on Device Memory in CUDA
                            
                                What is the difference between PyCUDA and NumbaPro CUDA Python?
                            
                                Does CPU waits for DEVICE to let it finish its kernel execution....?
                            
                                Texture memory in CUDA: Concept and simple example to demonstrate performance
                            
                                error: cuda_runtime.h: No such file or directory
                            
                                what is difference between "-arch sm_13" and "-arch sm_20"
                            
                                CUDA: __syncthreads() inside if statements
                            
                                Concurrent writes in the same global memory location
                            
                                How to calculate Gflops of a kernel

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With