Branch and predicated instructions

Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to better performance than the other.

This comment suggests that branch instructions lead to a greater number of executed instructions, stalling due to "branch address resolution and fetch", and overhead due to "the branch itself" and "book keeping for divergence", while predicated instructions incur only the "instruction execution latency to do the condition test and set the predicate". Why?

203

asked May 17 '15 15:05

lodhb

1 Answer

Instruction predication means that an instruction is conditionally executed by a thread depending on a predicate. Threads for which the predicate is true execute the instruction; the rest do nothing.

For example:

var = 0;

// Not taken by all threads
if (condition) {
    var = 1;
} else {
    var = 2;
}

output = var;

Would result in (not actual compiler output):

       mov.s32 var, 0;       // Executed by all threads.
       setp pred, condition; // Executed by all threads, sets predicate.

@pred  mov.s32 var, 1;       // Executed only by threads where pred is true.
@!pred mov.s32 var, 2;       // Executed only by threads where pred is false.
       mov.s32 output, var;  // Executed by all threads.

All in all, that's 3 instructions for the if, no branching. Very efficient.

The equivalent code with branches would look like:

       mov.s32 var, 0;       // Executed by all threads.
       setp pred, condition; // Executed by all threads, sets predicate.

@!pred bra IF_FALSE;         // Conditional branches are predicated instructions.
IF_TRUE:                    // Label for clarity, not actually used.
       mov.s32 var, 1;
       bra IF_END;
IF_FALSE:
       mov.s32 var, 2;
IF_END:
       mov.s32 output, var;

Notice how much longer it is (5 instructions for the if). The conditional branch requires disabling part of the warp, executing the first path, then rolling back to the point where the warp diverged and executing the second path until both paths converge. This takes longer and requires extra bookkeeping, more code loading (particularly when there are many instructions to execute), and hence more memory requests. All of that makes branching slower than simple predication.
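To make the instruction counts concrete, here is a toy host-side C++ model of the two listings. It is only a sketch under loud assumptions: a 32-thread warp, one count per instruction issued to the warp, and the hypothetical helper names `issued_predicated` and `issued_branching`. It ignores branch latency, instruction fetch and the divergence bookkeeping mentioned in the question, so it shows the instruction count only, not the real cost.

```cpp
#include <cstdint>

// Bit i of a mask is 1 when thread i of the warp has pred == true.
// Assumption: a warp has 32 threads.
constexpr std::uint32_t kFullWarp = 0xFFFFFFFFu;

// Predicated listing: setp, @pred mov and @!pred mov are all issued to the
// warp whatever the mask is; threads whose predicate does not match simply
// do nothing for that instruction.
int issued_predicated(std::uint32_t pred_mask) {
    (void)pred_mask;  // the count does not depend on the mask
    return 3;
}

// Branching listing: setp and the conditional bra are always issued.
// Each side of the if is issued only if at least one thread needs it.
int issued_branching(std::uint32_t pred_mask) {
    const bool any_true  = pred_mask != 0u;         // some thread falls through to IF_TRUE
    const bool any_false = pred_mask != kFullWarp;  // some thread jumps to IF_FALSE
    int issued = 2;                // setp + @!pred bra IF_FALSE
    if (any_true)  issued += 2;    // mov.s32 var, 1 + bra IF_END
    if (any_false) issued += 1;    // mov.s32 var, 2
    return issued;
}
```

In this model a divergent warp (some bits set, some clear) issues 5 instructions with branches against 3 with predication. When the whole warp agrees on the condition, the branching version issues only 4 (all true) or 3 (all false), so the gap in this small example comes from divergence and from the branch instructions themselves.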

And actually, in the case of this very simple conditional assignment, the compiler can do even better, with only 2 instructions for the if:

mov.s32 var, 0;       // Executed by all threads.
setp pred, condition; // Executed by all threads, sets predicate.
selp var, 1, 2, pred; // Sets var depending on predicate (true: 1, false: 2).
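If you want to see what the compiler really does with your own code, a small source file like the following sketch can help. The names `pick_with_branch`, `pick_with_select` and `pick_kernel` are made up for illustration. The `HOST_DEVICE` macro lets the same file build with a plain C++ compiler (to check the logic) and with nvcc (to inspect the generated code). Nothing here guarantees which form the compiler picks; it may well emit the same instructions for both functions.

```cpp
// When compiled by nvcc, make the helpers callable from device code;
// with a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Written as an explicit if/else, like the example in the answer.
HOST_DEVICE inline int pick_with_branch(bool condition) {
    int var = 0;
    if (condition) {
        var = 1;
    } else {
        var = 2;
    }
    return var;
}

// Written as a conditional expression; same result as above.
HOST_DEVICE inline int pick_with_select(bool condition) {
    return condition ? 1 : 2;
}

#ifdef __CUDACC__
// Hypothetical kernel: the condition differs per thread, so threads of one
// warp may disagree on it.
__global__ void pick_kernel(const int* input, int* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = pick_with_branch(input[i] > 0);
    }
}
#endif
```

Compiling this with `nvcc -ptx` produces a PTX file in which you can look for `selp`, instructions prefixed with `@%p...`, or `bra`. PTX is only an intermediate form, so for the final machine code it is worth also checking the SASS, for example with `cuobjdump -sass` on the compiled binary.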

answered Sep 25 '22 05:09

user703016

Tags: cuda, simd
