
Branch predication on GPU

I have a question about branch predication in GPUs. As far as I know, GPUs handle branches using predication.

For example, I have code like this:

if (C)
 A
else
 B

Suppose A takes 40 cycles and B takes 50 cycles to finish execution. If both A and B are executed for one warp, does it take 90 cycles in total to finish this branch? Or are A and B overlapped, i.e., some instructions of A are executed, then the warp waits for a memory request, then some instructions of B are executed, then it waits for memory again, and so on? Thanks.

Zk1001 asked Jul 05 '11 11:07





1 Answer

All of the CUDA-capable architectures released so far operate like a SIMD machine. When there is branch divergence within a warp, both code paths are executed by all the threads in the warp, with the threads that are not following the active path executing the functional equivalent of a NOP. (I think I recall that there is a conditional execution flag attached to each thread in a warp, which allows non-executing threads to be masked off.)
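
To make the masking idea concrete, here is a minimal Python sketch of one warp running `if (C) A else B` in lockstep. It is only a toy model, not how the hardware is implemented: the function `run_divergent_branch`, the two lambdas standing in for A and B, and the even/odd condition are all made up for illustration, and the rule that a side is skipped when no lane needs it is an assumption of the model.

```python
WARP_SIZE = 32  # warp width assumed for this sketch


def run_divergent_branch(cond, path_a, path_b, values):
    """Simulate one warp executing `if (C) A else B` in lockstep.

    cond and values are per-lane lists of length WARP_SIZE.
    Returns the per-lane results and the number of code paths issued.
    """
    mask_a = [bool(c) for c in cond]   # lanes that take the `if` side
    mask_b = [not m for m in mask_a]   # lanes that take the `else` side
    out = list(values)
    passes = 0
    for mask, path in ((mask_a, path_a), (mask_b, path_b)):
        if not any(mask):
            continue                   # model assumption: unused side is skipped
        passes += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:             # masked-off lanes effectively do a NOP
                out[lane] = path(values[lane])
    return out, passes


values = list(range(WARP_SIZE))
cond = [v % 2 == 0 for v in values]    # even lanes take A, odd lanes take B
result, passes = run_divergent_branch(
    cond, lambda x: x + 100, lambda x: -x, values
)
```

With this even/odd condition the warp diverges, so both paths are issued one after the other and each lane keeps only the result of the path it actually took.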

So in your example, the 90 cycles answer is probably a better approximation of what really happens than the alternative.
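
As a rough back-of-the-envelope version of that estimate, the sketch below adds up the cost of each side of the branch that at least one thread in the warp takes. The helper `branch_cycles` is hypothetical, and the model deliberately ignores everything else (memory stalls, other warps being scheduled in between), so treat the numbers as an approximation only.

```python
def branch_cycles(cond, cycles_a, cycles_b):
    """Estimate cycles for one warp to get through `if (C) A else B`.

    cond is the per-thread value of C. Each side is charged in full
    if at least one thread takes it, and the sides run back to back.
    """
    any_a = any(cond)        # some thread takes A
    any_b = not all(cond)    # some thread takes B
    return cycles_a * any_a + cycles_b * any_b


# Numbers from the question: A = 40 cycles, B = 50 cycles.
diverged = branch_cycles([True, False] * 16, 40, 50)   # mixed warp
all_a = branch_cycles([True] * 32, 40, 50)             # every thread takes A
all_b = branch_cycles([False] * 32, 40, 50)            # every thread takes B
```

Under this model a diverged warp pays 40 + 50 = 90 cycles, while a warp whose threads all agree on C pays only for the side it takes.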

talonmies answered Sep 19 '22 14:09

talonmies