 

Is branch divergence really so bad?

I've seen many questions scattered across the Internet about branch divergence, and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".

It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:

if (A):
    foo(A)
else:
    bar(B)

If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:

foo(A)
bar(B)

Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:

In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.

In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles. The difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch-divergent code is faster.
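To make the comparison concrete, here is a minimal CUDA sketch of the two variants; foo, bar, the float buffers, and the odd/even predicate are placeholders for illustration, not anything specific:

__device__ void foo(float *a) { /* some work on A */ }
__device__ void bar(float *b) { /* some work on B */ }

// Divergent version: threads in the same warp can take different paths,
// so the warp runs the if path and then the else path, one after the other.
__global__ void divergent(float *A, float *B)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i & 1) foo(&A[i]);
    else       bar(&B[i]);
}

// Merged version: every thread calls both functions, so there is no
// divergence, but each thread now does the work of both paths.
__global__ void merged(float *A, float *B)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    foo(&A[i]);
    bar(&B[i]);
}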

asked Jun 20 '13 by longbowrocks



1 Answer

You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.

In that case I agree there's not much difference.

But avoiding branch divergence probably has more to do with algorithm re-structuring at a higher level than just the addition or removal of some if statements and making code "safe" to execute in all threads.

I'll offer up one example. Suppose I know that odd threads will need to handle the blue component of a pixel and even threads will need to handle the green component:

#define N 2       // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...

if (threadIdx.x & 1)  foo(pixel(N*threadIdx.x+BLUE));
else                  bar(pixel(N*threadIdx.x+GREEN));

This means that alternate threads in the warp take different paths, either foo or bar. So now my warp takes twice as long to execute.

However, if I rearrange my pixel data so that the color components are contiguous, perhaps in chunks of 32 pixels: BL0 BL1 BL2 ... GR0 GR1 GR2 ...
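A rough host-side sketch of one way that rearrangement could be done (the buffer names are made up, and I'm assuming the pixel count is a multiple of 32):

// Hypothetical reshuffle: from interleaved px0BL px0GR px1BL px1GR ...
// to warp-sized chunks of one component: BL0..BL31 GR0..GR31 BL32..BL63 ...
void reorder_pixels(const float *interleaved, float *chunked, int numPixels)
{
    for (int base = 0; base < numPixels; base += 32) {
        for (int i = 0; i < 32; ++i) {
            int p = base + i;
            chunked[2*base + i]      = interleaved[2*p + 0];  // 32 blue values first
            chunked[2*base + 32 + i] = interleaved[2*p + 1];  // then 32 green values
        }
    }
}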

I can write similar code:

if (threadIdx.x & 32)  foo(pixel(threadIdx.x));
else                   bar(pixel(threadIdx.x));

It still looks like I have the possibility for divergence. But since the divergence happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.
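For completeness, a hypothetical kernel and launch around that branch (the names and types are placeholders); in a 1D block, every thread in a given warp computes the same value of (threadIdx.x & 32), so each warp falls entirely on one side:

// foo and bar stand in for the per-component work, as above.
__device__ void foo(float *p);
__device__ void bar(float *p);

__global__ void process_chunked(float *pixel)   // pixel is a plain device array here
{
    if (threadIdx.x & 32)  foo(&pixel[threadIdx.x]);
    else                   bar(&pixel[threadIdx.x]);
}

// Example launch: one block of 64 threads = two warps, one per branch side.
// process_chunked<<<1, 64>>>(d_pixels);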

This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.

answered Oct 07 '22 by Robert Crovella