Efficiency of branching in shaders

Tags:

I understand that this question may seem somewhat ungrounded, but if someone knows anything theoretical / has practical experience on this topic, it would be great if you share it.

I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.

I've got diffuse, normal, specular maps for each of three possible mapping planes and for some faces which are near to the user I also have to apply mapping techniques, which also bring a lot of texture lookups (like parallax occlusion mapping).

Profiling showed that texture lookups are the bottleneck of the shader and I am willing to remove some of them away. For some cases of the input parameters I already know that part of the texture lookups would be unnecessary and the obvious solution is to do something like (pseudocode):

if (part_actually_needed) {    perform lookups;    perform other steps specific for THIS PART; }  // All other parts.

Now - here comes the question.

I do not remember exactly (that's why I stated the question might be ungrounded), but in some paper I recently read (unfortunately, can't remember the name) something similar to the following was stated:

The performance of the presented technique depends on how efficient the HARDWARE-BASED CONDITIONAL BRANCHING is implemented.

I remembered this kind of statement right before I was about to start refactoring a big number of shaders and implement that if-based optimization I was talking about.

So - right before I start doing that - does someone know something about the efficiency of the branching in shaders? Why could branching give a severe performance penalty in shaders?

And is it even possible that I could only worsen the actual performance with the if-based branching?

You might say - try and see. Yes, that's what I'm going to do if nobody here is helps me :)

But still, what in the if case may be effective for new GPU's could be a nightmare for a bit older ones. And that kind of issue is very hard to forecast unless you have a lot of different GPU's (that's not my case)

So, if anyone knows something about that or has benchmarking experience for these kinds of shaders, I would really appreciate your help.

Few remaining brain cells that are actually working keep telling me that branching on the GPU's might be far not as effective as branching for the CPU (which usually has extremely efficient ways of branch predictions and eliminating cache misses) simply because it's a GPU (or that could be hard / impossible to implement on the GPU).

Unfortunately I am not sure if this statement has anything in common with the real situation...

637

asked Nov 14 '10 04:11

Yippie-Ki-Yay

2 Answers

If the condition is uniform (i.e. constant for the entire pass), then the branch is essentially free because the framework will essentially compile two versions of the shader (branch taken and not) and choose one of these for the entire pass based on your input variable. In this case, definitely go for the if statement as it will make your shader faster.

If the condition varies per vertex/pixel, then it can indeed degrade performance and older shader models don't even support dynamic branching.

118

answered Sep 28 '22 16:09

casablanca

Unfortunately, I think the real answer here is to do practical testing with a performance analyser of your specific case, on your target hardware. Particularly given that it sounds like you're at project optimisation stage; this is the only way to take into account the fact that hardware changes frequently and the nature of the specific shader.

On a CPU, if you get a mispredicted branch, you'll cause a pipeline flush and since CPU pipelines are so deep, you'll effectively lose something in the order of 20 or more cycles. On the GPU things a little different; the pipeline are likely to be far shallower, but there's no branch prediction and all of the shader code will be in fast memory -- but that's not the real difference.

It's difficult to know the exact details of everything that's going on, because nVidia and ATI are relatively tight-lipped, but the key thing is that GPUs are made for massively parallel execution. There are many asynchronous shader cores, but each core is again designed to run multiple threads. My understanding is that each core expects to run the same instruction on all it's threads on any given cycle (nVidia calls this collection of threads a "warp").

In this case, a thread might represent a vertex, a geometry element or a pixel/fragment and a warp is a collection of about 32 of those. For pixels, they're likely to be pixels that are close to each other on screen. The problem is, if within one warp, different threads make different decisions at the conditional jump, the warp has diverged and is no longer running the same instruction for every thread. The hardware can handle this, but it's not entirely clear (to me, at least) how it does so. It's also likely to be handled slightly differently for each successive generation of cards. The newest, most general CUDA/compute-shader friendly nVidias might have the best implementation; older cards might have a poorer implementation. The worse case is you may find many threads executing both sides of if/else statements.

One of the great tricks with shaders is learning how to leverage this massively parallel paradigm. Sometimes that means using extra passes, temporary offscreen buffers and stencil buffers to push logic up out of the shaders and onto the CPU. Sometimes an optimisation may appear to burn more cycles, but it could actually be reducing some hidden overhead.

Also note that you can explicitly mark if statements in DirectX shaders as [branch] or [flatten]. The flatten style gives you the right result, but always executes all in the instructions. If you don't explicitly choose one, the compiler can choose one for you -- and may pick [flatten], which is no good for your example.

One thing to remember is that if you jump over the first texture lookup, this will confuse the hardware's texture coordinate derivative math. You'll get compiler errors and it's best not to do so, otherwise you might miss out on some of the better texturing support.

answered Sep 28 '22 17:09

David Jewsbury

Related questions
                            
                                Subquery v/s inner join in sql server
                            
                                BETWEEN clause versus <= AND >=
                            
                                Using AVX intrinsics instead of SSE does not improve speed -- why?
                            
                                B trees vs binary trees
                            
                                Why is creating a HashMap faster than creating an Object[]?
                            
                                What is the performance hit of Performance Counters
                            
                                Is fastcall really faster?
                            
                                How is the jQuery selector $('#foo a') evaluated?
                            
                                Fastest technique to pass messages between processes on Linux?
                            
                                StringBuilder/StringBuffer vs. "+" Operator
                            
                                Why is this loop faster than a dictionary comprehension for creating a dictionary?
                            
                                JavaScript performance difference between double equals (==) and triple equals (===)
                            
                                Which is better way to calculate nCr
                            
                                Random sample of character vector, without elements prefixing one another
                            
                                Difference between mt_rand() and rand()
                            
                                Does php run faster without warnings?
                            
                                Number of partitions in RDD and performance in Spark
                            
                                What's the fastest way to concatenate two Strings in Java?
                            
                                What is the best way to generate all possible three letter strings?
                            
                                Getting an accurate execution time in C++ (micro seconds)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Efficiency of branching in shaders

Tags:

performance

branch

shader

Yippie-Ki-Yay

People also ask

2 Answers

casablanca

David Jewsbury

Recent Activity

Donate For Us