GLSL branching behaviour

Tags: opengl, glsl

I have a rather simple fragment shader with a branch and I'm a bit unsure how it is handled by the GLSL compiler and how it would affect performance.

uniform sampler2D sampler;
uniform vec2 texSize;
uniform vec2 targetSize;

void main()
{
    vec4 color;
    if(texSize == targetSize)
        color = texture2DNearest(sampler, gl_TexCoord[0]);
    else
        color = texture2DBicubic(sampler, gl_TexCoord[0]);
    gl_FragColor = color;
}

I have read in an AMD document that sometimes both branches are executed, which would not be a good idea in this case. Without further information or access to the disassembly, I'm unsure what to think about this, or how to avoid it if it is a problem.

Also, from my understanding, a branch based on a uniform variable will not incur any significant overhead, since its value is constant over a single pass. Is that correct?

375

asked Nov 28 '10 22:11

ronag

2 Answers

Here is the IL generated for your shader:

il_ps_2_0
dcl_input_generic_interp(linear) v1
dcl_resource_id(0)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
eq r2.xy__, c1.xyyy, c0.xyyy
imul r5.x___, r2.x, r2.y
mov r1.x___, r5.x
if_logicalnz r1.x
    sample_resource(0)_sampler(0) r6, v1.xyyy
    mov r7, r6
else
    sample_resource(0)_sampler(0) r8, v1.xyyy
    mov r7, r8
endif
mov r9, r7
mov oC0, r9
endmain

To rephrase what Kos said: what matters is whether the guard condition can be known before execution. That is the case here, since the c1 and c0 registers are constants (constant registers start with the letter 'c'), and so the value of r1.x, which is computed only from them, is constant as well.

That means this value is the same for every fragment shader invocation, so no thread divergence can happen.
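To make the reasoning about the guard concrete, here is a minimal Python sketch that mimics the three IL instructions feeding the branch (eq, imul, if_logicalnz). It is only a model, not how the driver works: I am assuming that eq produces an all-ones mask for equal components and 0 otherwise, and the uniform values and the 8x4 tile of fragments are made up for illustration. Which of c0 and c1 holds texSize does not matter, because equality is symmetric.

```python
MASK32 = 0xFFFFFFFF  # 32-bit register width


def guard(c0, c1):
    """Evaluate the branch condition the way the IL listing does."""
    # eq r2.xy__, c1.xyyy, c0.xyyy
    # Assumption: an equal component yields an all-ones mask, an unequal one yields 0.
    r2_x = MASK32 if c1[0] == c0[0] else 0
    r2_y = MASK32 if c1[1] == c0[1] else 0
    # imul r5.x___, r2.x, r2.y: the 32-bit product is nonzero only if both masks are set.
    r5_x = (r2_x * r2_y) & MASK32
    # if_logicalnz r1.x: take the first branch when the value is nonzero.
    return r5_x != 0


# Hypothetical uniform values, set once per draw call.
tex_size = (640.0, 480.0)
target_size = (640.0, 480.0)

# Hypothetical 8x4 tile of fragments; none of their data feeds the guard.
fragments = [(x, y) for y in range(4) for x in range(8)]
decisions = {guard(tex_size, target_size) for _ in fragments}

print(decisions)  # {True}
```

The set of decisions has a single element because the guard reads nothing that varies per fragment. If it read v1 (the interpolated texture coordinate) instead, neighbouring fragments could disagree.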

By the way, I'm using AMD GPU ShaderAnalyzer to translate the GLSL into IL. It can also generate native GPU assembly code for a specific hardware generation (ranging from HD29xx to HD58xx). It is a really good tool!

169

answered Oct 21 '22 22:10

Stringer

Yes, IIRC you won't incur a performance overhead, since all the threads in a single batch (warp) on a single GPU processor will take the same branch. By 'thread' I mean a single invocation of the shader.

The efficiency problem arises when the threads executed at a given time by a given processor (up to about 32 threads AFAIK; it depends on the hardware, and I'm giving the number for the G80 architecture) take different branches. One processor cannot execute two different instructions at the same time, so first the "if" branch is executed by some of the threads while the rest wait, and then the "else" branch is executed by the remaining threads.

That's not the case with your code, so I believe you're safe.
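As a rough illustration of the lockstep behaviour described above, here is a toy cost model in Python. The warp size of 32 follows the G80 figure quoted above; the per-branch costs are invented numbers, and real hardware scheduling is more involved than this.

```python
def warp_cost(takes_if, if_cost, else_cost):
    """Cost (arbitrary units) of one batch of threads running in lockstep."""
    cost = 0
    # The "if" side is executed once if at least one thread needs it.
    if any(takes_if):
        cost += if_cost
    # The "else" side is executed once if at least one thread needs it.
    if not all(takes_if):
        cost += else_cost
    return cost


WARP_SIZE = 32     # G80 figure from the answer; depends on hardware
NEAREST_COST = 1   # hypothetical cost of the nearest-neighbour path
BICUBIC_COST = 16  # hypothetical cost of the bicubic path

# Uniform guard: every thread in the warp agrees.
uniform_branch = [True] * WARP_SIZE
# Per-fragment guard: threads disagree, so both sides run one after the other.
divergent_branch = [i % 2 == 0 for i in range(WARP_SIZE)]

print(warp_cost(uniform_branch, NEAREST_COST, BICUBIC_COST))    # 1
print(warp_cost(divergent_branch, NEAREST_COST, BICUBIC_COST))  # 17
```

With a guard that depends only on uniforms, every warp falls into the first case, so it pays for one side of the branch only.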

answered Oct 21 '22 21:10

Kos
