I've been trying to write a general compute shader Gaussian blur implementation.
It basically works; however, it contains artifacts which change every frame, even when the scene is static. I've spent the past few hours trying to debug this. I've gone as far as ensuring bounds aren't exceeded, unrolling all the loops, and replacing uniforms with constants, yet the artifacts persist.
I've tested the original code with artifacts on 3 different machines/GPUs (2 NVIDIA, 1 Intel) and they all produce the same results. Simulating the unrolled/constant version of the code's execution in plain C++, with workgroups executed both forwards and backwards, doesn't reproduce these errors.
By allocating a shared array of [96][96] instead of [16][48], I can eliminate most of the artifacts.
This got me to the point of thinking I was missing a logic error, so I reduced the problem to a very simple shader which still produces the error on a smaller scale. I'd appreciate it if someone could point out the cause. I've checked a lot of documentation and can't find anything incorrect.
A shared array of 16x48 floats is allocated. This is 16 × 48 × 4 bytes = 3072 bytes, roughly 10% of the 32 KiB minimum shared memory limit.
The shader is launched in 16x16 workgroups, so each thread writes to 3 unique locations and reads back from a single unique location.
The texture is then rendered as HSV, whereby values between 0-1 map to hue 0-360 (red-cyan-red), and out-of-bounds values render as red.
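For reference, the display pass might look like the sketch below (the hue2rgb helper and the texture/varying names are assumptions; the post doesn't show the actual display shader):

#version 430
uniform sampler2D _texIn; //assumed name for the result texture
in vec2 uv;
out vec4 fragColor;
//Hypothetical hue-to-RGB mapping: 0.0 -> red, 0.5 -> cyan, 1.0 -> red
vec3 hue2rgb(float h)
{
    float r = abs(h * 6.0 - 3.0) - 1.0;
    float g = 2.0 - abs(h * 6.0 - 2.0);
    float b = 2.0 - abs(h * 6.0 - 4.0);
    return clamp(vec3(r, g, b), 0.0, 1.0);
}
void main()
{
    float v = texture(_texIn, uv).r;
    //Out-of-bounds values (e.g. the 20000.0 sentinel below) show as solid red
    fragColor = (v < 0.0 || v > 1.0) ? vec4(1.0, 0.0, 0.0, 1.0)
                                     : vec4(hue2rgb(v), 1.0);
}

The reduced compute shader: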
#version 430
//Execute in 16x16 sized thread blocks
layout(local_size_x=16,local_size_y=16) in;
uniform layout (r32f) restrict writeonly image2D _imageOut;
shared float hoz[16][48];
void main ()
{
//Init shared memory with a big out of bounds value we can identify
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = 20000.0f;
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y+16] = 20000.0f;
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y+32] = 20000.0f;
//Sync shared memory
memoryBarrierShared();
//Write the values we want to actually read back
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = 0.5f;
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y+16] = 0.5f;
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y+32] = 0.5f;
//Sync shared memory
memoryBarrierShared();
//i=0,8,16 work
//i=1-7, 9-15, 17 don't work (haven't bothered testing further)
const int i = 17;
imageStore(_imageOut, ivec2(gl_GlobalInvocationID.xy), vec4(hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y+i]));
//Sync shared memory (can't hurt)
memoryBarrierShared();
}
Launching this shader with dispatch dimensions greater than 8x8 produces artifacts in the affected area of the image.
glDispatchCompute(9, 9, 1);//z dimension must be >= 1 for any work to execute
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
I had to breakpoint and step frames to capture this; it took around 14 frames.
glDispatchCompute(512/16, 512/16, 1);//Full image is 512x512
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Again I had to breakpoint and step frames to capture this; when running at 60 FPS (vsync) the artifacts appeared more frequently/simultaneously.
memoryBarrierShared();
No, that only makes writes visible to other invocations. You have to make sure that all of the writes have actually happened if you want to be able to read from other invocations' data.
That is done with the barrier() function, which should be called after memoryBarrierShared().
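Applied to the reduced shader, each synchronization point becomes a memoryBarrierShared()/barrier() pair. A minimal sketch of the corrected pattern (names as in the original shader; the other two shared rows are omitted for brevity):

//Init shared memory with the sentinel value
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = 20000.0f;
//Order this invocation's shared-memory writes...
memoryBarrierShared();
//...and wait until every invocation in the workgroup has finished writing
barrier();
//Write the values we actually want to read back
hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = 0.5f;
memoryBarrierShared();
barrier();
//Only now is it safe to read elements written by other invocations
imageStore(_imageOut, ivec2(gl_GlobalInvocationID.xy), vec4(hoz[gl_LocalInvocationID.x][gl_LocalInvocationID.y + 17]));

memoryBarrierShared() only controls ordering/visibility of memory writes; it does not make other invocations wait. barrier() is the execution synchronization: without it, a fast invocation can read hoz before a slower invocation has overwritten the 20000.0 sentinel, which is exactly the out-of-range red that comes and goes from frame to frame.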