Why does barrier synchronize shared memory when memoryBarrier doesn't?

Tags:

glsl

The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.

In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.

Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:

(image: unsynchronized result, showing stripes)

which is the same result as having no synchronization or using memoryBarrier() instead.

If I use barrier(), I get the following (desired) result:

(image: correct result)

The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.

What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?

#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly  image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x,0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i,0));
        }
    }

    // with no synchronization:   stripes
    // memoryBarrier();        // stripes
    // memoryBarrierShared();  // stripes
    // barrier();              // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
asked Jul 02 '13 by James Wilcox


1 Answer

The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which have no dedicated output at all and produce their results only by writing into writable stores such as images, storage buffers, or atomic counters. This may require manual synchronization between individual passes: otherwise a fragment shader reading a texture might not see the most recent data written into that texture with image store operations by a preceding pass, such as your compute shader.

So it may be that your compute shader works perfectly, and what fails is the synchronization with the following display (or whatever) pass that needs to read this image data somehow. For this purpose the glMemoryBarrier function exists. Depending on how the pass after the compute shader reads that image data, you need to pass a different flag to this function. If it reads the image through a texture sampler, use GL_TEXTURE_FETCH_BARRIER_BIT; if it uses an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if it displays the image using glBlitFramebuffer, use GL_FRAMEBUFFER_BARRIER_BIT; and so on.
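
For example, a minimal host-side sketch in C of what that could look like when the result is later sampled as a texture. The handles computeProgram, drawProgram, and tex are hypothetical and created elsewhere; image-unit and uniform bindings are elided:

/* Compute pass that writes `tex` via imageStore(). */
glUseProgram(computeProgram);
glBindImageTexture(0, tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
glDispatchCompute(width / 64, height, 1);  /* one 64-wide workgroup per row segment */

/* Make the image writes visible to the texture fetches of the next pass. */
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

/* Display pass that samples `tex` with texture() in its fragment shader. */
glUseProgram(drawProgram);
glBindTexture(GL_TEXTURE_2D, tex);
/* ... draw fullscreen quad ... */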

Though I don't have much experience with image load/store and manual memory synchronization, this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:

  • Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.

  • Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.


EDIT: Actually, the Wiki article on compute shaders says:

Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.

Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.

The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.

While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).

To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.

So this actually sounds like you need the barrier() there, and memoryBarrierShared() alone is not enough (though you don't need both, as the last sentence says). A memory barrier only synchronizes the memory; it doesn't stop the execution of the threads from crossing it. So the threads won't read stale cached data from shared memory once the first thread has actually written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.
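
Applied to the shader in the question, that means the fix is exactly what you observed: an execution barrier between the shared-memory write and the reads, instead of (not in addition to) the memory-barrier calls:

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i,0));
        }
    }

    barrier();  // wait until invocation 0 has finished writing shared_data;
                // this also makes those writes visible to the whole group

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);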

This also fits perfectly with the fact that workgroup sizes of 32 and below work, and that the first 32 pixels come out correct. At least on NVIDIA hardware, 32 is the warp size, i.e. the number of threads that operate in lock-step. So the first 32 threads (well, every block of 32 threads) always execute exactly in parallel (conceptually, at least) and thus cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you work inside a single warp, a common optimization.
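
For completeness, a sketch of that single-warp shortcut. It assumes a 32-wide warp executing in lock-step (true on NVIDIA hardware, but not something GLSL guarantees), so it is a hardware-specific optimization, not portable code:

#version 430

#define SIZE 32  // assumption: matches the hardware warp size

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly  image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // No barrier(): the whole workgroup is assumed to fit in one warp
    // and run in lock-step on this hardware. Not portable!
    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}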

answered Oct 29 '22 by Christian Rau