I have the following pipeline:
Now, the problem is in synchronizing data between stages 4 and 5. Here are the sync solutions I have tried.
glFlush - doesn't really work, as it doesn't guarantee that all of the submitted commands have finished executing.
glFinish - this one works, but it is not recommended as it waits for all of the commands submitted to the driver to complete.
ARB_sync - here it is said that it is not recommended because it heavily impacts performance.
glMemoryBarrier - this one is interesting, but it simply doesn't work.
Here is an example of the code:
glMemoryBarrier(GL_ALL_BARRIER_BITS);
I also tried:
glTextureBarrierNV()
The code execution goes like this:
//rendered into the fbo...
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
glFinish(); // <-- must sync here, otherwise the CUDA buffer doesn't receive all the data
//cuda maps the image to CUDA buffer here..
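For reference, the CUDA-side mapping step referred to in that last comment could look roughly like the sketch below. This is an assumption about the setup, not the asker's actual code; gfxResource and stream are placeholder names, and the image is assumed to have been registered once with cudaGraphicsGLRegisterImage.
// Hypothetical sketch of step 5: map the GL image into CUDA and read from it.
// Registration is assumed to have happened once at startup:
//   cudaGraphicsResource_t gfxResource;
//   cudaGraphicsGLRegisterImage(&gfxResource, imageTex, GL_TEXTURE_2D,
//                               cudaGraphicsRegisterFlagsReadOnly);
cudaGraphicsMapResources(1, &gfxResource, stream);
cudaArray_t mappedArray;
cudaGraphicsSubResourceGetMappedArray(&mappedArray, gfxResource, 0, 0);
// ... launch CUDA kernels / copies that consume mappedArray on stream ...
cudaGraphicsUnmapResources(1, &gfxResource, stream);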
Moreover, I tried unbinding the FBO and unbinding the textures from the context before launching the compute shader. I even tried to launch one compute after the other, with a glMemoryBarrier set between them, and to fetch the target image of the first compute launch into CUDA. Still no sync. (Well, that makes sense, as two compute dispatches also run out of sync with each other.) Putting glMemoryBarrier after the compute shader stage doesn't sync either! It only works when I replace it with glFinish, or with any other operation that completely stalls the pipeline, like glMapBuffer(), for example.
So should I just use glFinish(), or am I missing something here?
Why doesn't glMemoryBarrier() sync the compute shader work before CUDA takes over?
UPDATE
I would like to refactor the question a little bit, as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC) the issue is still alive. So, I don't care about why glMemoryBarrier doesn't sync. What I want to know is:
Whether it is possible to synchronize the completion of OpenGL compute shader execution with CUDA's usage of that shared resource (in my case an OpenGL image) without stalling the whole rendering pipeline.
If the answer is 'yes', then how?
I know this is an old question, but if any poor soul stumbles upon this...
First, the reason glMemoryBarrier does not work: it requires the OpenGL driver to insert a barrier into the pipeline, but CUDA does not care about the OpenGL pipeline at all.
Second, the only other way outside of glFinish is to use glFenceSync in combination with glClientWaitSync:
....
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
... other work you might want to do that does not impact the buffer...
GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs);
if(res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) {
...handle timeouts and failures
}
cudaGraphicsMapResources(1, &gfxResource, stream);
...
This will cause the CPU to block until the GPU has finished all commands up to the fence, including memory transfers and compute operations.
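If blocking the calling thread right there is still too costly, a hedged variation on the same idea (not part of the original answer) is to poll the fence with a zero timeout and interleave unrelated CPU work until it signals:
// Sketch: poll the fence instead of blocking on it; fence, gfxResource and
// stream are the same placeholder names as in the snippet above.
GLenum status;
do {
    // ... do other CPU-side work that does not touch the shared image ...
    status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
} while (status == GL_TIMEOUT_EXPIRED);
if (status == GL_WAIT_FAILED) {
    // ...handle the failure...
}
cudaGraphicsMapResources(1, &gfxResource, stream);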
Unfortunately, there is no way to tell CUDA to wait on an OpenGL memory barrier/fence. If you really require the extra bit of asynchronicity, you'll have to switch to DirectX 12, for which CUDA supports importing fences/semaphores and waiting on them, as well as signaling them, from a CUDA stream via cuImportExternalSemaphore, cuWaitExternalSemaphoresAsync, and cuSignalExternalSemaphoresAsync.
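For completeness, a rough sketch of that DirectX 12 route with the CUDA driver API is shown below. The shared handle, fence value, and stream are placeholders (assumptions, not from the answer); the handle is assumed to come from ID3D12Device::CreateSharedHandle on the D3D12 fence.
// Sketch: import a D3D12 fence as a CUDA external semaphore and make a CUDA
// stream wait on it, so no CPU-side stall is needed.
CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC handleDesc = {};
handleDesc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE;
handleDesc.handle.win32.handle = sharedFenceHandle; // from ID3D12Device::CreateSharedHandle

CUexternalSemaphore extSem;
cuImportExternalSemaphore(&extSem, &handleDesc);

// Make the stream wait (on the GPU) until the fence reaches fenceValue.
CUDA_EXTERNAL_SEMAPHORE_WAIT_PARAMS waitParams = {};
waitParams.params.fence.value = fenceValue;
cuWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);

// ... enqueue CUDA work on the stream that consumes the shared resource ...

// Optionally signal the fence back to D3D12 once the CUDA work is queued.
CUDA_EXTERNAL_SEMAPHORE_SIGNAL_PARAMS signalParams = {};
signalParams.params.fence.value = fenceValue + 1;
cuSignalExternalSemaphoresAsync(&extSem, &signalParams, 1, stream);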