I have the following pipeline:
Now, the problem is in synchronizing data between stages 4 and 5. Here are the sync solutions I have tried.
glFlush - doesn't really work, as it doesn't guarantee that all of the submitted commands have finished executing.
glFinish - this one works, but it is not recommended as it waits for all of the commands submitted to the driver to complete.
ARB_sync - here it is said that it is not recommended because it heavily impacts performance.
glMemoryBarrier - this one is interesting, but it simply doesn't work.
Here is an example of the code:
glMemoryBarrier(GL_ALL_BARRIER_BITS);
I also tried:
glTextureBarrierNV()
The code execution goes like this:
//rendered into the fbo...
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
glFinish(); // <-- must sync here, otherwise the CUDA buffer doesn't receive all the data
//cuda maps the image to CUDA buffer here..
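For reference, the CUDA-side mapping step referred to in that last comment could look roughly like the sketch below. This is an assumption about the setup, not the asker's actual code; gfxResource and stream are placeholder names, and the image is assumed to have been registered once with cudaGraphicsGLRegisterImage.
// Hypothetical sketch of step 5: map the GL image into CUDA and read from it.
// Registration is assumed to have happened once at startup:
//   cudaGraphicsResource_t gfxResource;
//   cudaGraphicsGLRegisterImage(&gfxResource, imageTex, GL_TEXTURE_2D,
//                               cudaGraphicsRegisterFlagsReadOnly);
cudaGraphicsMapResources(1, &gfxResource, stream);
cudaArray_t mappedArray;
cudaGraphicsSubResourceGetMappedArray(&mappedArray, gfxResource, 0, 0);
// ... launch CUDA kernels / copies that consume mappedArray on stream ...
cudaGraphicsUnmapResources(1, &gfxResource, stream);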
Moreover, I tried unbinding the FBO and unbinding the textures from the context before launching the compute shader. I even tried to launch one compute after the other, with a glMemoryBarrier set between them, and to fetch the target image of the first compute launch into CUDA. Still no sync. (Well, that makes sense, as two compute dispatches also run out of sync with each other.) Putting glMemoryBarrier after the compute shader stage doesn't sync either! It only works when I replace it with glFinish, or with any other operation that completely stalls the pipeline, like glMapBuffer(), for example.
So should I just use glFinish(), or am I missing something here?
Why doesn't glMemoryBarrier() sync the compute shader work before CUDA takes over?
UPDATE
I would like to refactor the question a little bit, as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC) the issue is still alive. So, I don't care about why glMemoryBarrier doesn't sync. What I want to know is:
Whether it is possible to synchronize the completion of OpenGL compute shader execution with CUDA's usage of that shared resource (in my case an OpenGL image) without stalling the whole rendering pipeline.
If the answer is 'yes', then how?
I know this is an old question, but if any poor soul stumbles upon this...
First, the reason glMemoryBarrier does not work: it requires the OpenGL driver to insert a barrier into the pipeline, but CUDA does not care about the OpenGL pipeline at all.
Second, the only other way outside of glFinish is to use glFenceSync in combination with glClientWaitSync:
....
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
... other work you might want to do that does not impact the buffer...
GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs);
if(res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) {
...handle timeouts and failures
}
cudaGraphicsMapResources(1, &gfxResource, stream);
...
This will cause the CPU to block until the GPU has finished all commands up to the fence, including memory transfers and compute operations.
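If blocking the calling thread right there is still too costly, a hedged variation on the same idea (not part of the original answer) is to poll the fence with a zero timeout and interleave unrelated CPU work until it signals:
// Sketch: poll the fence instead of blocking on it; fence, gfxResource and
// stream are the same placeholder names as in the snippet above.
GLenum status;
do {
    // ... do other CPU-side work that does not touch the shared image ...
    status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
} while (status == GL_TIMEOUT_EXPIRED);
if (status == GL_WAIT_FAILED) {
    // ...handle the failure...
}
cudaGraphicsMapResources(1, &gfxResource, stream);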
Unfortunately, there is no way to tell CUDA to wait on an OpenGL memory barrier/fence. If you really require the extra bit of asynchronicity, you'll have to switch to DirectX 12, for which CUDA supports importing fences/semaphores and waiting on them, as well as signaling them, from a CUDA stream via cuImportExternalSemaphore, cuWaitExternalSemaphoresAsync, and cuSignalExternalSemaphoresAsync.
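For completeness, a rough sketch of that DirectX 12 route with the CUDA driver API is shown below. The shared handle, fence value, and stream are placeholders (assumptions, not from the answer); the handle is assumed to come from ID3D12Device::CreateSharedHandle on the D3D12 fence.
// Sketch: import a D3D12 fence as a CUDA external semaphore and make a CUDA
// stream wait on it, so no CPU-side stall is needed.
CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC handleDesc = {};
handleDesc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE;
handleDesc.handle.win32.handle = sharedFenceHandle; // from ID3D12Device::CreateSharedHandle

CUexternalSemaphore extSem;
cuImportExternalSemaphore(&extSem, &handleDesc);

// Make the stream wait (on the GPU) until the fence reaches fenceValue.
CUDA_EXTERNAL_SEMAPHORE_WAIT_PARAMS waitParams = {};
waitParams.params.fence.value = fenceValue;
cuWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);

// ... enqueue CUDA work on the stream that consumes the shared resource ...

// Optionally signal the fence back to D3D12 once the CUDA work is queued.
CUDA_EXTERNAL_SEMAPHORE_SIGNAL_PARAMS signalParams = {};
signalParams.params.fence.value = fenceValue + 1;
cuSignalExternalSemaphoresAsync(&extSem, &signalParams, 1, stream);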