
OpenCL/OpenGL interop wasting CPU

Tags: opengl, opencl

I generate frames in OpenCL 60 times per second, using one OpenCL kernel call per frame, and write them to an OpenGL texture so that I can display them on screen. There's no performance problem and the frame rate is as expected. The problem is that it's very wasteful: it keeps at least one CPU core fully busy even when there is very little to do, like drawing a blank frame at a very low resolution. For comparison, when I don't use the OpenGL interop but instead write from the CL kernel to a generic buffer, copy that buffer back to the host and display it another way, the frame rate drops a bit (due to the back-and-forth overhead that the interop makes unnecessary), but the CPU usage is much lower when there's little to do.

This suggests that something is wrong with the way I do the interop, and I assume it creates some sort of busy wait.

Here's the relevant code, that is, the code that is present only when I use the interop. In one place in my loop I clear the GL texture and make OpenCL acquire it:

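    // Clear the shared texture to all zeroes
    uint32_t z = 0;
    glClearTexImage(fb.gltex, 0, GL_RGBA, GL_UNSIGNED_BYTE, &z);
    glFlush();
    glFinish();   // blocks until OpenGL has finished the commands issued so far

    // Hand the texture over to OpenCL
    clEnqueueAcquireGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, 0, NULL);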

Then I enqueue my OpenCL kernel, which writes to the texture through the cl_mem object fb.cl_srgb. Later I give control back to OpenGL in order to show the texture on the display:

    clEnqueueReleaseGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, 0, NULL);
    clFinish(fb.clctx.command_queue);   // this blocks until the kernel is done writing to the texture and releasing the texture

    // setting GL texture coordinates, probably not relevant to this question
    float hoff = 2. * (fb.h - fb.maxdim.y) / (double) fb.maxdim.y;
    glLoadIdentity();             // Reset the projection matrix
    glViewport(0, 0, fb.maxdim.x, fb.maxdim.y);

    glBegin(GL_QUADS);
    glTexCoord2f(0.f, 0.f); glVertex2f(-1., 1.+hoff);
    glTexCoord2f(1.f, 0.f); glVertex2f(1., 1.+hoff);
    glTexCoord2f(1.f, 1.f); glVertex2f(1., -1.+hoff);
    glTexCoord2f(0.f, 1.f); glVertex2f(-1., -1.+hoff);
    glEnd();

    SDL_GL_SwapWindow(fb.window);

It's hard for me to tell what is causing it because the high CPU usage is in another thread run by nvopencl64.dll (that's on my Windows 10 machine with an nVidia GPU, but I have a similar problem on a laptop with an Intel iGPU, also on Windows 10).

Profiling tells me that most of the CPU time is taken by WaitForSingleObjectEx (42% of the CPU time, exclusive) called from nvopencl64.dll, WaitForMultipleObjects (21%) called from nvoglv64.dll's DrvPresentBuffers, and the RtlUserThreadStart calls (16%) from which those WaitForMultipleObjects calls originate. That's on my nVidia GPU machine, but the situation looks much the same on a machine with only an Intel HD 5000 iGPU. So something very inefficient is clearly going on, probably with lots of threads being started far too often.

Michel Rouzic asked Jun 23 '19 18:06


1 Answer

It seems that when the device reports CL_DEVICE_PREFERRED_INTEROP_USER_SYNC as false, per-frame manual synchronisation with clEnqueueAcquireGLObjects and clEnqueueReleaseGLObjects is unneeded, except for a single clEnqueueAcquireGLObjects call after the OpenGL texture has been initialised. In that case glFinish appears to be the only form of synchronisation needed.
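As a rough sketch of that decision, the helper below turns the flag into a per-frame plan. The struct `interop_sync_plan` and the function `choose_interop_sync` are hypothetical names made up for illustration. The "false" branch encodes only what was observed above, not a guarantee from the OpenCL specification, and the "true" branch simply assumes the per-frame sequence from the question is kept.

```c
#include <stdbool.h>

/* Hypothetical summary of the synchronisation to perform on the shared texture. */
typedef struct {
    bool acquire_once_after_init;    /* one clEnqueueAcquireGLObjects after the texture is created */
    bool acquire_release_each_frame; /* acquire/release around the kernel on every frame */
    bool gl_finish_each_frame;       /* glFinish before the kernel writes to the texture */
    bool cl_finish_each_frame;       /* clFinish before OpenGL draws the texture */
} interop_sync_plan;

/*
 * preferred_user_sync is assumed to be the cl_bool obtained with
 *   clGetDeviceInfo(device, CL_DEVICE_PREFERRED_INTEROP_USER_SYNC,
 *                   sizeof(cl_bool), &value, NULL);
 * It is passed as a plain unsigned int so this sketch has no OpenCL dependency.
 */
static interop_sync_plan choose_interop_sync(unsigned int preferred_user_sync)
{
    interop_sync_plan plan;

    if (preferred_user_sync) {
        /* Assumption: keep the per-frame sequence shown in the question. */
        plan.acquire_once_after_init    = false;
        plan.acquire_release_each_frame = true;
        plan.gl_finish_each_frame       = true;
        plan.cl_finish_each_frame       = true;
    } else {
        /* Observation from this answer: a single acquire, then glFinish only. */
        plan.acquire_once_after_init    = true;
        plan.acquire_release_each_frame = false;
        plan.gl_finish_each_frame       = true;
        plan.cl_finish_each_frame       = false;
    }
    return plan;
}
```

The render loop would call `choose_interop_sync` once at start-up and then skip the per-frame clEnqueueAcquireGLObjects, clEnqueueReleaseGLObjects and clFinish calls whenever the plan says they are not needed.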

Michel Rouzic answered Oct 22 '22 01:10