
glClear() Takes Too Long - Android OpenGL ES 2

I'm developing an Android app using OpenGL ES 2. The problem I am encountering is that the glClear() function takes so long that frames are delayed and the game appears jittery. Output from a run with timing probes shows that setting up all vertices and images from the atlas takes less than 1 millisecond, while glClear() takes between 10 and 20 milliseconds. In fact, the clear often accounts for up to 95% of the total rendering time. My code is based on common tutorials, and the Render function is this:

private void Render(float[] m, short[] indices) {
    Log.d("time", "--START RENDER--");

    // get handle to vertex shader's vPosition member
    int mPositionHandle = GLES20.glGetAttribLocation(riGraphicTools.sp_Image, "vPosition");

    // Enable generic vertex attribute array
    GLES20.glEnableVertexAttribArray(mPositionHandle);

    // Prepare the triangle coordinate data
    GLES20.glVertexAttribPointer(mPositionHandle, 3,
            GLES20.GL_FLOAT, true,
            0, vertexBuffer);

    // Get handle to texture coordinates location
    int mTexCoordLoc = GLES20.glGetAttribLocation(riGraphicTools.sp_Image, "a_texCoord" );

    // Enable generic vertex attribute array
    GLES20.glEnableVertexAttribArray ( mTexCoordLoc );

    // Prepare the texturecoordinates
    GLES20.glVertexAttribPointer(mTexCoordLoc, 2,
            GLES20.GL_FLOAT, false,
            0, uvBuffer);

    // Get handle to shape's transformation matrix
    int mtrxhandle = GLES20.glGetUniformLocation(riGraphicTools.sp_Image, "uMVPMatrix");

    // Apply the projection and view transformation
    GLES20.glUniformMatrix4fv(mtrxhandle, 1, false, m, 0);

    // Get handle to textures locations
    int mSamplerLoc = GLES20.glGetUniformLocation (riGraphicTools.sp_Image, "s_texture" );

    // Set the sampler texture unit to 0, where we have saved the texture.
    GLES20.glUniform1i ( mSamplerLoc, 0);

    long clearTime = System.nanoTime();
    GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT);
    Log.d("time", "Clear time is " + (System.nanoTime() - clearTime));

    // Draw the triangles
    GLES20.glDrawElements(GLES20.GL_TRIANGLES, indices.length,
            GLES20.GL_UNSIGNED_SHORT, drawListBuffer);

    // Disable vertex array
    GLES20.glDisableVertexAttribArray(mPositionHandle);
    GLES20.glDisableVertexAttribArray(mTexCoordLoc);

    Log.d("time", "--END RENDER--");
}

I have tried moving the png atlas to /drawable-nodpi but it had no effect.

I have tried using the glFlush() and glFinish() functions as well. Interestingly, if I do not call glClear(), it appears to be called automatically anyway: the total rendering time is still as high as when I call it, and there are no remnants of the previous frame onscreen. Only the first call to glClear() is time-consuming. If it is called again, the subsequent calls take only 1 or 2 milliseconds.

I have also tried different combinations of parameters (such as GLES20.GL_DEPTH_BUFFER_BIT), and using glClearColor(). The clear time is still high.

Thank you in advance.

Ian asked Apr 10 '15 00:04

1 Answer

You're not measuring what you think you are. Measuring the elapsed time of an OpenGL API call is mostly meaningless.

Asynchronicity

The key aspect to understand is that OpenGL is an API to pass work to a GPU. The easiest mental model (which largely corresponds to reality) is that when you make OpenGL API calls, you queue up work that will later be submitted to the GPU. For example, if you make a glDraw*() call, picture the call building a work item that gets queued up, and at some point later will be submitted to the GPU for execution.

In other words, the API is highly asynchronous. The work you request by making API calls is not completed by the time the call returns. In most cases, it's not even submitted to the GPU for execution yet. It is only queued up, and will be submitted at some point later, mostly outside your control.

A consequence of this general approach is that the time you measure to make a glClear() call has pretty much nothing to do with how long it takes to clear the framebuffer.
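To make the point concrete, here is a minimal sketch in plain Java. It is a toy model, not real OpenGL: a single worker thread stands in for the GPU, and the names AsyncQueueDemo, fakeGpu and measure are made up for illustration. It shows that a timer wrapped around a call that merely queues work says very little about how long the work itself takes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy model only: a single worker thread stands in for the GPU. Not real OpenGL. */
final class AsyncQueueDemo {
    /** Returns {nanos spent in the "API call", nanos until the work actually finished}. */
    static long[] measure(long workMillis) {
        ExecutorService fakeGpu = Executors.newSingleThreadExecutor();
        try {
            long start = System.nanoTime();
            // The "API call": it only queues the work and returns right away.
            Future<?> pending = fakeGpu.submit(() -> sleepQuietly(workMillis));
            long callNanos = System.nanoTime() - start;

            // Waiting for completion is the part a timer around the call never sees.
            pending.get();
            long doneNanos = System.nanoTime() - start;
            return new long[] {callNanos, doneNanos};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            fakeGpu.shutdown();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] t = measure(50);
        System.out.println("call returned after " + t[0] / 1_000 + " us");
        System.out.println("work finished after " + t[1] / 1_000_000 + " ms");
    }
}
```

In this model the first number is the analogue of what a System.nanoTime() pair around a GL call reports, and the second is the time the simulated work actually needed.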

Synchronization

Now that we have established that the OpenGL API is asynchronous, the next concept to understand is that a certain level of synchronization is necessary.

Let's look at a workload where the overall throughput is limited by the GPU (either by GPU performance, or because the frame rate is capped by the display refresh). If we kept the whole system entirely asynchronous, and the CPU could produce GPU commands faster than the GPU can process them, we would queue up a steadily growing amount of work. This is undesirable for a couple of reasons:

  • In the extreme case, the amount of queued up work would grow towards infinity, and we would run out of memory just from storing the queued up GPU commands.
  • In apps that need to respond to user input, like games, we would get increasing latency between user input and rendering.

To avoid this, drivers use throttling mechanisms to prevent the CPU from getting too far ahead. The details can be fairly complex, but as a simple model, the driver might block the CPU when it gets more than 1-2 frames ahead of what the GPU has finished rendering. Ideally, for GPU-limited apps you always want some work queued up so that the GPU never goes idle, but you want to keep the amount of queued work as small as possible to minimize memory usage and latency.
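The simple throttling model can be sketched in plain Java with a counting semaphore. This is only an illustration of the idea, not how any particular driver is implemented; ThrottleDemo, run and the parameter names are hypothetical. The "CPU" loop has to take a slot before queuing a frame, and the fake GPU gives the slot back when the frame is done, so the CPU can never be more than maxAhead frames ahead.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of driver throttling: the CPU may be at most maxAhead frames ahead. */
final class ThrottleDemo {
    /** Queues `frames` frames and returns the largest number ever in flight at once. */
    static int run(int frames, int maxAhead, long gpuMillisPerFrame) {
        ExecutorService fakeGpu = Executors.newSingleThreadExecutor();
        Semaphore slots = new Semaphore(maxAhead);
        AtomicInteger inFlight = new AtomicInteger();
        int maxSeen = 0;
        try {
            for (int i = 0; i < frames; i++) {
                // Blocks once maxAhead frames are queued but not yet finished.
                slots.acquire();
                maxSeen = Math.max(maxSeen, inFlight.incrementAndGet());
                fakeGpu.execute(() -> {
                    sleepQuietly(gpuMillisPerFrame); // simulated GPU work
                    inFlight.decrementAndGet();
                    slots.release();                 // frame done, free one slot
                });
            }
            fakeGpu.shutdown();
            fakeGpu.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            fakeGpu.shutdownNow();
        }
        return maxSeen;
    }

    private static void sleepQuietly(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("max frames in flight: " + run(10, 2, 5));
    }
}
```

Without the semaphore, the loop would queue all frames almost instantly, which is the unbounded growth described in the two bullet points above.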

Meaning of Your Measurement

With all this background information explained, your measurements should be much less surprising. By far the most likely scenario is that your glClear() call triggers a synchronization, and the time you measure is the time it takes the GPU to catch up sufficiently, until it makes sense to submit more work.

Note that this does not mean that all the previously submitted work needs to complete. Let's look at a sequence that is somewhat hypothetical, but realistic enough to illustrate what can happen:

  • Let's say you make the glClear() call that forms the start of rendering frame n.
  • At this time, frame n - 3 is on the display, and the GPU is busy processing rendering commands for frame n - 2.
  • The driver decides that you really should not be getting more than 2 frames ahead. Therefore, it blocks in your glClear() call until the GPU has finished the rendering commands for frame n - 2.
  • It might also decide that it needs to wait until frame n - 2 is shown on the display, which means waiting for the next vertical sync (display refresh).
  • Now that frame n - 2 is on the display, the buffer that previously contained frame n - 3 is not used anymore. It is now ready to be used for frame n, which means that the glClear() command for frame n can now be submitted.

Note that while your glClear() call did all kinds of waiting in this scenario, which you measure as part of the elapsed time spent in the API call, none of this time was used for actually clearing the framebuffer for your frame. You were probably just sitting on some kind of semaphore (or similar synchronization mechanism), waiting for the GPU to complete previously submitted work.
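The scenario above can be imitated with a small pool of buffers in plain Java. Again this is a hedged toy model under made-up names (FirstCallWaitDemo, fakeGpu, freeBuffers), not real driver behavior: the first step of each frame waits for a free buffer, the second step only queues work. Timing both steps separately shows nearly all of the elapsed time landing in the first one, although it does no work of its own.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model: the first call of each frame waits for a free buffer. Not real OpenGL. */
final class FirstCallWaitDemo {
    /** Returns {total nanos inside the "clear" step, total nanos inside the "draw" step}. */
    static long[] run(int frames, int buffers, long gpuMillisPerFrame) {
        ExecutorService fakeGpu = Executors.newSingleThreadExecutor();
        BlockingQueue<Integer> freeBuffers = new ArrayBlockingQueue<>(buffers);
        for (int b = 0; b < buffers; b++) {
            freeBuffers.add(b);
        }
        long clearNanos = 0;
        long drawNanos = 0;
        try {
            for (int i = 0; i < frames; i++) {
                long t0 = System.nanoTime();
                // "clear": needs a render target, so it waits for a free buffer.
                int buffer = freeBuffers.take();
                long t1 = System.nanoTime();
                // "draw": only queues work; the fake GPU frees the buffer later.
                fakeGpu.execute(() -> {
                    sleepQuietly(gpuMillisPerFrame);
                    freeBuffers.add(buffer);
                });
                long t2 = System.nanoTime();
                clearNanos += t1 - t0;
                drawNanos += t2 - t1;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            fakeGpu.shutdown(); // already queued frames still run to completion
        }
        return new long[] {clearNanos, drawNanos};
    }

    private static void sleepQuietly(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] t = run(8, 2, 10);
        System.out.println("time in clear step: " + t[0] / 1_000_000 + " ms");
        System.out.println("time in draw step:  " + t[1] / 1_000_000 + " ms");
    }
}
```

If the two steps were swapped so that the draw step took the buffer, the wait would show up there instead, which mirrors the observation in the question that only the first glClear() of a frame is slow.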

Conclusion

Considering that your measurement is not directly helpful after all, what can you learn from it? Unfortunately not a whole lot.

If you observe that your frame rate does not meet your target, either because you see stuttering or, better, because you measured the frame rate over a period of time, the only thing you know for sure is that your rendering is too slow. Detailed performance analysis is much too big a topic for this format, but here is a rough overview of steps you could take:

  • Measure/profile your CPU usage to verify that you are really GPU limited.
  • Use GPU profiling tools that are often available from GPU vendors.
  • Simplify your rendering, or skip parts of it, and see how the performance changes. For example, does it get faster if you simplify the geometry? Then you might be limited by vertex processing. Does it get faster if you reduce the framebuffer size or simplify your fragment shaders? Then you are probably limited by fragment processing.
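For measuring the frame rate over a period of time, one possible sketch is a small helper that averages frame time over a window of frames instead of timing individual GL calls. FrameTimeMeter and its method names are hypothetical, not part of any Android API; timestamps are passed in explicitly so the class has no Android dependency.

```java
/** Averages frame time over a window of frames instead of timing single GL calls. */
final class FrameTimeMeter {
    private final int windowFrames;
    private boolean started = false;
    private long windowStartNanos;
    private int framesInWindow = 0;
    private double lastAverageMillis = Double.NaN;

    FrameTimeMeter(int windowFrames) {
        if (windowFrames <= 0) {
            throw new IllegalArgumentException("windowFrames must be > 0");
        }
        this.windowFrames = windowFrames;
    }

    /** Call once per frame with a monotonic timestamp such as System.nanoTime(). */
    void onFrame(long nowNanos) {
        if (!started) {
            // The first call only marks the start of the first window.
            started = true;
            windowStartNanos = nowNanos;
            return;
        }
        framesInWindow++;
        if (framesInWindow == windowFrames) {
            lastAverageMillis = (nowNanos - windowStartNanos) / 1e6 / windowFrames;
            windowStartNanos = nowNanos;
            framesInWindow = 0;
        }
    }

    /** Average frame time of the last completed window in ms, or NaN if none yet. */
    double averageMillis() {
        return lastAverageMillis;
    }

    public static void main(String[] args) {
        FrameTimeMeter meter = new FrameTimeMeter(3);
        long[] stamps = {0L, 16_000_000L, 32_000_000L, 48_000_000L};
        for (long s : stamps) {
            meter.onFrame(s);
        }
        System.out.println("average frame time: " + meter.averageMillis() + " ms");
    }
}
```

On Android, one way to use it would be to call onFrame(System.nanoTime()) at the start of GLSurfaceView.Renderer.onDrawFrame() and log averageMillis() occasionally. Because throttling spreads waits across whichever call happens to block, the per-frame average is the number that reflects what the user actually sees.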
Reto Koradi answered Sep 22 '22 11:09