Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the correct way to calculate the FPS given that GPUs have a task queue and are asynchronous?

I always assumed that the correct way to calculate the FPS was to simply time how long it took to do an iteration of your draw loop. And much of the internet seems to be in accordance.

But!

Modern graphics card are treated as asynchronous servers, so the draw loop sends out drawing instructions for vertex/texture/etc data already on the GPU. These calls do not block the calling thread until the request on the GPU completes, they are simply added to the GPU's task queue. So the surely the 'traditional' (and rather ubiquitous) method is just measuring the call dispatch time?

What prompted me to ask was I had implemented the traditional method and it gave consistently absurdly high framerates, even if what was being rendered caused the animation to become choppy. Re-reading my OpenGL SuperBible brought me to glGenQueries which allow me to time sections of the rendering pipeline.

To summarise, is the 'traditional' way of calculating FPS totally defunct with (barely) modern graphics cards? If so why are the GPU profiling techniques relatively unknown?

like image 826
cmannett85 Avatar asked Jan 08 '12 18:01

cmannett85


2 Answers

Measuring fps is hard. It's made harder by the fact that various people who want to measure fps don't necessarily want to measure the same thing. So ask yourself this. Why do you want an fps number?

Before I go on and dive into all the pitfalls and potential solutions, I do want to point out that this is by no means a problem specific to "modern graphics cards". If anything, it used to be way worse, with SGI-type machines where the rendering actually happened on a graphics susbsystem that could be remote to the client (as in, physically remote). GL1.0 was actually defined in terms of client-server.

Anyways. Back to the problem at hand.

fps, meaning frames per second, really is trying to convey, in a single number, a rough idea of the performance of your application, in a number that can be directly related to things like the screen refresh rate. for a 1st level approximation of performance, it does an ok job. It breaks completely as soon as you want to delve into more fine-grained analysis.

The problem is really that the thing that matters most as far as "feeling of smoothness" of an application, is when the picture you drew ends up on the screen. The secondary thing that matters quite a bit too is how long it took between the time you triggered an action and when its effect shows up on screen (the total latency).

As an application draws a series of frames, it submits them at times s0, s1, s2, s3,... and they end up showing on screen at t0, t1, t2, t3,...

To feel smooth you need all the following things:

  1. tn-sn is not too high (latency)
  2. t(n+1)-t(n) is small (under 30ms)
  3. there is also a hard constraint on the simulation delta time, which I'll talk about later.

When you measure the CPU time for your rendering, you end up measuring s1-s0 to approximate t1-t0. As it turns out, this, on average, is not far from the truth, as client code will never go "too far ahead" (this is assuming you're rendering frames all the time though. See below for other cases). What does happen in fact is that the GL will end up blocking the CPU (typically at SwapBuffer time) when it tries to go too far ahead. That blocking time is roughly the extra time taken by the GPU compared to the CPU on a single frame.

If you really want to measure t1-t0, as you mentioned in your own post, Queries are closer to it. But... Things are never really that simple. The first problem is that if you're CPU bound (meaning your CPU is not quick enough to always provide work to the GPU), then a part of the time t1-t0 is actually idle GPU time. That won't get captured by a Query. The next problem you hit is that depending on your environment (display compositing environment, vsync), queries may actually only measure the time your application spends on rendering to a back buffer, which is not the full rendering time (as the display has not been updated at that time). It does get you a rough idea of how long your rendering will take, but will not be precise either. Further note that Queries are also subject to the asynchronicity of the graphics part. So if your GPU is idle part of the time, the query may miss that part. (e.g. say your CPU is taking very long (100ms) to submit your frame. The the GPU executes the full frame in 10ms. Your query will likely report 10ms, even though the total processing time was closer to 100ms...).

Now, with respect to "event-based rendering" as opposed to continuous one I've discussed so far. fps for those types of workloads doesn't make much sense, as the goal is not to draw as many f per s as possible. There the natural metric for GPU performance is ms/f. That said, it is only a small part of the picture. What really matters there is the time it took from the time you decided you wanted to update the screen and the time it happened. Unfortunately, that number is hard to find: It typically starts when you receive an event that triggers the process and ends when the screen is updated (something that you can only measure with a camera capturing the screen output...).

The problem is that between the 2, you have potential overlap between the CPU and GPU processing, or not (or even, some delay between the time the CPU stops submitting commands and the GPU starts executing them). And that is completely up to the implementation to decide. The best you can do is to call glFinish at the end of the rendering to know for sure the GPU is done processing the commands you sent, and measure the time on the CPU. That solution does reduce the overall performance of the CPU side, and potentially the GPU side as well if you were going to submit the next event right after...

Last the discussion about the "hard constraint on simulation delta time":

A typical animation uses a delta time between frames to move the animation forward. The major problem is that for a fully smooth animation, you really want the delta time you use when submitting your frame at s1 to be t1-t0 (so that when t1 shows, the time that actually was spent from the previous frame was indeed t1-t0). The problem of course is that you have no idea what t1-t0 is at the time you submit s1... So you typically use an approximation. Many just use s1-s0, but that can break down - e.g. SLI-type systems can have some delays in AFR rendering between the various GPUs). You could also try to use an approximation of t1-t0 (or more likely t0-t(-1)) through queries. The result of getting this wrong is mostly likely micro-stuttering on SLI systems.

The most robust solution is to say "lock to 30fps, and always use 1/30s". It's also the one that allows the least leeway on content and hardware, as you have to ensure your rendering can indeed be done in those 33ms... But is what some console developers choose to do (fixed hardware makes it somewhat simpler).

like image 133
Bahbar Avatar answered Oct 20 '22 23:10

Bahbar


"And much of the internet seems to be in accordance." doesn't seem totally correct for me:

Most of the publications would measure how long it takes to MANY iterations, then normalize. This way you can reasonably assume, that filling (and aemptying) the pipe are only a small part of the overall time.

like image 21
Eugen Rieck Avatar answered Oct 21 '22 00:10

Eugen Rieck