Prometheus rate functions and interval selections

Tags: prometheus

I am doing some monitoring with Prometheus and am trying to understand how to use the rate functions properly.

The premise is this: I have a counter, and Prometheus is configured to ingest new values for it every 15s.

Now I am trying to graph the per-second rate of this counter, so I use the rate function like this:

rate(pgbouncer_sent_bytes_total{job="pgbouncer", database="worker"}[1m])

As I interpret the rate function, the query gives me a rolling average rate (over 1m look-back windows) at each point in time that is queried. The spacing between points is determined by the resolution used.

Below is a screenshot from the Prometheus console showing the raw data graph and the plot from the rate query above at a 1m resolution. The resulting rate graph does not really match what I would expect from the raw data in the bottom graph.

[Image: data graphs]

The interesting bit is also that the resulting graph looks very different depending on the point in time at which it is loaded. Simply reloading the same graph a few times in a row completely changes its appearance, to the point where it does not even look like it represents the same data. The image below shows the same dataset a few minutes later, but the same thing happens even seconds later.

[Image: rate reloaded graph]

Could someone shed some light on what is really going on here?

Pelleplutt asked Aug 12 '16 09:08



1 Answer

AFAICT, the weird results are caused by (1) your counter actually increasing only once every minute, even though you collect it every 15 seconds, combined with (2) Prometheus' rate() implementation discarding every 4th counter increase (in your particular setup).

More precisely, you appear to be computing a 1-minute rate every 1 minute, over a counter that is scraped at 15-second resolution and increases once every minute (on average).

Essentially, this means that Prometheus slices your 1-hour interval into disjoint 1-minute ranges and estimates the rate over each range. The first value is the extrapolated rate of increase between points 0 and 3, the second is the extrapolated rate between points 4 and 7, and so on. Because your counter only actually increases once a minute, you can run into 2 different situations:

  1. Your counter increases happen between point pairs 3-4, 7-8 etc. In this case Prometheus sees an increase rate of zero (because there is no increase between points 0 and 3, points 4 and 7 etc.). This seems to be happening in the first half of your first graph.
  2. Your counter increases happen somewhere between points 0-3, 4-7 etc. In this case Prometheus takes the difference between the last and first points in each interval (your actual counter increase), divides it by the time difference between the 2 points (on average 45 seconds), then extrapolates that to 1 minute (essentially overestimating it by a factor of 1.33: I'm eyeballing an increase of ~200k over ~50 minutes, so an average rate of about 67 QPS, whereas rate() returns something closer to 90 QPS). This is what happens in the second half of your graph.

This is also why your graph looks wildly different across refreshes. The argument for the current implementation of rate() is that it is "correct on average". Which, if you look at the whole of your graph, across refreshes, is true. </sarcasm>

Essentially, graphing a Prometheus rate() or increase() over a time range R with resolution R results in aliasing: on anything but a smoothly increasing counter it either overestimates (1.33x in your case) or underestimates (zero in your case).
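
To make the two situations concrete, here is a rough Python sketch. It is a simplified model, not Prometheus' actual code: `simplified_rate` just divides the increase between the first and last sample in the window by the time between them, and ignores counter resets and the limits Prometheus puts on extrapolation. The 4000-per-minute bump (about 67 per second on average), the bump offsets and all names are made up for illustration.

```python
SCRAPE = 15   # seconds between scrapes (from the question)
WINDOW = 60   # the [1m] range selector
STEP = 60     # graph resolution of 1m
BUMP = 4000   # assumed counter increase, once per minute (~67/s on average)


def counter_at(ts, bump_offset):
    """Counter value at time ts if it jumps by BUMP at bump_offset, bump_offset+60, ..."""
    if ts < bump_offset:
        return 0
    return ((ts - bump_offset) // 60 + 1) * BUMP


def simplified_rate(samples, t, window):
    """Simplified model: increase between first and last sample in (t-window, t],
    divided by the time between those two samples."""
    pts = [(ts, v) for ts, v in samples if t - window < ts <= t]
    if len(pts) < 2:
        return None
    (t0, v0), (t1, v1) = pts[0], pts[-1]
    return (v1 - v0) / (t1 - t0)


def graph(bump_offset, first_eval=55, points=5):
    """Evaluate the 1m rate once per minute, as a 1m-resolution graph would."""
    samples = [(ts, counter_at(ts, bump_offset)) for ts in range(0, 600, SCRAPE)]
    return [simplified_rate(samples, first_eval + i * STEP, WINDOW)
            for i in range(points)]


# Situation 1: bumps land between windows (between points 3-4, 7-8, ...).
rates_between_windows = graph(bump_offset=50)
# Situation 2: bumps land inside each window (between points 0-3, 4-7, ...).
rates_inside_windows = graph(bump_offset=20)
true_rate = BUMP / 60

print([round(r, 1) for r in rates_between_windows])
print([round(r, 1) for r in rates_inside_windows])
print(round(true_rate, 1))
```

With these assumed numbers, the first list is all zeros and the second is all about 88.9, against a true average of about 66.7 per second, i.e. the zero and 1.33x cases.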

You can work around it by replacing your expression with:

rate(foo[75s]) / 75  * 60

This way you'll actually get the rate of increase between data points 1 minute apart (a 75 seconds range will almost always return exactly 5 points, so 4 counter increases) and reverse the extrapolation to 75 seconds that Prometheus does. There will be some noise in edge cases (e.g. if your evaluation is aligned with scraping times it's possible to get 6 points in one range and 4 in the next due to scrape interval jitter) but you're getting that anyway with rate().

BTW, you can see the aliasing by increasing the resolution of your graph to something like 1 second (anything 15 seconds or below should show it clearly).
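
A sketch of that experiment in Python, under a simplified model of rate() (the increase between the first and last sample in the window, divided by the time between them; counter resets and Prometheus' extrapolation limits are ignored). This is an assumption for illustration, not Prometheus' real implementation, and the 4000-per-minute bump, its offset and the names are hypothetical. It evaluates the 1m rate every 15 seconds instead of every minute.

```python
SCRAPE = 15        # seconds between scrapes (from the question)
WINDOW = 60        # the [1m] range selector
BUMP = 4000        # assumed counter increase, once per minute
BUMP_OFFSET = 50   # assumed: the bump happens 50s into each minute


def counter_at(ts):
    """Counter value at time ts; jumps by BUMP at 50s, 110s, 170s, ..."""
    if ts < BUMP_OFFSET:
        return 0
    return ((ts - BUMP_OFFSET) // 60 + 1) * BUMP


samples = [(ts, counter_at(ts)) for ts in range(0, 600, SCRAPE)]


def simplified_rate(t):
    """Simplified model: increase between first and last sample in (t-WINDOW, t],
    divided by the time between those two samples."""
    pts = [(ts, v) for ts, v in samples if t - WINDOW < ts <= t]
    (t0, v0), (t1, v1) = pts[0], pts[-1]
    return (v1 - v0) / (t1 - t0)


# Evaluate every 15 seconds (16 points, i.e. 4 full minutes).
fine_grained = [simplified_rate(t) for t in range(55, 295, 15)]
mean_rate = sum(fine_grained) / len(fine_grained)

print([round(r, 1) for r in fine_grained])
print(round(mean_rate, 1), round(BUMP / 60, 1))
```

In this model the output cycles through one zero followed by three values of about 88.9, and the mean over whole cycles comes out at about 66.7, the true per-second rate. A 1m-resolution graph only ever samples one position in that cycle, which is why it shows either zero or the overestimate.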

Alin Sînpălean answered Oct 18 '22 20:10