Prometheus rate functions and interval selections

Tags:

prometheus

I am doing some monitoring with Prometheus and am trying to understand how to properly use the rate functions.

The premise is this: I have a counter, and the configuration is set to ingest new values for it every 15s.

Now I am trying to graph the per-second rate of this counter, so using the rate function I do this:

rate(pgbouncer_sent_bytes_total{job="pgbouncer", database="worker"}[1m])

As I interpret the rate function, the query gives me a rolling average rate (over a 1m look-back window) at each point in time that is queried. The spacing between those points is determined by the resolution used.

Below is a screenshot from the Prometheus console that includes the raw data graph and the plot from the rate query above at a 1m resolution. The resulting rate graph does not really match what I would expect from the raw data in the bottom graph.

data graphs

The interesting bit is also that the resulting graph looks very different depending on the point in time at which it is loaded. Simply reloading the same graph a couple of times in a row completely changes its appearance, to the point where it does not even look like it represents the same data. The image below is the same dataset a few minutes later, but the same thing occurs even seconds later.

rate reloaded graph

Could someone shed some light on what is really going on here?


asked Aug 12 '16 09:08

Pelleplutt

1 Answer

AFAICT the cause of the weird results is (1) the fact that your counter actually only increases once every minute, even though you collect it every 15 seconds, combined with (2) Prometheus' rate() implementation discarding every 4th counter increase (in your particular setup).

More precisely, you appear to be computing a 1 minute rate, every 1 minute, over a counter that is scraped at 15 second resolution and increases every 1 minute (on average).

What this means is that Prometheus will essentially slice your 1 hour interval into disjoint 1 minute ranges and estimate the rate over each range. The first value will be the extrapolated rate of increase between points 0 and 3, the second will be the extrapolated rate between points 4 and 7, and so on. Because your counter only actually increases once a minute, you can run into 2 different situations:

1. Your counter increases happen between point pairs 3-4, 7-8 etc. In this case Prometheus sees an increase rate of zero (because there is no increase between points 0 and 3, points 4 and 7 etc.). This seems to be happening in the first half of your first graph.
2. Your counter increases happen somewhere between points 0-3, 4-7 etc. In this case Prometheus takes the difference between the last and first points in each interval (your actual counter increase), divides it by the time difference between the 2 points (on average 45 seconds), then extrapolates that to 1 minute, essentially overestimating it by a factor of 60/45 ≈ 1.33. I'm eyeballing an increase of ~200k over ~50 minutes, so an average rate of about 67 QPS, whereas rate() returns something closer to 90 QPS. This is what happens in the second half of your graph.

This is also why your graph looks wildly different across refreshes. The argument for the current implementation of rate() is that it is "correct on average". Which, if you look at the whole of your graph, across refreshes, is true. </sarcasm>

Essentially graphing a Prometheus rate() or increase() over a time range R with resolution R will result in aliasing, either overestimating (1.33x in your case) or underestimating (zero in your case) on anything but a smoothly increasing counter.
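
To make the two situations concrete, here is a small Python sketch (it is not Prometheus code). It assumes a hypothetical counter that jumps by 4000 once a minute (roughly the ~200k over ~50 minutes eyeballed above), scraped every 15 seconds and graphed with a 1m range at 1m resolution. The simplified_rate function is only a simplified model of the extrapolation rate() performs as I understand it: it ignores counter resets and other corner cases, and all names in it are made up for illustration.

```python
import math

SCRAPE_INTERVAL = 15  # seconds, as in the question
WINDOW = 60           # the [1m] range
STEP = 60             # graph resolution (1m)
BUMP = 4000           # hypothetical increase per minute (~200k over ~50 min)


def counter_at(t, bump_offset):
    """Counter that jumps by BUMP at bump_offset, bump_offset + 60, ..."""
    if t < bump_offset:
        return 0
    return BUMP * (math.floor((t - bump_offset) / 60) + 1)


def simplified_rate(samples, range_start, range_end):
    """Simplified model of rate(): no counter resets, no zero-point limit."""
    if len(samples) < 2:
        return None
    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    sampled = last_t - first_t
    avg_gap = sampled / (len(samples) - 1)
    threshold = avg_gap * 1.1
    to_start = first_t - range_start
    to_end = range_end - last_t
    # Extrapolate to the window edges when the samples sit close to them,
    # otherwise only by half an average sample gap.
    extrapolate_to = sampled
    extrapolate_to += to_start if to_start < threshold else avg_gap / 2
    extrapolate_to += to_end if to_end < threshold else avg_gap / 2
    increase = (last_v - first_v) * extrapolate_to / sampled
    return increase / (range_end - range_start)


def graph(bump_offset, eval_offset=7, points=5):
    """Evaluate the modelled rate once per STEP over disjoint 1m windows."""
    rates = []
    for j in range(2, 2 + points):
        end = eval_offset + STEP * j
        start = end - WINDOW
        samples = [(t, counter_at(t, bump_offset))
                   for t in range(0, end + 1, SCRAPE_INTERVAL)
                   if start < t <= end]
        rates.append(simplified_rate(samples, start, end))
    return rates


# Situation 1: every jump lands between the last sample of one window
# and the first sample of the next, so no window ever sees an increase.
rates_between_windows = graph(bump_offset=10)

# Situation 2: every jump lands between two samples of the same window.
rates_inside_windows = graph(bump_offset=40)

print(rates_between_windows)
print(rates_inside_windows)
print(BUMP / 60)  # true average per-second rate
```

Under these assumptions the first series is flat at 0 and the second is flat at 4000 / 45 ≈ 88.9 per second, while the true average is 4000 / 60 ≈ 66.7 per second. The only difference between the two runs is where the jumps fall relative to the evaluation windows, which is the same thing that changes when you reload the graph.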

You can work around it by replacing your expression with:

rate(foo[75s]) / 75  * 60

This way you'll actually get the rate of increase between data points 1 minute apart (a 75 seconds range will almost always return exactly 5 points, so 4 counter increases) and reverse the extrapolation to 75 seconds that Prometheus does. There will be some noise in edge cases (e.g. if your evaluation is aligned with scraping times it's possible to get 6 points in one range and 4 in the next due to scrape interval jitter) but you're getting that anyway with rate().

BTW, you can see the aliasing by increasing the resolution of your graph to something like 1 second (anything 15 seconds or below should show it clearly).


answered Oct 18 '22 20:10

Alin Sînpălean
