I know that CPU utilization is given by the percentage of non-idle time over the total CPU time. In Prometheus, the rate or irate functions calculate the per-second rate of change over a range vector.
People often calculate CPU utilisation with the following PromQL expression:
(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100))
I don't understand how calculating the per second change of non-idle time is equivalent to calculating the CPU usage. Can somebody explain mathematically why this makes sense?
There are a couple of things to unwrap here.
First, rate vs irate. Neither the linked question nor the blog post addresses this (but Eitan's answer does touch on it). The difference is that rate estimates the average rate over the requested range (1 minute, in your case) while irate computes the rate based on the last 2 samples only. Leaving aside the "estimate" part (see this answer if you're curious), the practical difference between the two is that rate will smooth out the result, whereas irate will return a sampling of CPU usage, which is more likely to show extremes in CPU usage but is also more prone to aliasing.
E.g. if you look at Prometheus' own CPU usage, you'll notice that it sits at a somewhat constant baseline, with a spike every time a large rule group is evaluated. Given a time range at least as long as Prometheus' evaluation interval, if you used rate you'd get a more or less constant CPU usage over time (i.e. a flat line). With irate (assuming a scrape interval of 5s) you'd get one of 2 things:

- if the graph resolution was not aligned with the rule evaluation interval (e.g. a 1m resolution and a 13s evaluation interval), you'd get a random sampling of CPU usage and would hopefully see values close to both the highest and lowest CPU usage over time on a graph;
- if the graph resolution was aligned with the rule evaluation interval (e.g. a 1m resolution and a 15s evaluation interval), you'd either see the baseline CPU usage everywhere (because you happen to look at 5s intervals set 1 minute apart, when no rule evaluation happens) or the peak CPU usage everywhere (because you happen to look at 5s intervals 1 minute apart that each cover a rule evaluation).
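Concretely, the two variants of the query from the question would look like this (just a sketch reusing the 1m window and the avg by (instance) aggregation from your expression):

# smoothed: average rate over the whole 1-minute window
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100

# sampled: rate based only on the last 2 samples in the window (spikier, prone to aliasing)
100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100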
Regarding the second point, the apparent confusion over what the node_cpu_seconds_total metric represents: it is a counter. Meaning it's a number that increases continuously and essentially measures the amount of time the CPU was idle since boot. The absolute value is not all that useful (as it depends on how long the machine has been up and drops back to 0 on a counter reset). What's interesting about it is by how much it increased over a period of time: from that you can compute, for a given period of time, a rate of increase per second (average, with rate; instant, with irate) or an absolute increase (with increase). So both rate(node_cpu_seconds_total{mode="idle"}[1m]) and irate(node_cpu_seconds_total{mode="idle"}[1m]) will give you a ratio (between 0.0 and 1.0) of how much each CPU was idle (over the past minute, and between the last 2 samples, respectively).
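To make the math concrete, here is a small worked example with made-up numbers (hypothetical samples, assuming a single CPU and ignoring the extrapolation detail mentioned above):

# idle counter at the start of the 1m window: 10000.0 seconds
# idle counter at the end of the 1m window:   10054.0 seconds
# increase over [1m]: 10054.0 - 10000.0 = 54 idle seconds out of 60 elapsed seconds
# rate(node_cpu_seconds_total{mode="idle"}[1m]) ≈ 54 / 60 = 0.9 idle seconds per second
# i.e. the CPU was idle 90% of the time, so: 100 - 0.9 * 100 = 10% CPU utilisation
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100

With multiple CPUs there is one idle time series per core, and avg by (instance) averages those per-core idle ratios into a single idle ratio per machine before it is turned into a percentage.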