 

Alert on missing series/data

I'm trying to understand how I can get Grafana to alert me when a metric is no longer being scraped.

The metric I'm using for this example is mongodb_instance_uptime_seconds. When the instance goes down, the metric is no longer generated, so it goes missing in Prometheus. At the moment the alert triggers when last() of query(A, 1m, now) < 600. As you can see, the goal was to alert when the uptime is below 5 minutes, i.e. to catch restarts and stops. But Grafana won't alert when an instance goes down, because the last() value simply doesn't exist, and once the instance has been down for more than 5 minutes it isn't reported at all anymore.

Any clues on how to move forward?

asked Oct 15 '18 by rels
2 Answers

The metric typically used to determine whether an instance is being scraped successfully is up. It is auto-generated for every scrape job, so if you want an alert for any scrape endpoint that is down, just use the query up == 0, which matches any endpoint whose last scrape was unsuccessful. If you want to alert only on this specific endpoint, add label matchers, e.g. up{instance="mongodb.foo.com",job="mongo"} == 0.

If you're ever interested in using Alertmanager instead of Grafana for this, the rule would look like:

groups:
- name: General
  rules:
  - alert: Endpoint_Down
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Exporter is down: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} is not able to be scraped by Prometheus."
answered Oct 07 '22 by wbh1

If you know in advance all the labels of the monitored time series, then the absent_over_time function may be used for alerting. For example, the following query returns a non-empty result (i.e. an alert) when the metric mongodb_instance_uptime_seconds{instance="foo",job="bar"} had no new samples during the last 5 minutes:

absent_over_time(mongodb_instance_uptime_seconds{instance="foo",job="bar"}[5m])
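Wired into a Prometheus alerting rule file, this could look like the sketch below; the group name, alert name, severity label, and annotation text are illustrative choices, not something from the answer itself:

```yaml
groups:
- name: Mongo
  rules:
  # Fires when the series had no samples in the last 5 minutes.
  # Note: all labels must be spelled out in the selector, because
  # absent_over_time() has no matching series to copy labels from.
  - alert: MongoMetricMissing
    expr: absent_over_time(mongodb_instance_uptime_seconds{instance="foo",job="bar"}[5m])
    labels:
      severity: critical
    annotations:
      summary: "mongodb_instance_uptime_seconds is missing for instance foo"
```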

Unfortunately, neither absent nor absent_over_time can return multiple results when only some of the matching time series disappear. For example, if there are two time series:

mongodb_instance_uptime_seconds{instance="foo"}
mongodb_instance_uptime_seconds{instance="bar"}

And only one of them stops receiving new samples (say, mongodb_instance_uptime_seconds{instance="foo"} gets no new samples, while mongodb_instance_uptime_seconds{instance="bar"} keeps receiving them), then the following queries won't return the expected alert for mongodb_instance_uptime_seconds{instance="foo"}:

absent(mongodb_instance_uptime_seconds)
absent_over_time(mongodb_instance_uptime_seconds[5m])

Prometheus doesn't yet provide a solution to this issue, while VictoriaMetrics provides the lag() function, which can be used for alerting in this case. For example, the following MetricsQL query alerts (i.e. returns a non-empty result) when at least one time series with the name mongodb_instance_uptime_seconds stops receiving new samples for more than 5 minutes during the last hour:

lag(mongodb_instance_uptime_seconds[1h]) > 5m

This alert remains active for one hour after the time series stops receiving new samples. The duration the alert stays active can be adjusted by changing the lookbehind window in square brackets.
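For completeness, that MetricsQL expression might be wrapped in a vmalert rule file like the sketch below; the group name, alert name, and annotations are illustrative assumptions:

```yaml
groups:
- name: Mongo
  rules:
  # lag() is MetricsQL-only (evaluated by vmalert/VictoriaMetrics):
  # alerts when any matching series has gone more than 5 minutes
  # without a new sample within the last hour.
  - alert: MongoSeriesStale
    expr: lag(mongodb_instance_uptime_seconds[1h]) > 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} stopped sending mongodb_instance_uptime_seconds"
```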

answered Oct 07 '22 by valyala