We use Prometheus with Alertmanager for alerting. We frequently lose metrics because of connection problems.
When metrics are missing, Prometheus clears the alerts and sends resolved notifications. A few minutes later the connection problem is fixed and the alerts fire again.
Is there any way to stop the resolved notifications when metric data is missing?
For example, when a node goes down, the other alerts for that node (CPU and disk usage checks) are resolved.
Relevant values from the Alertmanager and Prometheus configs:
repeat_interval: 1d
resolve_timeout: 15m
group_wait: 1m30s
group_interval: 5m
scrape_interval: 1m
scrape_timeout: 1m
evaluation_interval: 30s
NodeDown alert:
- alert: NodeDown
  expr: up == 0
  for: 30s
  labels:
    severity: critical
    alert_group: host
  annotations:
    summary: "Node is down: instance {{ $labels.instance }}"
    description: "Can't react to node_exporter at {{ $labels.instance }}. Probably host is down."
Alertmanager can inhibit (that is, automatically silence) alerts under certain conditions. While the inhibiting condition holds, you will not receive notifications for inhibited alerts, whether firing or resolved. Here is an example of such a rule:
inhibit_rules:
  - # Mute alerts whose "severity" label equals "warning" ...
    target_matchers:
      - severity = warning
    # ... when an alert named "ExporterDown" is firing ...
    source_matchers:
      - alertname = ExporterDown
    # ... if both the muted and the firing alerts have exactly the same "job" and "instance" labels.
    equal: [instance, job]
To summarize, the rule above automatically silences all warning alerts for a given machine while its metric source is down. The link above leads to the documentation, where you can find more on the subject.
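Adapted to the NodeDown rule from the question, the same idea might look roughly like the sketch below. It rests on two assumptions that are not confirmed by the question: that the CPU and disk alerts also carry the `alert_group: host` label, and that they share the same `instance` label value as the `up` series behind NodeDown. The `alertname != NodeDown` matcher is there so that NodeDown itself is never a target of the rule. Treat it as a starting point and test it against your own alerts.

```yaml
inhibit_rules:
  - # Source: the alert that signals the host is unreachable.
    source_matchers:
      - alertname = NodeDown
    # Targets: every other host-level alert (assumes they are
    # labelled alert_group: host, like NodeDown in the question).
    target_matchers:
      - alert_group = host
      - alertname != NodeDown
    # Only mute alerts that belong to the same machine as the source
    # (assumes both sides expose an identical "instance" label).
    equal: [instance]
```

If the other alerts come from a different job or use a different `instance` format (for example with another port), the `equal` list will not match and nothing is inhibited, so it is worth comparing the label sets of both alerts in the Alertmanager UI first.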