
Alertmanager, different interval for different alert rules

I'm using Alertmanager to get alerts for Prometheus metrics. I have different alert rules for different metrics. Is it possible to set a different interval for each alert rule? For example, for metric1 I have rule1, which should be checked once a day, and for metric2 I have rule2, which should be checked every 2 hours.

asked Jun 16 '26 23:06 by SMA



1 Answer

The for: 5m property ensures that the rule expression stays true for 5 continuous minutes before the alert is triggered. For example, if there is a 30-second spike in CPU usage, the alert will not be triggered, because the for property is set to 5 minutes. Hence this is not the right property for your case.

I believe you can use Alertmanager's repeat_interval to control how often notifications are sent. The alert itself keeps firing for as long as its condition holds, but the notification is re-sent only once per repeat_interval. The relevant route settings are:

  • group_wait sets how long to initially wait to send a notification for a particular group of alerts.
  • group_interval dictates how long to wait before sending notifications about new alerts that are added to a group of alerts that has already been alerted on.
  • repeat_interval is used to determine the wait time before a firing alert that has already been successfully sent to the receiver is sent again.

To put them to work, you have to define labels for each alert. For example, in my alerts.yml file I create the labels app_type: server and app_type: service:

groups:
- name: monitor_cpu
  rules:
  # Note: the alert name says "gt_50", but the expression below compares against 5.5.
  - alert: job:node_cpu_usage:percentage_gt_50
    expr: 100 * node_cpu_seconds_total{mode="user"} / ignoring(mode) group_left sum(node_cpu_seconds_total) without(mode) > 5.5
    for: 1m
    labels:
      severity: critical
      app_type: server
    annotations:
      summary: "High CPU usage"
      description: "Server {{ $labels.instance }} has high CPU usage."
- name: targets
  rules:
  - alert: monitor_service_down
    expr: up == 0
    for: 1m
    labels:
      severity: critical
      app_type: service
    annotations:
      summary: "Monitor service non-operational"
      description: "Service {{ $labels.instance }} is down."
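Note that repeat_interval only changes how often a notification is re-sent, not how often a rule is checked. If the goal is to evaluate the rules themselves at different frequencies, Prometheus rule groups accept an optional interval field that overrides the global evaluation_interval for that group. A minimal sketch, assuming metric1, metric2, and the thresholds are placeholders taken from the question rather than real metrics:

```yaml
groups:
- name: daily_checks
  interval: 24h          # rules in this group are evaluated once a day
  rules:
  - alert: rule1
    expr: metric1 > 0    # hypothetical expression for metric1
- name: two_hourly_checks
  interval: 2h           # rules in this group are evaluated every 2 hours
  rules:
  - alert: rule2
    expr: metric2 > 0    # hypothetical expression for metric2
```

The interval applies to the whole group, so rules that need different evaluation frequencies have to live in separate groups.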

Then I create a route tree that sends notifications to different receivers by matching on that label. This is the solution I use: I define a different group_wait, group_interval, and repeat_interval for each route. That way you can use repeat_interval: 1h in one leaf route and repeat_interval: 24h in another:

global:
  smtp_from: '[email protected]'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: '[email protected]'
  smtp_auth_identity: '[email protected]'
  smtp_auth_password: ''

route:
  receiver: 'admin-team'
  routes:
    - match_re:
        app_type: (server|service)
      receiver: 'admin-team'
      routes:
      - match:
          app_type: server
        receiver: 'admin-team'
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 1h
      - match:
          app_type: service
        receiver: 'dev-team'
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 24h

receivers:
 - name: 'admin-team'
   email_configs:
   - to: '[email protected]'

 - name: 'dev-team'
   email_configs:
   - to: '[email protected]'
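In recent Alertmanager releases (0.22 and later, as far as I know), match and match_re are deprecated in favor of matchers. The route tree above could be sketched in that syntax as follows, assuming the same receivers and labels:

```yaml
route:
  receiver: 'admin-team'
  routes:
    - matchers:
        - app_type=~"server|service"   # regex matcher, replaces match_re
      receiver: 'admin-team'
      routes:
      - matchers:
          - app_type="server"          # equality matcher, replaces match
        receiver: 'admin-team'
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 1h
      - matchers:
          - app_type="service"
        receiver: 'dev-team'
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 24h
```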

Unfortunately, I did not test this with a 24-hour interval, only with gaps of a few minutes, and that worked. I expect it to work for longer intervals as well.

answered Jun 21 '26 05:06 by Felipe



