Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to automatically test Prometheus alerts?

We are about to setup Prometheus for monitoring and alerting for our cloud services including a continous integration & deployment pipeline for the Prometheus service and configuration like alerting rules / thresholds. For that I am thinking about 3 categories I want to write automated tests for:

  1. Basic syntax checks for configuration during deployment (we already do this with promtool and amtool)
  2. Tests for alert rules (what leads to alerts) during deployment
  3. Tests for alert routing (who gets alerted about what) during deployment
  4. Recurring check if the alerting system is working properly in production

Most important part to me right now is testing the alert rules (category 1) but I have found no tooling to do that. I could imagine setting up a Prometheus instance during deployment, feeding it with some metric samples (worrying how would I do that with the Pull-architecture of Prometheus?) and then running queries against it.

The only thing I found so far is a blog post about monitoring the Prometheus Alertmanager chain as a whole related to the third category.

Has anyone done something like that or is there anything I missed?

like image 854
bsingr Avatar asked Oct 05 '17 15:10

bsingr


1 Answers

New version of Prometheus (2.5) allows to write tests for alerts, here is a link. You can check points 1 and 2. You have to define data and expected output (for example in test.yml):

rule_files:
    - alerts.yml
evaluation_interval: 1m
tests:
# Test 1.
- interval: 1m
  # Series data.
  input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '1+0x6 0 0 0 0 0 0 0 0' # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

  # Unit test for alerting rules.
  alert_rule_test:
      # Unit test 1.
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
            # Alert 1.
            - exp_labels:
                  severity: page
                  instance: localhost:9090
                  job: prometheus
              exp_annotations:
                  summary: "Instance localhost:9090 down"
                  description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

You can run tests using docker:

docker run \
-v $PROJECT/testing:/tmp \
--entrypoint "/bin/promtool" prom/prometheus:v2.5.0 \
test rules /tmp/test.yml

promtool will validate if your alert InstanceDown from file alerts.yml was active. Advantage of this approach is that you don't have to start Prometheus.

like image 175
KrzyH Avatar answered Nov 01 '22 15:11

KrzyH