I'd like to monitor my containers using Prometheus and cAdvisor so that when a container restarts, I get an alert. I wonder if anyone has a sample Prometheus alert for this.
Container Restarts
A restarting container can indicate problems with memory (see the Out of Memory section), CPU usage, or simply an application exiting prematurely.
The default value is Always. The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes.
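For context, here is a minimal sketch of where that policy sits in a Pod manifest (the pod and container names are illustrative only):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # hypothetical name
spec:
  restartPolicy: Always        # the default; OnFailure and Never are the alternatives
  containers:
    - name: app
      image: nginx:1.25        # any image works for this illustration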
I used the following Prometheus alert rule to detect container restarts within an hour (the time window can be adjusted); it may be helpful for you.
- alert: PodRestart
  expr: rate(kube_pod_container_status_restarts_total[1h]) * 3600 > 1
  for: 5s
  labels:
    action_required: "true"
    severity: warning          # pick critical, warning, or info to suit your policy
  annotations:
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than once during the last hour."
    summary: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than once during the last hour."
rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. Also, the calculation extrapolates to the ends of the time range, allowing for missed scrapes or imperfect alignment of scrape cycles with the range's time period. The following example expression returns the per-second rate of HTTP requests as measured over the last 5 minutes, per time series in the range vector:
rate(http_requests_total{job="api-server"}[5m])
rate should only be used with counters. It is best suited for alerting, and for graphing of slow-moving counters.
Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
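As a PromQL illustration of that ordering (using the metric name from later in this thread):

# correct: take the per-series rate first, then aggregate
sum by (namespace) (rate(kube_pod_container_status_restarts_total[5m]))

# wrong: summing first merges the series, so rate() can no longer
# detect a counter reset in any individual series inside the sum
# rate(sum by (namespace) (kube_pod_container_status_restarts_total)[5m:])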
kube_pod_container_status_restarts_total
Metric Type: Counter
Labels/Tags: container=&lt;container-name&gt;, namespace=&lt;pod-namespace&gt;, pod=&lt;pod-name&gt;
Description: The number of container restarts per pod
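For example, a simple query against this counter (the threshold is illustrative; tune it to your needs):

# estimated number of restarts per container over the last hour;
# matches anything that restarted at all
increase(kube_pod_container_status_restarts_total[1h]) > 0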
If you are running on Kubernetes, you can deploy the kube-state-metrics container, which publishes the restart metric for pods: https://github.com/kubernetes/kube-state-metrics
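Once it is deployed, a minimal scrape-config sketch (the service name, namespace, and port are assumptions to check against your deployment; kube-state-metrics serves metrics on port 8080 by default):

# prometheus.yml
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]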
I use Compose and Swarm deployments, so Kubernetes answers are not an option for me. I came up with the following rules.
- alert: ContainerComposeTooManyRestarts
  expr: count by (instance, name) (count_over_time(container_last_seen{name!="", container_label_restartcount!=""}[15m])) - 1 >= 5
  for: 5m
  annotations:
    summary: "Too many restarts ({{ $value }}) for container \"{{ $labels.name }}\""

- alert: ContainerSwarmTooManyRestarts
  expr: count by (instance, container_label_com_docker_swarm_service_name) (count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!=""}[15m])) - 1 >= 5
  for: 5m
  annotations:
    summary: "Too many restarts ({{ $value }}) for container \"{{ $labels.container_label_com_docker_swarm_service_name }}\""
Basically, both work the same way: there are multiple time series for each service, differing only in their labels.
The Compose series are identical except for the container_label_restartcount label:
{instance="instance1",name="service1",container_label_restartcount="1",...}
{instance="instance1",name="service1",container_label_restartcount="2",...}
{instance="instance1",name="service1",container_label_restartcount="3",...}
Swarm looks a bit different, because a new container is created when the service is restarted (e.g. after a failed healthcheck). The name label changes on each restart, and container_label_com_docker_swarm_service_name acts as the service name:
{instance="instance1",name="service1.1.<hash1>",container_label_com_docker_swarm_service_name="service1",...}
{instance="instance1",name="service1.1.<hash2>",container_label_com_docker_swarm_service_name="service1",...}
{instance="instance1",name="service1.1.<hash3>",container_label_com_docker_swarm_service_name="service1",...}
So the idea is simply to count the unique series for each instance and name. I personally think that sending an alert for every single restart is wrong and not useful, so I chose to alert when there are more than 5 restarts over a 15m period. In my rules I picked the container_last_seen metric more or less arbitrarily; it actually doesn't matter which metric you use, because the counting is done by the difference in labels. We just need a metric that is always present. Also, note the - 1 at the end of the expression: we have to subtract 1 because we are counting unique series, and there is always at least one if your container is running.
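To see how the expression decomposes, a sketch with made-up sample values (the Compose variant, one restart in the window):

# inner part: one series per unique label set seen in the window,
# valued at the number of samples for that series
count_over_time(container_last_seen{name="service1"}[15m])
#   {..., container_label_restartcount="1"}  =>  some sample count
#   {..., container_label_restartcount="2"}  =>  some sample count

# outer part: count those series; 2 series = 1 original run + 1 restart
count by (instance, name) (count_over_time(container_last_seen{name="service1"}[15m]))  # => 2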
You may need to adapt this example for Swarm services with multiple replicas, but it shows the idea of counting unique label sets; see the sketch below.
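One possible adaptation, as a sketch only: extract the service-plus-slot prefix (e.g. service1.1) from the name label with label_replace, then count per replica. The replica label, the regex, and the subquery resolution are all assumptions to verify against your own series:

# count unique containers per replica slot (service1.1, service1.2, ...)
count by (instance, replica) (
  count_over_time(
    label_replace(
      container_last_seen{container_label_com_docker_swarm_service_name!=""},
      "replica", "$1", "name", "(.+\\.[0-9]+)\\..+"
    )[15m:]
  )
) - 1 >= 5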