
Monitoring and alerting on pod status or restart with Google Container Engine (GKE) and Stackdriver

Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?

While I can see CPU, memory and disk usage metrics for all pods in Stackdriver there seems to be no way of getting metrics about crashing pods or pods in a replica set being restarted due to crashes.

I'm using a Kubernetes replica set to manage the pods, so they are respawned under a new name whenever they crash. As far as I can tell, the metrics in Stackdriver are keyed by pod name (which is unique for the lifetime of the pod), which doesn't seem sensible for pods that are replaced on every crash.

Alerting on pod failures seems like such a natural thing that it's hard to believe it isn't supported at the moment. The monitoring and alerting capabilities that I get from Stackdriver for Google Container Engine, as they stand, seem rather useless because they are all bound to pods whose lifetime can be very short.

So if this doesn't work out of the box, are there known workarounds or best practices for monitoring continuously crashing pods?

ctavan asked May 04 '17 17:05




2 Answers

There is a built-in metric now, so it's easy to build a dashboard and/or alert on it without setting up custom metrics:

Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
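
As a rough illustration of how that metric could feed an alert, here is a minimal sketch that builds a Monitoring filter and an alert policy body shaped like the Cloud Monitoring API's AlertPolicy resource. Only the metric and resource type come from the answer above; the function names, the cluster and namespace values, the 300-second window, the zero threshold and the choice of `ALIGN_DELTA` are assumptions for illustration. Because the filter does not mention a pod name, it should keep matching pods that a replica set recreates under new names.

```python
import json

# Metric and resource type as named in the answer above.
RESTART_METRIC = "kubernetes.io/container/restart_count"
RESOURCE_TYPE = "k8s_container"


def restart_filter(cluster_name, namespace_name=None):
    """Build a Monitoring filter that selects container restart counts.

    The cluster/namespace labels are assumed to be the ones you want to
    scope by; no pod name is used, so recreated pods still match.
    """
    parts = [
        f'metric.type="{RESTART_METRIC}"',
        f'resource.type="{RESOURCE_TYPE}"',
        f'resource.labels.cluster_name="{cluster_name}"',
    ]
    if namespace_name:
        parts.append(f'resource.labels.namespace_name="{namespace_name}"')
    return " AND ".join(parts)


def restart_alert_policy(cluster_name, namespace_name=None,
                         max_restarts=0, window_seconds=300):
    """Return an alert policy body as a plain dict (hypothetical helper).

    The condition fires when a container's restart count grows by more
    than max_restarts within one alignment window.
    """
    return {
        "displayName": f"Container restarts in {cluster_name}",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "restart_count increased",
                "conditionThreshold": {
                    "filter": restart_filter(cluster_name, namespace_name),
                    "aggregations": [
                        {
                            "alignmentPeriod": f"{window_seconds}s",
                            # Turn the cumulative counter into a per-window delta.
                            "perSeriesAligner": "ALIGN_DELTA",
                        }
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": max_restarts,
                    "duration": "0s",
                },
            }
        ],
    }


if __name__ == "__main__":
    # "my-cluster" and "default" are placeholder values.
    print(json.dumps(restart_alert_policy("my-cluster", "default"), indent=2))
```

The printed JSON is only a starting point: it would still need notification channels and would have to be submitted through the console, the API, or whatever tooling you use to manage alert policies.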
dan carter answered Oct 14 '22 10:10


You can achieve this manually with the following:

  1. In Logs Viewer, create the following filter:

    resource.labels.project_id="<PROJECT_ID>"
    resource.labels.cluster_name="<CLUSTER_NAME>"
    resource.labels.namespace_name="<NAMESPACE, or default>"
    jsonPayload.message:"failed liveness probe"

  2. Create a metric by clicking the Create Metric button above the filter input and filling in the details.

  3. You may now track this metric in Stackdriver.
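
The steps above can also be sketched in code. The snippet below only assembles the same filter text from parameters and wraps it in a body shaped like the Cloud Logging API's LogMetric resource (`name`, `description`, `filter`); it does not call any Google API. The function names, the metric name and the example project and cluster values are hypothetical.

```python
def liveness_failure_filter(project_id, cluster_name, namespace_name="default"):
    """Build the Logs Viewer filter from the steps above.

    Clauses on separate lines are combined with an implicit AND, and the
    ':' operator matches log entries whose message contains the text.
    """
    clauses = [
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'resource.labels.namespace_name="{namespace_name}"',
        'jsonPayload.message:"failed liveness probe"',
    ]
    return "\n".join(clauses)


def log_metric_body(name, log_filter):
    """Return a logs-based metric definition as a plain dict (hypothetical helper)."""
    return {
        "name": name,
        "description": "Counts log entries reporting failed liveness probes",
        "filter": log_filter,
    }


if __name__ == "__main__":
    # Placeholder project, cluster and metric names.
    flt = liveness_failure_filter("my-project", "my-cluster")
    print(flt)
    print(log_metric_body("liveness_probe_failures", flt))
```

Generating the filter this way makes it easier to keep one definition per cluster or namespace, but clicking through Create Metric as described gives the same result.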

Would be happy to be informed of a built-in metric instead of this.

Jonathan Lin answered Oct 14 '22 10:10