Monitoring and alerting on pod status or restart with Google Container Engine (GKE) and Stackdriver

Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?

While I can see CPU, memory and disk usage metrics for all pods in Stackdriver, there seems to be no way of getting metrics about crashing pods or about pods in a replica set being restarted due to crashes.

I'm using a Kubernetes replica set to manage the pods, so when they crash they are respawned under a new name. As far as I can tell, the metrics in Stackdriver are keyed by pod name (which is unique only for the lifetime of the pod), which doesn't seem very sensible.

Alerting on pod failures seems like such a natural thing that it's hard to believe it isn't supported at the moment. As they stand, the monitoring and alerting capabilities that Stackdriver offers for Google Container Engine seem rather useless, since they are all bound to pods whose lifetime can be very short.

So if this doesn't work out of the box, are there known workarounds or best practices for monitoring continuously crashing pods?

asked May 04 '17 17:05

ctavan

2 Answers

There is a built-in metric now, so it's easy to build a dashboard and/or alert on it without setting up custom metrics:

Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
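To make this concrete, here is a minimal Python sketch of how that metric could be wired into an alert. It only builds a Cloud Monitoring filter string and a dictionary shaped like an alert policy in the Monitoring REST API; it does not call any API. The function names, the threshold of 3 restarts, the 10-minute window and the choice of grouping labels are all assumptions for illustration, not something from this answer. Grouping by namespace and container name rather than pod name is one way to keep the time series stable when a replica set recreates pods under new names.

```python
import json

RESTART_METRIC = "kubernetes.io/container/restart_count"
RESOURCE_TYPE = "k8s_container"


def restart_filter(cluster_name=None, namespace_name=None):
    """Build a Monitoring filter for the built-in restart counter."""
    clauses = [
        f'metric.type = "{RESTART_METRIC}"',
        f'resource.type = "{RESOURCE_TYPE}"',
    ]
    # Optional narrowing by k8s_container resource labels.
    if cluster_name:
        clauses.append(f'resource.labels.cluster_name = "{cluster_name}"')
    if namespace_name:
        clauses.append(f'resource.labels.namespace_name = "{namespace_name}"')
    return " AND ".join(clauses)


def restart_alert_policy(cluster_name, max_restarts=3, window_seconds=600):
    """Return a dict shaped like a Monitoring alert policy (hypothetical values)."""
    return {
        "displayName": f"Container restarts in {cluster_name}",
        "combiner": "OR",
        "conditions": [{
            "displayName": "restart_count increase",
            "conditionThreshold": {
                "filter": restart_filter(cluster_name),
                "aggregations": [{
                    # restart_count is cumulative, so take the increase per window.
                    "alignmentPeriod": f"{window_seconds}s",
                    "perSeriesAligner": "ALIGN_DELTA",
                    # Sum over pods so that renamed pods land in the same series.
                    "crossSeriesReducer": "REDUCE_SUM",
                    "groupByFields": [
                        "resource.label.namespace_name",
                        "resource.label.container_name",
                    ],
                }],
                "comparison": "COMPARISON_GT",
                "thresholdValue": max_restarts,
                "duration": "0s",
            },
        }],
    }


if __name__ == "__main__":
    print(json.dumps(restart_alert_policy("my-cluster"), indent=2))
```

The resulting JSON would still need to be checked against the current alert policy schema before being submitted; the same filter string can also be pasted into a dashboard chart.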

answered Oct 14 '22 10:10

dan carter

You can achieve this manually as follows.

In Logs Viewer, create the following filter:

resource.labels.project_id="<PROJECT_ID>"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="<NAMESPACE, or default>"
jsonPayload.message:"failed liveness probe"

Create a metric by clicking the Create Metric button above the filter input and filling in the details.
You can then track this metric in Stackdriver.
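The filter above can also be generated rather than typed by hand, which helps when the same log-based metric is needed for several clusters or namespaces. The following is a minimal sketch in Python; the function name and the example argument values are hypothetical, and the clauses are joined with an explicit AND instead of relying on whitespace.

```python
def liveness_failure_filter(project_id, cluster_name, namespace_name="default"):
    """Build the log filter for failed liveness probes described above."""
    clauses = [
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'resource.labels.namespace_name="{namespace_name}"',
        # ":" is the "has" (substring) operator in the logging filter syntax.
        'jsonPayload.message:"failed liveness probe"',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Example values are placeholders, not real resources.
    print(liveness_failure_filter("my-project", "my-cluster"))
```

The printed string can be pasted into the filter input, or, assuming the gcloud CLI is available, passed to `gcloud logging metrics create` through its `--log-filter` flag.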

Would be happy to be informed of a built-in metric instead of this.

answered Oct 14 '22 10:10

Jonathan Lin
