 

What happens when the Kubernetes master fails?

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?

According to the OpenShift 3 documentation (OpenShift is built on top of Kubernetes): https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html, if a master fails, nodes continue to function properly, but the system loses its ability to manage pods. Is this the same for vanilla Kubernetes?

asked Aug 26 '16 by David Newswanger

People also ask

What happens if master node fails?

The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc.

What happens if Kubernetes node fails?

Irrespective of the workload type (StatefulSet or Deployment), Kubernetes will automatically evict the pods on the failed node and then try to recreate new ones with the old volumes. If the node is back online within 5–6 minutes of the failure, Kubernetes will restart the pods, unmount, and re-mount the volumes.
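That roughly five-minute window comes from the default NoExecute tolerations that the DefaultTolerationSeconds admission plugin injects into every pod. A minimal sketch of how to inspect them (the pod name my-app-pod is hypothetical):

    # Show the injected default tolerations on a pod (pod name is an example)
    kubectl get pod my-app-pod -o yaml | grep -B3 tolerationSeconds
    # Typical output (added automatically at admission time):
    #   - effect: NoExecute
    #     key: node.kubernetes.io/not-ready
    #     operator: Exists
    #     tolerationSeconds: 300
    #   - effect: NoExecute
    #     key: node.kubernetes.io/unreachable
    #     operator: Exists
    #     tolerationSeconds: 300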

Can Kubernetes work without master node?

No. As you can read in the Kubernetes Components section of the documentation: master components provide the cluster's control plane.

What happens if etcd goes down?

The etcd cluster is considered failed if a majority of etcd members have permanently failed. After an etcd cluster failure, all running workloads might continue operating. However, due to etcd's role, Kubernetes cannot make any changes to its current state.
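One hedged way to observe this from a master node is to query etcd directly; the endpoint address and certificate paths below are kubeadm defaults and may differ in your setup:

    # Check etcd member status from a master node (paths assume kubeadm)
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint status --cluster -w table
    # If a majority of members are down, writes fail (e.g. "context deadline
    # exceeded") and Kubernetes can no longer persist state changes.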


1 Answer

In typical setups, the master nodes run both the API server and etcd, and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.

In the event that the masters, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for that period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc., until both of the following hold (quick checks for each are sketched after the list):

  1. Enough etcd instances are back online to form a quorum and make progress (for a visual explanation of how this works and what these terms mean, see this page).
  2. At least one API server can service requests.
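As a rough illustration of both conditions (quorum is floor(n/2) + 1, so a 3-member etcd cluster makes progress with 2 members up but stalls with only 1), the probes below are a sketch, not an exhaustive health check; the etcd endpoint list is an assumption and TLS flags are omitted for brevity:

    # Condition 1: etcd quorum (substitute your real member addresses)
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
      endpoint health
    # Condition 2: at least one API server answering requests
    kubectl get --raw='/readyz?verbose'   # per-check readiness report
    kubectl get --raw='/healthz'          # older aggregate health endpoint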

In a partially degraded state, the API server may be able to respond to requests that only read data.
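For example (a hypothetical failure state, assuming a Deployment named web exists), reads served from the API server's watch cache may still succeed while writes fail:

    kubectl get pods -A                     # may still return (possibly stale) data
    kubectl scale deploy/web --replicas=3   # writes need etcd and will fail,
    # typically with an error like "etcdserver: request timed out"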

However, in any case, life for applications will continue as normal unless nodes are rebooted or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc. should all continue to function for at least some time. Eventually, these things will all fail on different timescales. In single-master setups or during a complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable; see the CoreDNS cache plugin documentation). This is a good reason to consider a multi-master setup: DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
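To see the cache TTL in question, you can inspect the CoreDNS Corefile; the resource name coredns in kube-system is the kubeadm default and may differ elsewhere:

    kubectl -n kube-system get configmap coredns -o yaml | grep cache
    # A typical Corefile stanza looks like:
    #   cache 30    # cache DNS answers for up to 30 seconds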

There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would cause DNS queries, and in fact probably all pod and service networking, to fail until at least one master comes back online. Restarting the DNS pods or kube-proxy would be similarly harmful.

If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind, or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during API failure, as it routes traffic through the master(s).
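Here is a minimal sketch of such an experiment with kind; the node container names (kind-control-plane, kind-worker) are kind's defaults for this config, and the deployment name web is arbitrary:

    # Two-node cluster so the workload survives independently of the master
    cat <<EOF > kind-two-node.yaml
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
    - role: worker
    EOF
    kind create cluster --config kind-two-node.yaml
    kubectl create deployment web --image=nginx

    # Simulate total master failure by stopping the control-plane container
    docker stop kind-control-plane
    kubectl get pods                     # fails: the API is gone
    docker exec kind-worker crictl ps    # ...but nginx keeps running on the worker

    # Bring the master back and the cluster reassembles itself
    docker start kind-control-plane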

answered Sep 30 '22 by pnovotnak