
What's the recommended way to replace a bad GKE node instance?

Using gcloud container clusters resize, I can easily scale a cluster up and down. However I find no way to target a specific compute instance vm for removal when resizing down.

Scenario: Our Compute Engine logs indicate that one instance keeps failing to unmount a volume that belonged to a Kubernetes pod that is long gone. The cluster is appropriately sized, and the malfunctioning node still serves containers properly, but it is at maximum CPU load.

Obviously I'd want a new Kubernetes node to be ready before I kill off the old one. Is it safe to simply resize up and then delete the instance using gcloud compute, or is there some container-aware way to do this?

solsson asked Apr 19 '16 07:04


4 Answers

Based on the previous answers, I created this shell script:

#!/bin/bash

function gcp_delete_node_instance() {
  (
    local node="$1"
    set -e
    echo "Cordoning $node"
    kubectl cordon "$node"
    echo "Draining $node"
    kubectl drain "$node" --force --ignore-daemonsets
    zone="$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.gke\.io/zone}')"
    instance_group=$(gcloud compute instances describe --zone="$zone" --format='value[](metadata.items.created-by)' "$node")
    instance_group="${instance_group##*/}"
    echo "Deleting instance for node '$node' in zone '$zone' instance group '$instance_group'"
    gcloud compute instance-groups managed delete-instances --instances="$node" --zone="$zone" "$instance_group"
    echo "Deleting instance for node '$node' completed."
  )
}

gcp_delete_node_instance "$1"

GCP automatically creates a replacement node and adds it to the pool.

Lari Hotari answered Nov 18 '22 10:11


We use multi-zone clusters now, which means I needed a new way to get the instance group name. Current shell commands:

BAD_INSTANCE=[your node name from kubectl get nodes]

kubectl cordon $BAD_INSTANCE

kubectl drain $BAD_INSTANCE

gcloud compute instances describe --format='value[](metadata.items.created-by)' $BAD_INSTANCE

gcloud compute instance-groups managed delete-instances --instances=$BAD_INSTANCE --zone=[from describe output] [grp from describe output]
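
The zone and group name can also be extracted from the created-by value with shell parameter expansion instead of being copied by hand. A minimal sketch, assuming the value has the shape projects/<project>/zones/<zone>/instanceGroupManagers/<group>; the function names and the sample value are hypothetical and only serve as illustration:

```shell
#!/bin/bash

# Hypothetical helpers: split a "created-by" metadata value into its parts.
# Assumed input shape:
#   projects/<project>/zones/<zone>/instanceGroupManagers/<group>

# Print the last path component (the instance group name).
created_by_group() {
  local created_by="$1"
  printf '%s\n' "${created_by##*/}"
}

# Print the path component that follows "zones/".
created_by_zone() {
  local created_by="$1"
  local rest="${created_by#*zones/}"   # drop everything up to and including "zones/"
  printf '%s\n' "${rest%%/*}"          # keep only the text before the next "/"
}

# Made-up sample value, for illustration only.
sample='projects/123456789/zones/europe-west1-b/instanceGroupManagers/gke-example-default-pool-grp'
created_by_group "$sample"
created_by_zone "$sample"
```

The two results could then be passed as the group argument and the --zone flag of the delete-instances command above. If the value does not contain a zones/ segment, the zone helper will not return anything useful, so the output is worth checking before deleting anything.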
solsson answered Nov 18 '22 10:11


> However I find no way to target a specific compute instance vm for removal when resizing down.

There isn't a way to specify which VM to remove using the GKE API, but you can use the managed instance groups API to delete individual instances from the group. This shrinks your number of nodes by the number of instances that you delete, so if you want to replace the nodes, you will then want to scale your cluster back up to compensate. You can find the instance group name by running:

$ gcloud container clusters describe CLUSTER | grep instanceGroupManagers

> Is it safe to simply resize up and then delete the instance using gcloud compute, or is there some container-aware way to do this?

If you delete an instance, the managed instance group will replace it with a new one, so scaling up by one and then deleting the troublesome instance will leave you with an extra node. If you are not concerned about the temporary loss of capacity, you can just delete the VM and let it get recreated.

Before removing an instance, you can run kubectl drain to remove the workload from the instance. This results in faster rescheduling of pods than if you simply delete the instance and wait for the controllers to notice that it is gone.
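
The steps described above can be put in order as a rough sketch: scale up by one, drain the bad node, then delete it through the instance group so the group shrinks back to its original size. Everything here is an assumption for illustration: the function names, the placeholder cluster, zone, group and node values, and the idea of passing the current node count by hand. The commands are only printed unless DRY_RUN is set to 0, and a cluster with more than one node pool would also need --node-pool on the resize call.

```shell
#!/bin/bash

# Sketch only: replace one node while keeping capacity.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Arguments (all placeholders): cluster, zone, instance group, node, current node count.
replace_node() {
  local cluster="$1" zone="$2" group="$3" node="$4" current_size="$5"

  # 1. Add one node so the evicted pods have somewhere to go.
  run gcloud container clusters resize "$cluster" --zone="$zone" \
    --num-nodes="$((current_size + 1))" --quiet

  # 2. Evict the workload from the bad node (drain also cordons it).
  run kubectl drain "$node" --ignore-daemonsets

  # 3. Delete the VM through the managed instance group,
  #    which shrinks the group by one again.
  run gcloud compute instance-groups managed delete-instances "$group" \
    --instances="$node" --zone="$zone"
}

# Example call with made-up names; prints the three commands.
replace_node example-cluster europe-west1-b gke-example-default-pool-grp gke-example-node-1 3
```

The sketch does not check that the new node has become Ready before draining, so in practice you would want to confirm that with kubectl get nodes between steps 1 and 2.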

Robert Bailey answered Nov 18 '22 09:11


You can recreate the bad node with the following command:

gcloud compute instance-groups managed recreate-instances \
  --instances="$BAD_INSTANCE" \
  --zone="$ZONE" \
  "$INSTANCE_GROUP"

For example:

#!/bin/bash
# Usage: pass the node (instance) name as the first argument.
set -e

BAD_INSTANCE="$1"
# Both values are returned as full resource paths.
FULL_ZONE=$(gcloud compute instances describe --format='value[](zone)' "$BAD_INSTANCE")
FULL_INSTANCE_GROUP=$(gcloud compute instances describe --format='value[](metadata.items.created-by)' "$BAD_INSTANCE")
# Keep only the last path segment of each.
ZONE=${FULL_ZONE##*/}
INSTANCE_GROUP=${FULL_INSTANCE_GROUP##*/}

echo "Recreating node '$BAD_INSTANCE' from zone '$ZONE' in instance group '$INSTANCE_GROUP'"
# Pause before the destructive step (leaves time to abort with Ctrl-C).
sleep 10

gcloud compute instance-groups managed recreate-instances \
  --instances="$BAD_INSTANCE" \
  --zone="$ZONE" \
  "$INSTANCE_GROUP"
Emil Vikström answered Nov 18 '22 09:11