How can I automatically kill idle GCE instances based on CPU usage?

Question

I'm running some slightly unreliable software on some instances in an instance group. The software is installed and run by a startup script, and most of the time it works without issue, but about 10% of the new instances run out of memory and crash because of a memory leak in the software. I can't get this leak fixed myself, so in the meantime I've been checking the instances every few hours and killing any that show an idle CPU (the software normally consumes all available CPU power).

However, I'm using preemptible instances, and they can be killed off and restarted at any time, leaving dead instances running whenever I'm not actively monitoring them. After a day of leaving things unattended, I usually see ~80-85% CPU usage in the dashboard, the rest of which is wasted.

Is there any automated way I can kill off these dead instances? Restarting them is already handled by the instance group.

viswajithiii · Accepted Answer

The following worked for me. It's a bash script which uses the uptime UNIX command to check whether the 15-minute average load on the CPU is below a threshold, and automatically shuts down the system if this is true on ten consecutive checks. You need to run this within your VM instance.

Credit, and more detailed explanation: Rohit Rawat's blog.

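#!/bin/bash
# Shut the VM down once the 15-minute load average has stayed below
# the threshold for ten consecutive one-minute checks.
threshold=0.4

count=0
while true
do

  # The third value after "load average:" is the 15-minute average.
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # bc prints 1 when the comparison holds, 0 otherwise.
  res=$(echo "$load < $threshold" | bc -l)
  if (( res ))
  then
    echo "Idling.."
    ((count+=1))
  else
    # Load is back above the threshold, so the idle streak starts over.
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count >= 10 ))
  then
    echo "Shutting down"
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done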
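
The comparison in the script relies on bc, which may not be installed on every image. As a rough sketch of an alternative, assuming a Linux guest that exposes /proc/loadavg, the two helpers below read the 15-minute average and compare it using only awk. The names load15 and is_below are made up for illustration and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative, not from the original answer.

# Print the 15-minute load average, i.e. the third field of /proc/loadavg.
# An alternative file can be passed as $1, which is handy for testing.
load15() {
  awk '{ print $3 }' "${1:-/proc/loadavg}"
}

# Exit with status 0 when $1 is numerically less than $2, non-zero otherwise.
# Adding 0 forces awk to compare the values as numbers rather than strings.
is_below() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 < b + 0) }'
}
```

With these in place, the check inside the loop could be written as: if is_below "$(load15)" "$threshold"; then ((count+=1)); else count=0; fi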

Michael Aaron Safyan · Answer

It seems like there are two parts to this question:

Identifying dead instances.
Killing off those instances.

To identify dead instances, one option is a separate management instance that does not run this software and that keeps tabs on the others. For example, it could periodically send a health request to each instance and mark as unhealthy any that fail to respond or that report abnormal CPU usage (in this case, an idle CPU).

Once your management instance has identified the unhealthy instances, you should be able to reset them using the API (I'm guessing the reset command) or by running the same operation with the gcloud command-line tool.
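
As a rough sketch of what such a management script might look like, the bash functions below list running instances, read each one's 15-minute load average over SSH, and delete the idle ones. Several things here are assumptions rather than part of this answer: the NAME_PREFIX and THRESHOLD values, both function names, the use of SSH as the health check, and the choice of delete instead of reset (the question says the instance group already recreates missing instances). The filter expression may also need adjusting for your project.

```shell
#!/bin/bash
# Sketch only. NAME_PREFIX, THRESHOLD and both function names are assumptions
# made for illustration; they do not come from the original answer.
NAME_PREFIX="${NAME_PREFIX:-worker-}"
THRESHOLD="${THRESHOLD:-0.4}"

# Print the 15-minute load average of one instance, read over SSH.
# stdin is redirected so ssh cannot swallow the instance list being looped over.
remote_load15() {
  gcloud compute ssh "$1" --zone "$2" \
    --command "cut -d' ' -f3 /proc/loadavg" < /dev/null
}

# Delete each running, name-matched instance whose load is below THRESHOLD.
reap_idle() {
  gcloud compute instances list \
      --filter="status=RUNNING AND name~^${NAME_PREFIX}" \
      --format="value(name,zone.basename())" |
  while read -r name zone; do
    # Skip instances that cannot be reached; they may be mid-preemption.
    load=$(remote_load15 "$name" "$zone") || continue
    if awk -v a="$load" -v b="$THRESHOLD" 'BEGIN { exit !(a + 0 < b + 0) }'; then
      echo "Deleting idle instance $name (load $load)"
      gcloud compute instances delete "$name" --zone "$zone" --quiet < /dev/null
    fi
  done
}
```

Calling reap_idle from a cron job on the management instance, for example every 15 minutes, would replace the manual checks. The management instance needs permission to list and delete instances and to SSH into the workers.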

Tags:

google-compute-engine