
Why do my google cloud compute instances always unexpectedly restart?

Help! Help! Help!

It is really annoying and I can hardly bear it anymore! I'm using Google Cloud Compute Engine instances, but they often restart unexpectedly without any advance notification. The restarts seem to happen randomly and I have no idea what is going wrong. I'm pretty sure the instances are busy (CPU usage > 50% and all GPUs in use) when a restart happens. Could anyone please tell me how to solve this problem? Thanks in advance!

ROBOT AI asked Jan 27 '18 10:01



2 Answers

The issue is right here:

all GPUs are in use

The official documentation on GPUs says:

GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.

This is because an instance with a GPU attached cannot be live-migrated to another host for maintenance the way other virtual machines are. To give the instance a physical GPU and bare-metal performance, Compute Engine uses GPU passthrough, which sadly means that if the host has to go through maintenance, the VM goes down with it.
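As a rough sketch of what "handling the maintenance event" could look like from inside the VM: the metadata server exposes a maintenance-event value that reads NONE until maintenance is announced. The function names, the 30-second poll interval, and the checkpoint command you pass in are my own placeholders, not something from the documentation quoted above.

```shell
#!/usr/bin/env bash
# Metadata server key that reports upcoming host maintenance.
MAINTENANCE_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Fetch the current maintenance-event value (NONE when nothing is scheduled).
# This only works from inside a Compute Engine VM.
get_maintenance_event() {
  curl -s --max-time 5 -H "Metadata-Flavor: Google" "$MAINTENANCE_URL"
}

# Return success (0) when the value signals pending maintenance.
# An empty value (for example, a failed request) counts as "not pending".
maintenance_pending() {
  case "$1" in
    ""|NONE) return 1 ;;
    *) return 0 ;;
  esac
}

# Poll until maintenance is announced, then run the command given as arguments.
# The command is a placeholder for your own checkpoint script.
watch_maintenance() {
  while true; do
    if maintenance_pending "$(get_maintenance_event)"; then
      echo "Host maintenance announced; saving checkpoint"
      "$@"
      break
    fi
    sleep 30
  done
}
```

You would start it alongside the workload with something like `watch_maintenance ./save_checkpoint.sh`, where `save_checkpoint.sh` is a hypothetical script of your own.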

DevopsTux answered Oct 04 '22 09:10



This sounds like a preemptible VM instance.

Preemptible instances function like normal instances, but have the following limitations:

  • Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
  • Compute Engine always terminates preemptible instances after they run for 24 hours.

To check whether your instance is preemptible with the gcloud CLI, run:

gcloud compute instances describe instance-name --format="(scheduling.preemptible)"

Result

scheduling:
  preemptible: false

Replace "instance-name" with the real name of your instance.

Or simply check in the UI: click on the compute instance and scroll down.

To check for system operations performed on your instance, you can review them with the following command:

gcloud compute operations list 
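
The operations list can get long, so here is a small sketch that labels each entry with a likely restart cause. The operation type names in the case statement are the ones I believe Compute Engine reports for preemption, host errors, and host maintenance; treat them as assumptions and compare them with your own output. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Map an operation type to a likely restart cause.
# The type names are assumptions; check them against your own operations list.
classify_operation() {
  case "$1" in
    compute.instances.preempted) echo "preempted" ;;
    compute.instances.hostError) echo "host error" ;;
    compute.instances.*OnHostMaintenance) echo "host maintenance" ;;
    *) echo "other" ;;
  esac
}

# Print time, target, and classified cause for each operation.
# Requires an authenticated gcloud CLI.
list_restart_causes() {
  gcloud compute operations list \
    --format="value(insertTime,operationType,targetLink.basename())" |
  while read -r when type target; do
    printf '%s\t%s\t%s\n' "$when" "$target" "$(classify_operation "$type")"
  done
}
```

Run `list_restart_causes` and look at the rows for your instance: a "preempted" label points to the preemptible explanation, while "host maintenance" points to the GPU explanation in the other answer.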
rkosegi answered Oct 04 '22 10:10
