How to make GCE instance stop when its deployed container finishes?

I have a Docker container that performs a single large computation. This computation requires lots of memory and takes about 12 hours to run.

I can create a Google Compute Engine VM of the appropriate size and use the "Deploy a container image to this VM instance" option to run this job perfectly. However, once the job is finished, the container exits but the VM keeps running (and keeps being billed).

How can I make the VM exit/stop/delete when the container exits?

When the VM is in this zombie state, only the Stackdriver agent containers are left running:

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
bfa2feb03180        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /u..."   17 hours ago        Up 17 hours                             stackdriver-logging-agent
161439a487c2        gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2    "/bin/sh -c /opt/s..."   17 hours ago        Up 17 hours         8000/tcp            stackdriver-metadata-agent

I create the VM like this:

gcloud beta compute --project=abc instances create-with-container vm-name \
                    --zone=us-central1-c --machine-type=custom-1-65536-ext \
                    --network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
                    --maintenance-policy=MIGRATE \
                    --service-account=xyz \
                    --scopes=https://www.googleapis.com/auth/cloud-platform \
                    --image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
                    --boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
                    --container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
                    --container-command=python3 \
                    --container-arg="a" --container-arg="b" --container-arg="c"

4 Answers

When you create the VM, you'll need to give its service account write access to Compute Engine so the instance can delete itself from within. You should also set container environment variables such as gce_zone and gce_project_id at this point; you'll need them to build the delete request.

gcloud beta compute instances create-with-container {NAME} \
    --container-image={IMAGE} \
    --container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
    --service-account={SERVICE_ACCOUNT} \
    --scopes=https://www.googleapis.com/auth/compute

Then within the container, whenever YOU determine your task is finished:

  1. Request an API access token from the metadata server (I'm using curl for simplicity, and the default GCE service account):
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

This responds with JSON that looks like:

  {
    "access_token": "foobarbaz...",
    "expires_in": 1234,
    "token_type": "Bearer"
  }
  2. Pass that access token to the instances.delete API endpoint (note the environment variables):
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
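The same two steps can also be done from Python using only the standard library. This is a minimal sketch. It assumes the container receives gce_project_id and gce_zone through --container-env, as in the command above, and that $HOSTNAME equals the instance name. The helper names are hypothetical.

```python
import json
import os
import urllib.request

TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def parse_access_token(body: str) -> str:
    """Extract access_token from the metadata server's JSON token response."""
    return json.loads(body)["access_token"]


def instance_delete_url(project_id: str, zone: str, name: str) -> str:
    """Build the instances.delete endpoint URL for one VM."""
    return ("https://www.googleapis.com/compute/v1/projects/"
            f"{project_id}/zones/{zone}/instances/{name}")


def delete_this_instance() -> None:
    # Step 1: request a token for the default service account.
    req = urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        token = parse_access_token(resp.read().decode("utf-8"))

    # Step 2: call instances.delete. gce_project_id and gce_zone come from
    # --container-env; HOSTNAME is assumed to be the instance name.
    url = instance_delete_url(os.environ["gce_project_id"],
                              os.environ["gce_zone"],
                              os.environ["HOSTNAME"])
    req = urllib.request.Request(url, method="DELETE",
                                 headers={"Authorization": f"Bearer {token}"})
    # urlopen raises HTTPError on a non-2xx status (e.g. missing permissions).
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(resp.status)
```

Call delete_this_instance() as the last thing your job does. The API returns an Operation, and the VM is torn down shortly afterwards.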
Having grappled with the problem for some time, here's a full solution that works pretty well.

This solution doesn't use the "Deploy a container image to this VM instance" option. Instead, it uses a startup script, which is more flexible. You still use a Container-Optimized OS (COS) instance.
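The order of operations in the bash startup script below can be sketched in Python. This is only to make the flow explicit and testable, not a replacement for the script. The function names are hypothetical, and --quiet is an addition here that skips gcloud's confirmation prompt. The key design point is that the delete command runs after docker run returns, whether or not the job succeeded.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[List[str]], object]


def zone_from_metadata(zone_path: str) -> str:
    """Turn 'projects/123456/zones/us-central1-c' into 'us-central1-c'."""
    return zone_path.rstrip("/").split("/")[-1]


def run_then_delete(image: str, params: Sequence[str], hostname: str,
                    zone_path: str, run: Runner = subprocess.run) -> List[List[str]]:
    """Run the job container to completion, then delete this VM.

    subprocess.run is called without check=True, so a failing job does not
    raise and the delete step still runs (the bash script has no `set -e`).
    """
    commands = [
        # Blocks until the container exits; logs go through the gcplogs driver.
        ["docker", "run", "--log-driver=gcplogs", image, *params],
        # gcloud is not installed on COS, so run it from the Cloud SDK image.
        ["docker", "run", "--entrypoint", "gcloud", "google/cloud-sdk:alpine",
         "compute", "instances", "delete", hostname, "--delete-disks=all",
         f"--zone={zone_from_metadata(zone_path)}", "--quiet"],
    ]
    for cmd in commands:
        run(cmd)
    return commands
```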

  1. Create a startup script:
#!/usr/bin/env bash

# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")

CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")

# This is needed if you are using a private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker

# Run! The logs will go to Stackdriver
sudo HOME=/home/root  docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}

# Get the zone
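zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# The value looks like projects/PROJECT_NUMBER/zones/ZONE:
# split on / and take the 4th element to get the actual zone name
ZONE=$(echo "$zoneMetadata" | cut -d/ -f4)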

# Run compute delete on the current instance. Need to run in a container 
# because COS machines don't come with gcloud installed 
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME}  --delete-disks=all --zone=${ZONE}
  2. Put the script somewhere public, for example on Cloud Storage with a public URL. You can't use a gs:// URI for a COS startup script.

  3. Start an instance with startup-script-url set, passing the image name and container parameters as metadata, e.g.:

gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME

(You probably want to limit the scopes; the example uses full access for simplicity.)
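All the startup-script inputs travel in the single comma-separated --metadata flag, which the script then reads back as image_name and container_param. If you launch instances from Python, a small helper can build that flag. This is a hypothetical sketch. Because gcloud splits --metadata on commas, the helper rejects values that contain a comma rather than trying to escape them. gcloud has its own alternative-delimiter syntax for that case, described in `gcloud topic escaping`.

```python
from typing import Mapping


def metadata_flag(items: Mapping[str, str]) -> str:
    """Join key=value pairs into one gcloud --metadata argument.

    Values may contain spaces (e.g. container_param), since the result is
    meant to be passed as a single argv element, not through a shell.
    """
    parts = []
    for key, value in items.items():
        if "," in value:
            # gcloud would split this value in two at the comma.
            raise ValueError(f"comma in metadata value for {key!r}")
        parts.append(f"{key}={value}")
    return "--metadata=" + ",".join(parts)
```

For example, metadata_flag({"image_name": "gcr.io/abc/my-image", "container_param": "PARAM1 PARAM2 PARAM3", "startup-script-url": "PUBLIC_SCRIPT_URL"}) can be placed directly into the argument list you pass to subprocess.run along with the other gcloud compute instances create flags.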

I wrote a self-contained Python function based on Vincent's answer.

def kill_vm():
    """If we are running inside a GCE VM, kill it."""
    # based on https://stackoverflow.com/q/52748332/321772
    import json
    import logging
    import requests

    # get the token
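    r = json.loads(
        requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
                     headers={"Metadata-Flavor": "Google"}).text)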

    token = r["access_token"]

    # get instance metadata
    # based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
    project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
                              headers={"Metadata-Flavor": "Google"}).text

    name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
                        headers={"Metadata-Flavor": "Google"}).text

    zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
                             headers={"Metadata-Flavor": "Google"}).text
    zone = zone_long.split("/")[-1]

    # shut ourselves down
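    logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))
    requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
                    .format(project_id=project_id, zone=zone, name=name),
                    headers={"Authorization": "Bearer {token}".format(token=token)})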
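The docstring of kill_vm says "If we are running inside a GCE VM", but the function itself never checks. Here is a minimal sketch of such a guard. It assumes that a metadata server answering with the Metadata-Flavor: Google response header means we are on GCE. The helper name is hypothetical. It uses the metadata server's link-local IP to avoid a DNS lookup, and it bypasses any HTTP proxy settings.

```python
import http.client
import urllib.request

# 169.254.169.254 is the GCE metadata server's link-local address.
METADATA_ROOT = "http://169.254.169.254/computeMetadata/v1/"

# An opener with an empty ProxyHandler ignores http_proxy and similar env vars.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def running_on_gce(url: str = METADATA_ROOT, timeout: float = 1.0) -> bool:
    """Return True if a GCE metadata server answers at `url`."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    try:
        with _OPENER.open(req, timeout=timeout) as resp:
            # The metadata server sets this header on its responses.
            return resp.headers.get("Metadata-Flavor") == "Google"
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, etc.: assume not GCE.
        return False
```

With this in place, calling `if running_on_gce(): kill_vm()` makes local runs a no-op.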

A simple atexit hook gets me my desired behavior:

import atexit
atexit.register(kill_vm)

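An atexit hook fires on every normal interpreter exit, including local test runs. It is not called when the process is killed by a signal Python doesn't handle, or when os._exit() is used. One way to avoid deleting a VM by accident is to gate the hook on an opt-in environment variable. This is a sketch. The variable name KILL_VM_ON_EXIT and the helper names are hypothetical, and you would set the variable only on the VM, e.g. with --container-env=KILL_VM_ON_EXIT=1.

```python
import atexit
import os
from typing import Callable


def make_exit_hook(kill_fn: Callable[[], None],
                   env_var: str = "KILL_VM_ON_EXIT") -> Callable[[], None]:
    """Wrap kill_fn so it only runs when env_var is set to "1"."""
    def hook() -> None:
        if os.environ.get(env_var) == "1":
            kill_fn()
    return hook


def register_kill_on_exit(kill_fn: Callable[[], None]) -> Callable[[], None]:
    """Register the guarded hook with atexit and return it."""
    hook = make_exit_hook(kill_fn)
    atexit.register(hook)
    return hook
```

Usage: register_kill_on_exit(kill_vm) at program start, in place of the bare atexit.register call.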
The simplest way, from within the container, once it's finished:

ZONE=$(gcloud compute instances list --filter="name=($HOSTNAME)" --format='csv[no-heading](zone)')

gcloud compute instances delete $HOSTNAME --zone=$ZONE -q

-q (--quiet) skips the interactive confirmation prompt.

$HOSTNAME is already exported inside the container.
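If the job is itself a Python program with gcloud available in the image, the same two commands can be issued through subprocess. This is a sketch with hypothetical helper names. It assumes exactly one instance has this name across zones. It also keeps only the last path segment of the zone, so it works whether gcloud prints a bare zone name or a full zone URL.

```python
import subprocess
from typing import List


def zone_lookup_cmd(hostname: str) -> List[str]:
    """argv for the instances list call that finds this VM's zone."""
    return ["gcloud", "compute", "instances", "list",
            f"--filter=name=({hostname})",
            "--format=csv[no-heading](zone)"]


def delete_cmd(hostname: str, zone: str) -> List[str]:
    """argv for a non-interactive instances delete call."""
    return ["gcloud", "compute", "instances", "delete", hostname,
            f"--zone={zone}", "-q"]


def parse_zone(output: str) -> str:
    """Extract a bare zone name from the list command's output."""
    lines = [line for line in output.splitlines() if line.strip()]
    if len(lines) != 1:
        raise RuntimeError(f"expected exactly one matching instance, got {len(lines)}")
    # Keep only the last path segment in case the zone is printed as a URL.
    return lines[0].strip().rsplit("/", 1)[-1]


def delete_self(hostname: str) -> None:
    out = subprocess.run(zone_lookup_cmd(hostname), capture_output=True,
                         text=True, check=True).stdout
    subprocess.run(delete_cmd(hostname, parse_zone(out)), check=True)
```

Call it as delete_self(os.environ["HOSTNAME"]) once the work is done. Passing argument lists without a shell means the parentheses in the filter need no quoting.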

