
How to detect system ACPI G2/S5 Soft Off event with python on linux

I am working on an app using Google Compute Engine and would like to use preemptible instances.

I need my code to respond to the 30s warning Google gives via the ACPI G2 Soft Off signal that it sends when it is about to take away your VM, as described here: https://cloud.google.com/compute/docs/instances/preemptible.

How do I detect this event in my Python code that is running on the machine and react to it accordingly? (In my case I need to put the job the VM was working on back on a queue of open jobs so that a different machine can take it.)

asutherland asked Nov 30 '15 18:11


1 Answer

I am not answering the question directly, but I think that your actual intent is different:

  • The G2 power button event is generated by both preemption of a VM and the gcloud compute instances stop command (or the corresponding API, which it calls);
  • I am assuming that you want to react specially only on instance preemption.

Avoid a common misunderstanding

GCE does not send a "30s termination warning" with the power button event. It just sends the normal, honest power button soft-off event that immediately initiates shutdown of the system.

The "warning" part that comes with it is simple: “Here is your power button event, shut down the OS ASAP, because you have 30s before we pull the plug out of the wall socket. You've been warned!”

You have two system services that you can combine in different ways to get the desired behavior.

1. Use the fact that the system is shutting down upon ACPI G2

The most kosher (and, AFAIK, the only supported) way of handling the ACPI power button event is to let the system handle it, and execute what you want in the instance shutdown script. On a systemd-managed machine, the default GCP shutdown script is simply invoked by the ExecStop= command of a Type=oneshot service (see systemd.service(5)). The script is run relatively late in the shutdown sequence.

If you must ensure that the shutdown script is run after (or before) one of your services is sent a signal to terminate, you can modify the service dependencies. Things to keep in mind:

  • After and Before are reversed on shutdown: if X is started after Y, then it's stopped before Y.
  • The Before dependency ensures that the listed service is told to terminate before the shutdown script is run. It does not ensure that the service has already terminated.
  • The shutdown script is run when the google-shutdown-scripts.service is stopped as part of system shutdown.

With all that in mind, you can run sudo systemctl edit google-shutdown-scripts.service. This will create an empty configuration override file and open your $EDITOR, where you can put your After and Before dependencies, for example,

[Unit]
# Make sure that shutdown script is run (synchronously) *before* mysvc1.service is stopped.
After=mysvc1.service
# Make sure that mysvc2.service is sent a command to stop before the shutdown script is run
Before=mysvc2.service

You may specify as many After or Before clauses as you want, 0 or more of each. Read systemd.unit(5) for more information.

2. Use GCP metadata

There is an instance metadatum v1/instance/preempted. If the instance is preempted, its value is TRUE, otherwise it's FALSE.

GCP has thorough documentation on working with instance metadata. In short, there are two ways you can use this (or any other) metadata value:

  1. Query its value at any time, e.g. in the shutdown script. curl(1) equivalent:

    curl -sfH 'Metadata-Flavor: Google' \
      'http://169.254.169.254/computeMetadata/v1/instance/preempted'
    
  2. Run an HTTP request that will complete (200) when the metadatum changes. The only change that can ever happen to it is from FALSE to TRUE, as preemption is irreversible.

    curl -sfH 'Metadata-Flavor: Google' \
      'http://169.254.169.254/computeMetadata/v1/instance/preempted?wait_for_change=true'
    

Caveat: The metadata server may return a 503 response if it is temporarily unavailable (this is very rare, but it happens), so some retry logic is required. This is especially true for the long-running second form (with ?wait_for_change=true), as the pending request may return at any time with the code 503. Your code should be ready to handle this and restart the query. curl does not report the HTTP status code directly, but with the -f option it prints nothing on an HTTP error, so if you are scripting it, x=$(curl ....) yields an empty string in that case; your criterion for positive detection of preemption is then [[ $x == TRUE ]].
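
Since the question asks about Python, here is a rough sketch of the two curl forms and the retry caveat above, using only the standard library. The function names (build_request, query_preempted, wait_for_preemption) and the 1-second retry delay are my own choices for illustration, not anything GCP prescribes.

```python
import time
import urllib.request

# Same endpoint as in the curl commands above.
PREEMPTED_URL = "http://169.254.169.254/computeMetadata/v1/instance/preempted"


def build_request(wait_for_change=False):
    """Build the metadata request, optionally as a pending (long-poll) query."""
    url = PREEMPTED_URL + ("?wait_for_change=true" if wait_for_change else "")
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def query_preempted(wait_for_change=False, timeout=None):
    """Return True only if the value is TRUE, False otherwise, None on failure."""
    request = build_request(wait_for_change)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8").strip()
    except OSError:
        # HTTPError (e.g. 503), URLError and socket timeouts all derive from
        # OSError; treat them like curl -f's empty output.
        return None
    return body == "TRUE"


def wait_for_preemption(query=query_preempted, retry_delay=1.0, sleep=time.sleep):
    """Block until the value becomes TRUE, restarting the query when it fails."""
    while True:
        if query(wait_for_change=True):
            return
        # None (error) or False (returned without a change): pause, then retry.
        sleep(retry_delay)
```

Call query_preempted() from a shutdown path for the one-shot check, or run wait_for_preemption() in a background thread for the pending form. The query and sleep parameters exist only so the loop can be exercised without a real metadata server.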

Summary

  • If you want to detect that the VM is shutting down for any reason, use Google-provided shutdown script.
    • If you also need to distinguish whether the VM was in fact preempted, as opposed to gcloud compute instances stop <vmname> (which also sends the power button event!), query the preempted metadata in the shutdown script.
  • Run a pending HTTP request for metadata change, and react to it accordingly. This will complete successfully when VM is preempted only (but may complete with an error at any time too).
  • If the daemon that you run is your own, you can also directly query the preempted metadata from the code path which handles the termination signal, if you need to distinguish between different shutdown reasons.

It is not impossible that the real decision point is whether you have an "active job" that you want to return to the "queue", or not: if your service is requested to stop while holding an active job, just return it, regardless of the reason why you are being stopped. But I cannot comment on this, not knowing your actual design.

kkm answered Sep 19 '22 14:09