I'm running some applications on EC2 spot instances. Such instances can be killed by Amazon with no notice.
During shutdown, processes are killed in some order. We have monitoring/recovery programs that should behave differently depending on whether the server is shutting down or the process just crashed. (Specifically, we don't want to do anything if the server is actually shutting down.)
How can I detect in the recovery process (if it is still alive) that processes were killed because of a shutdown?
(More system details: I'm running unknown/untrusted/etc. code in a sandbox that doesn't modify external state. Generally, if sandboxed code crashes, it is the fault of the author of the untrusted code and we will not rerun it. But if the sandboxed code is terminated because the VM is shutting down or failing, we need to rerun it on another instance. The problem I'm having right now is that the user's code is terminated first, so the monitoring program incorrectly believes the crash is a user error.)
Run an agent on each machine that spawns sandbox child-processes. The agent runs your code that is "crash proof", and the sandbox code runs user code which could crash.
The monitoring system that is in charge of starting a new machine with a new sandbox process checks which processes have been killed (both the agent and sandbox process or only the sandbox child process).
It does that by opening a TCP connection (RMI/RPC/HTTP) to the agent and querying it about its child processes. If the agent responds, the machine is still running, and the agent can be asked about its child sandbox processes. If the agent does not respond, the machine is suspected of having been terminated.
The agent is also in charge of restarting the child sandbox process on the same VM in case it crashes.
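The agent probe described above could look roughly like the following. This is only a sketch: the function names, the status strings, and the choice of a bare TCP connect (rather than a full RMI/RPC/HTTP query) are assumptions made for illustration.

```python
import socket


def agent_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the agent succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the machine as suspect.
        return False


def classify_failure(agent_reachable: bool, sandbox_alive: bool) -> str:
    """Map the two observations onto the cases the monitoring system cares about."""
    if not agent_reachable:
        # Agent is gone too, so the machine itself may have been terminated.
        return "machine-suspect"
    if not sandbox_alive:
        # Agent still answers, so only the sandbox child died.
        return "sandbox-crashed"
    return "healthy"
```

In this sketch the monitoring system would rerun the job on another instance only for "machine-suspect", and would treat "sandbox-crashed" as a failure of the user code.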
Use a look-up service (such as ZooKeeper) to keep track of which processes have sent heartbeat keep-alives. If the agent is alive, the machine is still running; if the agent is not alive, the machine is not running.
Poll the EC2 APIs to determine if the machine is in running or terminated state.
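As a rough illustration of the EC2 polling option, the helper below inspects a response shaped like the one returned by boto3's `describe_instances` call. The helper names, the `GONE_STATES` set, and the decision to treat a missing instance as gone are assumptions for this sketch, not part of any AWS API.

```python
from typing import Optional

# Assumption: any of these states means the sandboxed job will not resume here.
GONE_STATES = {"shutting-down", "terminated", "stopping", "stopped"}


def instance_state(response: dict, instance_id: str) -> Optional[str]:
    """Return the state name for `instance_id`, or None if it is not in the response."""
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("InstanceId") == instance_id:
                return instance["State"]["Name"]
    return None


def should_rerun_elsewhere(response: dict, instance_id: str) -> bool:
    """True if the instance is shutting down, stopped, terminated, or no longer listed."""
    state = instance_state(response, instance_id)
    return state is None or state in GONE_STATES
```

With boto3 installed and credentials configured, the response would come from something like `boto3.client("ec2").describe_instances(InstanceIds=[instance_id])`. The monitoring system could call `should_rerun_elsewhere` before deciding that a dead sandbox was the user's fault.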
How does your recovery process work?
If you're using waitpid to monitor the process, when it exits you can determine whether it exited normally or was terminated by a signal, and if so, which signal.
Depending on how the process is shut down, I'd expect to see it either exit normally or exit via SIGTERM or SIGKILL. SIGILL, SIGABRT, SIGFPE, SIGBUS, SIGSEGV, and SIGSYS would indicate a crash from a programming error.
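The signal-based classification above can be sketched with Python's wrappers around waitpid. The function name and the returned labels are made up for illustration; the signal groupings follow the lists given above.

```python
import os
import signal

# Signals a shutdown sequence would typically send to remaining processes.
SHUTDOWN_SIGNALS = {signal.SIGTERM, signal.SIGKILL}
# Signals that usually point to a programming error in the child.
CRASH_SIGNALS = {
    signal.SIGILL, signal.SIGABRT, signal.SIGFPE,
    signal.SIGBUS, signal.SIGSEGV, signal.SIGSYS,
}


def classify_wait_status(status: int) -> str:
    """Classify the raw status returned by os.waitpid()."""
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
        return "exited-ok" if code == 0 else f"exited-with-code-{code}"
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        if sig in SHUTDOWN_SIGNALS:
            return "killed-externally"
        if sig in CRASH_SIGNALS:
            return "crashed"
        return "killed-by-other-signal"
    return "unknown"
```

The monitor would call `pid, status = os.waitpid(child_pid, 0)` and pass `status` to `classify_wait_status`. A "killed-externally" result is only a hint that the machine is going down, since SIGKILL can also come from other sources such as the kernel's out-of-memory killer.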