How can my process detect if the computer is shutting down?

Tags:

linux

amazon-ec2

shutdown

I'm running some applications on EC2 spot instances. Such instances can be killed by Amazon with no notice.

In the shutdown process, processes are killed in some order. We have monitoring/recovery programs that should behave differently based on whether the server is shutting down or the process just crashed. (Specifically, we don't want to do anything if the server is actually shutting down.)

How can I detect in the recovery process (if it is still alive) that processes were killed because of a shutdown?

(More system details: I'm running unknown/untrusted/etc. code in a sandbox that doesn't modify external state. Generally, if sandboxed code crashes, it is the fault of the author of the untrusted code and we will not rerun it. But if the sandboxed code is terminated because the VM is shutting down or failing, we need to rerun it on another instance. The problem I'm having right now is that the user's code is terminated first, so the monitoring program incorrectly believes the crash is a user error.)

341

asked May 21 '12 23:05

UsAaR33

2 Answers

agent

Run an agent on each machine and have it spawn the sandbox as child processes. The agent runs only your own "crash-proof" code, while the sandbox runs the user code that could crash.

The monitoring system, which is in charge of starting a new machine with a new sandbox process, checks what has been killed: both the agent and the sandbox process, or only the sandbox child process.

It does that by opening a TCP connection (RMI/RPC/HTTP) to the agent and querying it about its child processes. If the agent responds, the machine is still running and the agent can be asked about its child sandbox processes. If the agent does not respond, the machine is suspected of having been terminated.
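As a rough illustration of that check, here is a minimal sketch in Python. It only tests whether a TCP connection to the agent can be opened; the function names, the timeout, and the result labels are assumptions made for the example, not part of any existing tool. A real agent would also answer a query about its children over whatever protocol you choose.

```python
import socket


def agent_responds(host, port, timeout=2.0):
    """Return True if a TCP connection to the agent succeeds within `timeout`.

    A refused or timed-out connection only makes the machine *suspect*;
    it is not proof that the instance was terminated.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(agent_up, sandbox_alive):
    """Combine the two observations into a coarse verdict (hypothetical labels)."""
    if not agent_up:
        return "machine-suspect"   # agent gone too: rerun elsewhere
    if not sandbox_alive:
        return "sandbox-crashed"   # agent alive, child dead: likely user error
    return "ok"
```

The point of the split is that the monitor never has to guess from the sandbox's exit alone: the verdict depends on whether the crash-proof agent is still reachable.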

agent (variation)

The agent is also in charge of restarting the child sandbox process on the same VM in case it crashes.

lookup service

Use a lookup service (such as ZooKeeper) to keep track of which processes have sent heartbeat keep-alives. If the agent is alive, the machine is still running; if the agent is not alive, the machine is not running.
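The heartbeat idea can be sketched without any external service. The class below is a plain in-memory stand-in written for illustration; it is not ZooKeeper's API, and the class name, method names, and timeout value are all assumptions.

```python
import time


class HeartbeatRegistry:
    """In-memory record of the last heartbeat seen from each agent."""

    def __init__(self, timeout_s=10.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # silence longer than this means "not alive"
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def beat(self, agent_id):
        """Record a keep-alive from `agent_id` at the current time."""
        self._last_seen[agent_id] = self._clock()

    def is_alive(self, agent_id):
        """True if the agent has sent a heartbeat within the timeout window."""
        last = self._last_seen.get(agent_id)
        return last is not None and self._clock() - last <= self.timeout_s
```

Each agent would call `beat` periodically, and the monitor would consult `is_alive` before deciding whether a dead sandbox was a user crash or part of a machine going away.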

ec2 api

Poll the EC2 APIs to determine whether the machine is in the running or terminated state.
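A possible shape for that poll, assuming the boto3 library and its EC2 `describe_instances` call. The helper names and the rule for what counts as "going away" are assumptions for the sketch; the client is passed in so the code does not depend on how credentials or regions are configured.

```python
def instance_state(ec2_client, instance_id):
    """Return the EC2 state name (e.g. "running", "terminated") for one instance.

    `ec2_client` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns None if the instance is not present in the response.
    """
    resp = ec2_client.describe_instances(InstanceIds=[instance_id])
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            if instance["InstanceId"] == instance_id:
                return instance["State"]["Name"]
    return None


def machine_is_going_away(state):
    """Assumption: anything other than "pending" or "running" means shutdown."""
    return state not in ("pending", "running")
```

The monitor could call `instance_state` when a sandbox dies and rerun the job elsewhere only if `machine_is_going_away` is true. Note that the API may lag behind what is happening on the instance, so a short retry window is probably sensible.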

167

answered Oct 05 '22 13:10

itaifrenkel

How does your recovery process work?

If you're using waitpid to monitor the process, when it exits you can determine:

Whether it exited normally, and what status the process returned if it did, or
Whether it exited due to a signal, and what that signal was.

Depending on how the process is shut down, I'd expect to see it either exit normally or exit via SIGTERM or SIGKILL. SIGILL, SIGABRT, SIGFPE, SIGBUS, SIGSEGV, and SIGSYS would indicate a crash from a programming error.
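To make the `waitpid` approach concrete, here is a minimal sketch using Python's `os.waitpid` wrapper (Linux/Unix only). The grouping of signals follows the lists above; the function names and the result labels are assumptions for the example.

```python
import os
import signal

# Signals that suggest a programming error in the child.
CRASH_SIGNALS = {int(s) for s in (
    signal.SIGILL, signal.SIGABRT, signal.SIGFPE,
    signal.SIGBUS, signal.SIGSEGV, signal.SIGSYS,
)}
# Signals a child would typically receive while the system shuts down.
SHUTDOWN_SIGNALS = {int(signal.SIGTERM), int(signal.SIGKILL)}


def classify_status(status):
    """Map a raw waitpid() status to a (label, detail) pair."""
    if os.WIFEXITED(status):
        return "exited", os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        if sig in CRASH_SIGNALS:
            return "crashed", sig
        if sig in SHUTDOWN_SIGNALS:
            return "terminated", sig
        return "signaled", sig
    return "unknown", status


def run_and_classify(argv):
    """Fork, exec `argv` in the child, wait for it, and classify how it ended."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    _, status = os.waitpid(pid, 0)
    return classify_status(status)
```

A "terminated" result is only a hint: `SIGKILL` can also come from the kernel's out-of-memory killer or from an operator, so it is worth combining this with another signal that the machine itself is going down.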

answered Oct 05 '22 15:10

LnxPrgr3
