I have this Ruby script to manage Que processes (Que doesn't support multiple processes; see discussion here):
#!/usr/bin/env ruby
cluster_size = 2
puts "starting Que cluster with #{cluster_size} workers"; STDOUT.flush
%w[INT TERM].each do |signal|
  trap(signal) do
    @pids.each { |pid| Process.kill(signal, pid) }
  end
end
@pids = []
cluster_size.to_i.times do |n|
  puts "Starting Que daemon #{n}"; STDOUT.flush
  @pids << Process.spawn("que --worker-count $MAX_THREADS")
end
Process.waitall
puts "Que cluster has shut down"; STDOUT.flush
The script has been working well for a couple months. The other day I found things in a state where the script was running, but both child processes were dead.
I tried to replicate this: I killed the children with various signals and had them raise exceptions. In all cases, the script noticed that the process had died and then exited itself.
How could the child process have died without the parent script knowing?
My guess is that the child process turned into a zombie and was missed by Process.waitall. Did you check if the child processes are zombies when it happens?
The zombie:
If you have zombie processes it means those zombies have not been waited for by their parent (check the PPID with ps -l). In the end you have three choices: fix the parent process (make it wait); kill the parent; or get over it.
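To see what a zombie looks like from the parent's side, here is a minimal Ruby sketch (POSIX only, assuming fork is available). The helper name process_entry_exists? is made up for illustration. It relies on signal 0, which only checks that a PID still exists, to show that an exited child keeps its process-table entry until the parent waits for it:

```ruby
# Hypothetical helper: true while the kernel still has a process-table entry
# for pid. This includes zombies that have exited but not been waited for.
def process_entry_exists?(pid)
  Process.kill(0, pid) # signal 0 performs the existence check only
  true
rescue Errno::ESRCH
  false
end

reader, writer = IO.pipe
pid = fork do
  reader.close
  exit!(0) # child exits right away; its copy of the pipe closes with it
end
writer.close
reader.read # blocks until EOF, i.e. until the child has exited
reader.close

before = process_entry_exists?(pid) # exited, but not reaped yet
Process.wait(pid)                   # the parent reaps the child
after = process_entry_exists?(pid)

puts "entry before wait: #{before}, after wait: #{after}"
```

Between the child's exit and the Process.wait call, ps would list the child in state Z with the script's PID as its PPID.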
Could you check your list of signals and trap it?
You can list all available signals (the output below is from Windows):
Signal.list
=> {"EXIT"=>0, "INT"=>2, "ILL"=>4, "ABRT"=>22, "FPE"=>8, "KILL"=>9, "SEGV"=>11, "TERM"=>15}
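Could you try to trap it, e.g. INT (note: you can have only one trap per signal):
# Note: recent Ruby versions reserve SEGV and raise ArgumentError if you try
# to trap it; substitute a trappable signal such as INT.
Signal.trap('SEGV') { throw :sigsegv }
catch :sigsegv do
  start_what_you_need
end
puts 'OMG! Got a SEGV!'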
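If the goal is simply to find out when a child dies, one signal worth trapping on POSIX systems is CHLD, which the kernel sends to the parent whenever a child changes state. This goes beyond what the question's script does, and reap_exited_children is a hypothetical helper name. A minimal sketch that only sets a flag in the handler and does the reaping outside of it:

```ruby
# Hypothetical helper: collect every child that has already exited, without
# blocking (with WNOHANG, wait2 returns nil when no child is ready).
def reap_exited_children
  reaped = []
  loop do
    pid, status = Process.wait2(-1, Process::WNOHANG)
    break if pid.nil?
    reaped << [pid, status]
  end
  reaped
rescue Errno::ECHILD
  reaped # no children left at all
end

child_changed = false
trap("CHLD") { child_changed = true } # keep the handler itself tiny

child_pid = fork { exit 7 } # stand-in for a real worker process

# A real supervisor would do useful work here instead of polling a flag.
deadline = Time.now + 5
sleep 0.05 until child_changed || Time.now > deadline

exits = reap_exited_children.map { |pid, status| [pid, status.exitstatus] }
exits.each { |pid, code| puts "child #{pid} exited with status #{code}" }
```

Logging each exit this way leaves a record of when a child died and why, which Process.waitall alone does not give you until every child is gone.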
Since your question is a general one, it is hard to give you a specific answer.
Zombies are not the only possible cause for this problem -- stopped children may not be reported for a variety of reasons.
The existence of a zombie typically means that the parent has not properly waited on it. The posted code looks OK, though, so unless there's a framework bug lurking somewhere I'd want to look beyond the zombie apocalypse to explain this problem.
In contrast to zombies, which have already exited and linger in the process table only until their parent reaps them, frozen processes are still alive with an intact parent but have stopped responding for some reason (waiting for an external process or I/O operation, memory problems, long or infinite looping, slow database operations, etc.).
On some platforms, Ruby can add a flag requesting return of stopped children that haven't been reported, using the following syntax:
Process.waitpid(pid, Process::WUNTRACED)
AFAIK waitall doesn't have a version that accepts flags, so you'd have to aggregate this yourself, or use pid = -1 to wait for any child process (the default if you omit pid) or pid = 0 to wait for any child with the same process group ID as the calling process.
See documentation here.
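One way to do that aggregation yourself is a loop around Process.wait2 with pid = -1 and the WUNTRACED flag. The sketch below assumes a POSIX platform where Process::WUNTRACED is defined; wait_all_reporting_stops is a hypothetical name, not part of Ruby:

```ruby
# Hypothetical stand-in for Process.waitall that also reports stopped
# children. Yields pid and status for each stop so the caller can react.
def wait_all_reporting_stops
  events = []
  loop do
    pid, status = Process.wait2(-1, Process::WUNTRACED)
    if status.stopped?
      events << [pid, :stopped]
      yield pid, status if block_given?
    else
      events << [pid, :finished]
    end
  end
rescue Errno::ECHILD
  events # raised once there are no children left to wait for
end

pid = fork { sleep 60 }   # stand-in for a worker
Process.kill("STOP", pid) # simulate a child that got frozen

events = wait_all_reporting_stops do |stopped_pid, status|
  puts "child #{stopped_pid} stopped by signal #{status.stopsig}"
  Process.kill("KILL", stopped_pid) # or "CONT" to resume it instead
end
```

Note that a stopped child still counts as a child, so if the block neither resumes nor kills it, the loop keeps blocking in wait2 just as waitall would.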