I'm reading through Jesse Storimer's excellent book, Working with Unix Processes. In a section on trapping signals from child processes which have exited, he gives a code sample.
I have modified that code slightly (see below) to get a bit more visibility into what's happening: I added some puts statements, and I can see that wait executes for multiple children in one trap statement (sometimes I get "Received a CHLD signal" once, followed by multiple "child pid exited" lines). Normally the output from the code below looks something like:
parent is working hard
Received a CHLD signal
child pid 73408 exited
parent is working hard
parent is working hard
parent is working hard
Received a CHLD signal
child pid 73410 exited
child pid 73409 exited
All children exited - parent exiting too.
But once in a while I get an error like this:
trapping_signals.rb:17:in `write': deadlock; recursive locking (ThreadError)
from trapping_signals.rb:17:in `puts'
from trapping_signals.rb:17:in `puts'
from trapping_signals.rb:17:in `block in <main>'
from trapping_signals.rb:17:in `call'
from trapping_signals.rb:17:in `write'
from trapping_signals.rb:17:in `puts'
from trapping_signals.rb:17:in `puts'
from trapping_signals.rb:17:in `block in <main>'
from trapping_signals.rb:40:in `call'
from trapping_signals.rb:40:in `sleep'
from trapping_signals.rb:40:in `block in <main>'
from trapping_signals.rb:38:in `loop'
from trapping_signals.rb:38:in `<main>'
Can anyone explain to me what's going wrong here?
child_processes = 3
dead_processes = 0

# We fork 3 child processes.
child_processes.times do
  fork do
    # Each sleeps between 0 and 4 seconds (rand(5) returns an integer in 0..4).
    sleep rand(5)
  end
end

# Our parent process will be busy doing some work,
# but it still wants to know when one of its children exits.
# By trapping the :CHLD signal our process will be notified by the kernel
# when one of its children exits.
trap(:CHLD) do
  puts "Received a CHLD signal"
  # Since Process.wait queues up any data that it has for us, we can ask for it
  # here, since we know that one of our child processes has exited.
  # We loop over a non-blocking Process.wait to ensure that any dead child
  # processes are accounted for.
  # Here we wait without blocking.
  while pid = Process.wait(-1, Process::WNOHANG)
    puts "child pid #{pid} exited"
    dead_processes += 1
    # We exit ourselves once all the child processes are accounted for.
    if dead_processes == child_processes
      puts "All children exited - parent exiting too."
      exit
    end
  end
end

# Work it.
loop do
  puts "parent is working hard"
  sleep 1
end
I looked through the Ruby sources to see where that particular error is raised, and it is only ever raised when the current thread tries to acquire a lock that it already holds. In other words, the lock is not re-entrant:
m = Mutex.new
m.lock
m.lock #=> same error as yours
Now at least we know what happens, but not yet why or where. The error message indicates that it happens during the call to puts. When puts gets called, it eventually ends up in io_binwrite. stdout is not synchronized, but it is buffered, so the if condition there is fulfilled on the first call, and a buffer plus a write lock for that buffer are set up. The write lock is important to guarantee the atomicity of writes to stdout: it should never happen that two threads writing to stdout simultaneously mix up each other's output. To demonstrate what I mean:
t1 = Thread.new { 100.times { print "aaaaa" } }
t2 = Thread.new { 100.times { print "bbbbb" } }
t1.join
t2.join
Although both threads take turns writing to stdout, a single write is never broken up - you will always see the full 5 a's or b's in sequence. That's what the write lock is there for.
Now, what goes wrong in your case is a race condition on that write lock. The parent process loops and writes to stdout every second ("parent is working hard"). But the same thread also eventually executes the trap block and tries to write to stdout again ("Received a CHLD signal"). You can verify that it really is the same thread by adding #{Thread.current} to your puts statements. If those two events happen closely enough together, you get the same situation as in the first example: the same thread trying to obtain the same lock twice, and this ultimately triggers the error.