 

What causes this deadlock in my Ruby `trap` block?

I'm reading through Jesse Storimer's excellent book, Working with Unix Processes. In a section on trapping signals from child processes which have exited, he gives a code sample.

I have modified that code slightly (see below) to get a bit more visibility into what's happening, specifically:

  • That the parent resumes its own execution between signals (which I can see from its puts output),
  • That a single trap invocation can wait on multiple children (sometimes I see "Received a CHLD signal" once, followed by multiple "child pid ... exited" lines).

Expected Output

Normally the output from the code below looks something like:

parent is working hard
Received a CHLD signal
child pid 73408 exited
parent is working hard
parent is working hard
parent is working hard
Received a CHLD signal
child pid 73410 exited
child pid 73409 exited
All children exited - parent exiting too.

The occasional error

But once in a while I get an error like this:

trapping_signals.rb:17:in `write': deadlock; recursive locking (ThreadError)
    from trapping_signals.rb:17:in `puts'
    from trapping_signals.rb:17:in `puts'
    from trapping_signals.rb:17:in `block in <main>'
    from trapping_signals.rb:17:in `call'
    from trapping_signals.rb:17:in `write'
    from trapping_signals.rb:17:in `puts'
    from trapping_signals.rb:17:in `puts'
    from trapping_signals.rb:17:in `block in <main>'
    from trapping_signals.rb:40:in `call'
    from trapping_signals.rb:40:in `sleep'
    from trapping_signals.rb:40:in `block in <main>'
    from trapping_signals.rb:38:in `loop'
    from trapping_signals.rb:38:in `<main>'

Can anyone explain to me what's going wrong here?

The code

child_processes = 3
dead_processes = 0

# We fork 3 child processes.
child_processes.times do
  fork do
    # Each child sleeps for a random whole number of seconds (0-4)
    sleep rand(5)
  end
end

# Our parent process will be busy doing some work.
# But still wants to know when one of its children exits.

# By trapping the :CHLD signal our process will be notified by the kernel
# when one of its children exits.
trap(:CHLD) do
  puts "Received a CHLD signal"
  # The exit information for dead children is queued up for us, so we can
  # ask Process.wait for it here, now that we know at least one child exited.

  # We loop over a non-blocking Process.wait (WNOHANG) to make sure that
  # every dead child is accounted for, even if several exited at once.
  while pid = Process.wait(-1, Process::WNOHANG)
    puts "child pid #{pid} exited"
    dead_processes += 1

    # We exit ourselves once all the child processes are accounted for.
    if dead_processes == child_processes
      puts "All children exited - parent exiting too."
      exit
    end
  end
end

# Work it.
loop do
  puts "parent is working hard"
  sleep 1
end
asked May 18 '12 by Nathan Long

1 Answer

I looked through the Ruby sources to see where that particular error is raised, and it is raised only when a thread tries to acquire a lock that it already holds. In other words, the locking is not re-entrant:

m = Mutex.new
m.lock
m.lock #=> same error as yours

Now at least we know what happens, but not yet why or where. The error message indicates that it happens during the call to puts, which eventually ends up in io_binwrite. stdout is not synchronized, but it is buffered, so on the first write a buffer is set up together with a write lock for that buffer. The write lock guarantees the atomicity of writes to stdout: two threads writing to stdout at the same time must not be able to mix up each other's output. To demonstrate what I mean:

t1 = Thread.new { 100.times { print "aaaaa" } }
t2 = Thread.new { 100.times { print "bbbbb" } }
t1.join
t2.join

Although both threads take turns writing to stdout, a single write is never broken up - you always get the full run of five a's or b's in sequence. That's what the write lock is there for.

Now, what goes wrong in your case is a race condition on that write lock. The parent process loops and writes to stdout every second ("parent is working hard"). But the trap block is also executed by that same thread, and it too writes to stdout ("Received a CHLD signal"). You can verify that it really is the same thread by adding #{Thread.current} to your puts statements. If the CHLD signal arrives while the main thread is in the middle of its own puts - that is, while it still holds the write lock - then the trap block's puts tries to take the same lock again on the same thread. That is exactly the situation from the first example: one thread trying to obtain a lock it already holds, which triggers the error.
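
As a quick illustration of that verification step (a sketch, not part of the original program), tagging both puts calls with the current thread shows the same Thread object in both places:

trap(:CHLD) do
  # The handler runs on the main thread, interrupting whatever it was doing.
  puts "Received a CHLD signal in #{Thread.current}"
  # ... rest of the handler unchanged ...
end

loop do
  puts "parent is working hard in #{Thread.current}"
  sleep 1
end

Both lines print the same Thread object, confirming that the signal handler does not get a thread of its own; it borrows the main thread, including any lock that thread happens to be holding at the moment the signal arrives.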

answered by emboss