
Only thread handling io_service is waiting even though async I/O operations are pending

Boost's ASIO dispatcher seems to have a serious problem, and I can't seem to find a workaround. The symptom is that the only thread waiting to dispatch is left in pthread_cond_wait even though there are I/O operations pending that require it to block in epoll_wait.

I can most easily replicate this issue by having one thread call poll_one in a loop until it returns zero. This can leave the thread calling run stuck in pthread_cond_wait while the thread calling poll_one breaks out of the loop. Presumably, the io_service is expecting that thread to return to block in epoll_wait, but it's under no obligation to do so and that expectation seems fatal.
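The polling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code that produced the hang: drain_ready_handlers and FakeService are hypothetical names, and the template only assumes a poll_one() member returning std::size_t, which boost::asio::io_service provides.

```cpp
#include <cstddef>

// Calls poll_one() until it reports that no handler was ready.
// Works with any type exposing `std::size_t poll_one()`, such as
// boost::asio::io_service. The function name is hypothetical.
template <typename Service>
std::size_t drain_ready_handlers(Service& service)
{
  std::size_t total = 0;
  // poll_one() runs at most one ready handler without blocking and
  // returns how many handlers it ran (0 or 1).
  while (std::size_t ran = service.poll_one())
    total += ran;
  return total;
}

// Stand-in used only to exercise the loop without Boost (hypothetical).
struct FakeService
{
  std::size_t ready;  // number of handlers that are ready to run

  std::size_t poll_one()
  {
    if (ready == 0)
      return 0;
    --ready;
    return 1;
  }
};
```

Once the loop sees zero, the calling thread simply moves on. Nothing in this pattern commits it to coming back to service the reactor.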

Is there a requirement that threads be statically associated with io_services?

Here's an example showing the deadlock. This is the only thread handling this io_service because the others have moved on. There are definitely socket operations pending:

#0 pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 boost::asio::detail::posix_event::wait<boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex> > (...) at /usr/include/boost/asio/detail/posix_event.hpp:80
#2 boost::asio::detail::task_io_service::do_run_one (...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:405
#3 boost::asio::detail::task_io_service::run (...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:146

I believe the bug is as follows: if the thread that is blocking on the socket readiness check calls out to a dispatch function, and any other threads are blocked on the io_service, it must signal one of them. It currently signals only if there are handlers ready to run at that time, which leaves no thread checking for socket readiness.

David Schwartz asked Mar 30 '13 01:03


1 Answer

This is a bug. I have been able to duplicate it by adding a delay to the non-critical section of task_io_service::do_poll_one. Here is a snippet of the modified task_io_service::do_poll_one() in boost/asio/detail/impl/task_io_service.ipp. The only line added is the sleep.

std::size_t task_io_service::do_poll_one(mutex::scoped_lock& lock,
    task_io_service::thread_info& this_thread,
    const boost::system::error_code& ec)
{
  if (stopped_)
    return 0;

  operation* o = op_queue_.front();
  if (o == &task_operation_)
  {
    op_queue_.pop();
    lock.unlock();

    {
      task_cleanup c = { this, &lock, &this_thread };
      (void)c;

      // Run the task. May throw an exception. Only block if the operation
      // queue is empty and we're not polling, otherwise we want to return
      // as soon as possible.
      task_->run(false, this_thread.private_op_queue);
      boost::this_thread::sleep_for(boost::chrono::seconds(3));
    }

    o = op_queue_.front();
    if (o == &task_operation_)
      return 0;
  }

...

My test driver is fairly basic:

  • An asynchronous work loop via a timer that will print "." every 3 seconds.
  • Spawn off a single thread that will poll the io_service.
  • Delay to allow the new thread time to poll io_service, and have main call io_service::run() while the poll thread sleeps in task_io_service::do_poll_one().

Test code:

#include <iostream>

#include <boost/bind.hpp>
#include <boost/asio/io_service.hpp>
#include <boost/asio/steady_timer.hpp>
#include <boost/chrono.hpp>
#include <boost/thread.hpp>

boost::asio::io_service io_service;
boost::asio::steady_timer timer(io_service);

void arm_timer()
{
  std::cout << ".";
  std::cout.flush();
  timer.expires_from_now(boost::chrono::seconds(3));
  timer.async_wait(boost::bind(&arm_timer));
}

int main()
{
  // Add asynchronous work loop.
  arm_timer();

  // Spawn poll thread.
  boost::thread poll_thread(
    boost::bind(&boost::asio::io_service::poll, boost::ref(io_service)));

  // Give the poll thread time to service the reactor.
  boost::this_thread::sleep_for(boost::chrono::seconds(1));

  io_service.run();
}

And the debug:

[twsansbury@localhost bug]$ gdb a.out 
...
(gdb) r
Starting program: /home/twsansbury/dev/bug/a.out 

[Thread debugging using libthread_db enabled]
.[New Thread 0xb7feeb90 (LWP 31892)]
[Thread 0xb7feeb90 (LWP 31892) exited]

At this point, arm_timer() has printed "." once (when it was initially armed). The poll thread serviced the reactor in a non-blocking manner, then slept for 3 seconds while op_queue_ was empty (task_operation_ is added back to op_queue_ when task_cleanup c goes out of scope). While op_queue_ was empty, the main thread called io_service::run(), saw that op_queue_ was empty, and made itself the first_idle_thread_, where it waits on its wakeup_event. The poll thread finishes sleeping and returns 0, leaving the main thread waiting on wakeup_event.

After waiting ~10 seconds, plenty of time for arm_timer() to be ready, I interrupt the debugger:

Program received signal SIGINT, Interrupt.
0x00919402 in __kernel_vsyscall ()
(gdb) bt
#0  0x00919402 in __kernel_vsyscall ()
#1  0x0081bbc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00763b3d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3  0x08059dc2 in void boost::asio::detail::posix_event::wait<boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex> >(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&) ()
#4  0x0805a009 in boost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) ()
#5  0x0805a11c in boost::asio::detail::task_io_service::run(boost::system::error_code&) ()
#6  0x0805a1e2 in boost::asio::io_service::run() ()
#7  0x0804db78 in main ()

The side-by-side timeline is as follows:

          poll thread                  |          main thread
---------------------------------------+---------------------------------------
  lock()                               | 
  do_poll_one()                        |                          
  |-- pop task_operation_ from         |
  |   op_queue_                        |
  |-- unlock()                         |  lock()
  |-- create task_cleanup              |  do_run_one()
  |-- service reactor (non-block)      |  `-- op_queue_ is empty
  |-- ~task_cleanup()                  |      |-- set thread as idle
  |   |-- lock()                       |      `-- unlock()
  |   `-- op_queue_.push(              |
  |       task_operation_)             |
  `-- task_operation_ is               |
      op_queue_.front()                |
      `-- return 0                     |  // still waiting on wakeup_event
  unlock()                             |

As best I can tell, there are no side effects from patching:

if (o == &task_operation_)
  return 0;

to:

if (o == &task_operation_)
{
  if (!one_thread_)
    wake_one_thread_and_unlock(lock);
  return 0;
}
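To make the race and the effect of the patch easier to follow, here is a toy, single-threaded replay of the interleaving. It is a sketch under stated assumptions and not Asio's implementation: ToyScheduler, its members, and replay_leaves_runner_stuck are hypothetical names, the one_thread_ check is left out, and waking the idle thread is modelled as that thread immediately taking the task marker and going to block in the reactor.

```cpp
#include <deque>

// Toy model only; none of these names exist in Boost.Asio.
enum class Op { task_marker, handler };

struct ToyScheduler
{
  std::deque<Op> op_queue{Op::task_marker};
  bool runner_idle = false;      // run() thread parked on its wakeup event
  bool reactor_watched = false;  // some thread went on to block in the reactor

  // Poll thread: pops the task marker, then releases the lock.
  void poll_pops_marker() { op_queue.pop_front(); }

  // run() thread: takes the lock, sees an empty queue, parks itself.
  void runner_checks_queue()
  {
    if (op_queue.empty())
      runner_idle = true;
  }

  // Poll thread: cleanup pushes the marker back, then the poller sees the
  // marker at the front and returns. With `patched`, it first wakes an
  // idle runner, mirroring the proposed fix.
  void poll_finishes(bool patched)
  {
    op_queue.push_back(Op::task_marker);
    if (op_queue.front() == Op::task_marker && patched && runner_idle)
    {
      runner_idle = false;     // woken runner re-examines the queue,
      op_queue.pop_front();    // takes the task marker,
      reactor_watched = true;  // and blocks in the reactor
    }
  }
};

// Replays the timeline; returns true if the run() thread ends up parked
// with nobody watching the reactor.
bool replay_leaves_runner_stuck(bool patched)
{
  ToyScheduler s;
  s.poll_pops_marker();
  s.runner_checks_queue();
  s.poll_finishes(patched);
  return s.runner_idle && !s.reactor_watched;
}
```

In this model the unpatched replay ends with the runner idle and the task marker sitting in the queue, while the patched replay hands the marker to the runner.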

Regardless, I have submitted a bug and fix. Consider keeping an eye on the ticket for an official response.

Tanner Sansbury answered Nov 07 '22 15:11