We've been using asio in production for years, and recently we reached the point where our servers became loaded just enough to expose a mysterious issue.
In our architecture, each entity that runs independently uses its own strand object. Some of the entities can perform long work (reading from a file, performing a MySQL request, etc.). Naturally, the work is performed within handlers wrapped with the strand. It all sounds nice and should work flawlessly, until we began to notice impossible things: timers expiring seconds after they should even though threads were 'waiting for work', and work being halted for no apparent reason. It looked as if long work performed inside one strand had an impact on other, unrelated strands; not all of them, but most.
Countless hours were spent pinpointing the issue. The trail led to the way the strand object is created: strand_service::construct (here).
For some reason the developers decided to have a limited number of strand implementations, meaning that some totally unrelated objects end up sharing a single implementation and are therefore bottlenecked by each other.
The standalone (non-Boost) asio library uses a similar approach, except that instead of shared implementations, each implementation is independent but may share a mutex object with other implementations (here).
What is this all about? I have never heard of a limit on the number of mutexes in a system, or of any notable overhead related to their creation and destruction. Even if there were, that problem could easily be solved by recycling mutexes instead of destroying them.
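For reference, the scheme described above boils down to something like the following sketch (illustrative only, not the actual Boost source; the real strand_service::construct differs in its locking and hashing details, and the pool size of 193 is the default mentioned in the answer below):

// Simplified sketch of a "fixed pool of shared strand implementations"
// scheme, as described above. Illustrative only.
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>

struct strand_impl_sketch
{
    std::mutex mutex_; // serializes the handlers of *every* strand mapped here
    // ... queued handlers would live here ...
};

class strand_pool_sketch
{
public:
    // 193 is the default pool size reported in the answer below.
    enum { num_implementations = 193 };

    // Every new strand object is mapped onto one of the pooled
    // implementations, so two unrelated strands can land on the same slot
    // and end up serializing each other's handlers.
    strand_impl_sketch & construct(void const * strand_address)
    {
        std::size_t index =
            std::hash<void const *>()(strand_address) % num_implementations;
        std::lock_guard<std::mutex> lock(pool_mutex_);
        if (!implementations_[index])
            implementations_[index].reset(new strand_impl_sketch);
        return *implementations_[index];
    }

private:
    std::mutex pool_mutex_;
    std::unique_ptr<strand_impl_sketch> implementations_[num_implementations];
};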
Here is a minimal test case that shows how dramatic the performance degradation is:
#include <boost/asio.hpp>

#include <atomic>
#include <functional>
#include <iostream>
#include <memory>   // std::shared_ptr, std::unique_ptr
#include <thread>
#include <vector>   // std::vector of worker threads

#include <unistd.h> // sleep()

std::atomic<bool> running{true};
std::atomic<int> counter{0};
// Non-blocking work: each instance owns its own strand and keeps
// re-posting itself through that strand.
struct Work
{
    Work(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<Work> _this(new Work(io_service));
        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<Work> _this)
    {
        counter.fetch_add(1, std::memory_order_relaxed);
        if (running.load(std::memory_order_relaxed)) {
            start_the_work(_this->_strand.get_io_service());
        }
    }

    boost::asio::strand _strand;
};
// Blocking work: its handler sleeps for 5 seconds while holding its
// (possibly shared) strand implementation.
struct BlockingWork
{
    BlockingWork(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<BlockingWork> _this(new BlockingWork(io_service));
        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<BlockingWork> _this)
    {
        sleep(5);
    }

    boost::asio::strand _strand;
};
int main(int argc, char ** argv)
{
    boost::asio::io_service io_service;
    std::unique_ptr<boost::asio::io_service::work> work{new boost::asio::io_service::work(io_service)};

    for (std::size_t i = 0; i < 8; ++i) {
        Work::start_the_work(io_service);
    }

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < 8; ++i) {
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
    }

    if (argc > 1) {
        std::cout << "Spawning a blocking work" << std::endl;
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
        BlockingWork::start_the_work(io_service);
    }

    sleep(5);
    running = false;
    work.reset();

    for (auto && worker : workers) {
        worker.join();
    }

    std::cout << "Work performed:" << counter.load() << std::endl;
    return 0;
}
Build it using this command:
g++ -o asio_strand_test_case -pthread -I/usr/include -std=c++11 asio_strand_test_case.cpp -lboost_system
A test run in the usual way:
time ./asio_strand_test_case
Work performed:6905372
real 0m5.027s
user 0m24.688s
sys 0m12.796s
A test run with the long blocking work:
time ./asio_strand_test_case 1
Spawning a blocking work
Work performed:770
real 0m5.031s
user 0m0.044s
sys 0m0.004s
The difference is dramatic. What happens is that each new non-blocking work creates a new strand object, until one of them ends up sharing an implementation with the strand of the blocking work. When that happens it is a dead end until the long work finishes.
Edit:
Reduced the amount of parallel work down to the number of worker threads (from 1000 to 8) and updated the test run output. I did this because the issue is more visible when the two numbers are close.
Well, an interesting issue and +1 for giving us a small example reproducing the exact issue.
The problem you are having, as I understand it, with the Boost implementation is that by default it instantiates only a limited number of strand_impl objects: 193, as I see in my version of Boost (1.59).
Now, what this means is that a large number of requests will be in contention, as they will be waiting for the lock to be released by another handler that happens to use the same instance of strand_impl.
My guess for why it is done this way is to avoid overloading the OS by creating lots and lots of mutexes, which would be bad. The current implementation allows the locks to be reused (and in a configurable way, as we will see below).
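To get a feel for how quickly unrelated strands start colliding on the same slot, here is a back-of-the-envelope sketch of my own (assuming each strand is hashed uniformly and independently into the 193 pooled implementations):

// Probability that at least two of k strands land in the same slot of a
// pool of 193 implementations (classic birthday-problem estimate).
#include <cstdio>

int main()
{
    const double slots = 193.0;
    const int ks[] = {8, 16, 32, 64};
    for (int k : ks) {
        double all_distinct = 1.0;
        for (int i = 0; i < k; ++i)
            all_distinct *= (slots - i) / slots;
        std::printf("%2d strands -> collision probability %.0f%%\n",
                    k, 100.0 * (1.0 - all_distinct));
    }
    return 0;
}

Under that uniform-hashing assumption a collision becomes more likely than not at around 17 strands, which helps explain why seemingly unrelated strands end up waiting on each other.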
In my setup:
MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -O2 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:489696
real 0m5.016s
user 0m1.620s
sys 0m4.069s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:188480
real 0m5.031s
user 0m0.611s
sys 0m1.495s
Now, there is a way to change this number of cached implementations by setting the macro BOOST_ASIO_STRAND_IMPLEMENTATIONS.
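As a side note (my assumption, not something shown in the answer): since this is an ordinary preprocessor macro, it can also be set in code, as long as the definition appears before the first asio include and every translation unit sees the same value:

// Assumption: the macro must be visible before the first Boost.Asio include,
// and every translation unit of the program must use the same value.
#define BOOST_ASIO_STRAND_IMPLEMENTATIONS 1024
#include <boost/asio.hpp>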
Below is the result I got after setting it to a value of 1024:
MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -DBOOST_ASIO_STRAND_IMPLEMENTATIONS=1024 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:450928
real 0m5.017s
user 0m2.708s
sys 0m3.902s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:458603
real 0m5.027s
user 0m2.611s
sys 0m3.902s
Almost the same for both cases! You might want to adjust the value of the macro as per your needs to keep the deviation small.