Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++: Thread pool slower than single threading?

First of all I did look at the other topics on this website and found they don't relate to my problem as those mostly deal with people using I/O operations or thread creation overheads. My problem is that my threadpool or worker-task structure implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure if it's the ThreadPool, the task itself, how I test it, the nature of threads or something out of my control.

// Sorry for the long code
#include <vector>
#include <queue>

#include <thread>
#include <mutex>
#include <future>

#include "task.hpp"

class ThreadPool
{
public:
    ThreadPool()
    {
        for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
            m_workers.emplace_back(this, i);

        m_running = true;
        for (auto&& worker : m_workers)
            worker.start();
    }
    ~ThreadPool()
    {
        m_running = false;
        m_task_signal.notify_all();
        for (auto&& worker : m_workers)
            worker.terminate();
    }

    void add_task(Task* task)
    {
        {
            std::unique_lock<std::mutex> lock(m_in_mutex);
            m_in.push(task);
        }
        m_task_signal.notify_one();
    }
private:
    class Worker
    {
    public:
        Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
        {}
        ~Worker()
        {
            terminate();
        }

        void start()
        {
            m_thread = new std::thread(&Worker::work, this);
        }
        void terminate()
        {
            if (m_thread)
            {
                if (m_thread->joinable())
                {
                    m_thread->join();
                    delete m_thread;
                    m_thread = nullptr;
                    m_parent = nullptr;
                }
            }
        }
    private:
        void work()
        {
            while (m_parent->m_running)
            {               
                std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
                m_parent->m_task_signal.wait(lock, [&]()
                {
                    return !m_parent->m_in.empty() || !m_parent->m_running;
                });

                if (!m_parent->m_running) break;
                Task* task = m_parent->m_in.front();
                m_parent->m_in.pop();
                // Fixed the mutex being locked while the task is executed
                lock.unlock();

                task->execute();            
            }
        }
    private:
        ThreadPool* m_parent = nullptr;
        unsigned m_id = 0;

        std::thread* m_thread = nullptr;
    };
private:
    std::vector<Worker> m_workers;

    std::mutex m_in_mutex;
    std::condition_variable m_task_signal;
    std::queue<Task*> m_in;

    bool m_running = false;
};

class TestTask : public Task
{
public:
    TestTask() {}
    TestTask(unsigned number) : m_number(number) {}

    inline void Set(unsigned number) { m_number = number; }

    void execute() override
    {
        if (m_number <= 3)
        {
            m_is_prime = m_number > 1;
            return;
        }
        else if (m_number % 2 == 0 || m_number % 3 == 0)
        {
            m_is_prime = false;
            return;
        }
        else
        {
            for (unsigned i = 5; i * i <= m_number; i += 6)
            {
                if (m_number % i == 0 || m_number % (i + 2) == 0)
                {
                    m_is_prime = false;
                    return;
                }
            }
            m_is_prime = true;
            return;
        }
    }
public:
    unsigned m_number = 0;
    bool m_is_prime = false;
};

int main()
{
    ThreadPool pool;

    unsigned num_tasks = 1000000;
    std::vector<TestTask> tasks(num_tasks);
    for (auto&& task : tasks)
        task.Set(randint(0, 1000000000));

    auto s = std::chrono::high_resolution_clock::now();
    #if MT
    for (auto&& task : tasks)
        pool.add_task(&task);
    #else
    for (auto&& task : tasks)
        task.execute();
    #endif
    auto e = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
}

Benchmarks with VS2013 Profiler:

10,000,000 tasks:
    MT:
        13 seconds of wall clock time
        93.36% is spent in msvcp120.dll
        3.45% is spent in Task::execute() // Not good here
    ST:
        0.5 seconds of wall clock time
        97.31% is spent with Task::execute()
like image 275
James Nguyen Avatar asked Jan 25 '16 05:01

James Nguyen


1 Answers

Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.

But I will try to explain your results without it. First of all, you have one mutex across all your threads. So only one thread at a time can execute some task. It kills all your gains you might have. In spite of your threads your code is perfectly serial. So at the very least make your task execution out of the mutex. You need to lock the mutex only to get a task out of the queue — you don't need to hold it when the task gets executed.

Next, your tasks are so simple that single thread will execute them in no time. You just can't measure any gains with such tasks. Create some heavy tasks which could produce some more interesting results(some tasks which are closer to the real world, not such contrived).

And the 3rd point: threads are not without their cost — context switching, mutex contention etc. To have real gains, as the previous 2 points say, you need to have tasks which take more time than the overheads threads introduce and the code should be truly parallel instead of waiting on some resource making it serial.

UPD: I looked at the wrong part of the code. The task is complex enough provided you create tasks with sufficiently large numbers.


UPD2: I've played with your code and found a good prime number to show how the MT code is better. Use the following prime number: 1019048297. It will give enough computation complexity to show the difference.

But why your code doesn't produce good results? It is hard to tell without seeing the implementation of randint() but I take it is pretty simple and in a half of the cases it returns even numbers and other cases produce not much of big prime numbers either. So the tasks are so simple that context switching and other things around your particular implementation and threads in general consume more time than the computation itself. Using the prime number I gave you give the tasks no choice but spend time computing — no easy answer since the number is big and actually prime. That's why the big number will give you the answer you seek — better time for the MT code.

like image 114
ixSci Avatar answered Oct 02 '22 21:10

ixSci