Program with boost::mutex takes less time than without the mutex

I executed the program below, which creates 100 threads running concurrently. Please note this is just a sample program; I understand multiple threads are not required for it, but my intention was to test the mutex.

#include <iostream>
#include <boost/asio/io_service.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

class ThreadPool{

    public:
    ThreadPool(int num = 10);
    ~ThreadPool();
    void AssignPool();
    void doSometask();
    void inc();
    private:
    boost::asio::io_service ioService;
    boost::thread_group threadpool;
    boost::asio::io_service::work * work;
    volatile int p_size;
    int pool_sz;
    boost::mutex io_mutex; // with boost lock

};

void ThreadPool::AssignPool()
{
        std::cout<<std::endl<<"pool_sz="<<pool_sz<<std::endl;
        for(int i=0;i<pool_sz;i++)
        {
                ioService.post(boost::bind(&ThreadPool::doSometask, this));
        }
}

void ThreadPool::inc()
{
        p_size++;
}

void ThreadPool::doSometask()
{
//      boost::mutex::scoped_lock lock(io_mutex); // toggled for the two cases below
        for(int i=0;i<10000;i++){
                inc();
        }
}

ThreadPool::ThreadPool(int num):p_size(0)
{
        pool_sz = num;
        work = new boost::asio::io_service::work(ioService);
        for(int i=0;i<num;i++)
        {
                threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
        }
}

ThreadPool::~ThreadPool()
{
        delete work;
        ioService.stop();
        threadpool.join_all();
}

int main()
{
        ThreadPool p1(100);
        p1.AssignPool();
}

Case 1 (no mutex): the program was executed with the "boost::mutex::scoped_lock lock(io_mutex);" line commented out. Time taken by the program:

real    0m1.386s
user    0m0.483s
sys 0m9.937s

Case 2 (with mutex): however, when I run the program with the mutex, i.e. with the "boost::mutex::scoped_lock lock(io_mutex);" line enabled, it takes less time:

real    0m0.289s
user    0m0.067s
sys 0m0.230s

My understanding is that with the mutex the program should have taken much more time than without it. What went wrong here?

asked Apr 03 '17 by SACHIN GOYAL

3 Answers

In your example you lock the mutex in doSometask(), so only one thread runs at a time, and each thread finishes its entire for loop before yielding to another task. The program therefore runs essentially serially, and no cache thrashing occurs.

Without the lock, all threads run whenever they get processor time, and assuming the number of processors is significantly lower than 100, a lot of cache thrashing goes on at all levels (as Bo Persson wrote in the comments), which increases the run time.

A better way to measure the impact of the lock on run time would be (a) to run only as many threads as your computer has cores, so that cache thrashing due to context switches is minimized, and (b) to put the lock into the ThreadPool::inc() method so that synchronization happens more often, as sketched below.
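For example, a minimal sketch of suggestion (b), reusing the io_mutex member from the question:

void ThreadPool::inc()
{
        // Locking per increment instead of per task: the threads now
        // contend on every iteration, so the cost of the mutex shows up.
        boost::mutex::scoped_lock lock(io_mutex);
        p_size++;
}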

As a bonus, you could implement the lock-free method properly by declaring p_size as std::atomic<int> (C++11) and compare the impact of mutex-based synchronization with the use of atomics.
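A sketch of that lock-free variant, assuming a C++11 compiler:

#include <atomic>

std::atomic<int> p_size;   // replaces "volatile int p_size;" in the class

void ThreadPool::inc()
{
        // A single atomic read-modify-write; unlike volatile, this is
        // actually safe under concurrent access and needs no mutex.
        p_size.fetch_add(1, std::memory_order_relaxed);
}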

answered by Gert Wollny


As commented, "one at a time in an orderly fashion" works better than everyone rushing in at once, but that is not the only factor. Most likely the main reason is that the work you gave each thread inside doSometask() is simply too little for modern CPUs. Update doSometask to do more work, and make the threads less dependent on colliding with one another through constant access to shared data:

#include <iostream>
#include <chrono>
#include <atomic>
#include <algorithm>  // std::min
#include <cstdio>     // printf
#include <cstring>    // memset
#include <string>     // std::stoi
#include <thread>     // std::this_thread::sleep_for
#include <boost/asio/io_service.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>


class ThreadPool
{
public:
    ThreadPool(int num = 10, int cycles = 10000);
    ~ThreadPool();

    void inc(volatile int* x);

    void AssignPool();
    void doSometask(volatile int* x);
    void AssignPoolSync();
    void doSometaskSync(volatile int* x);

private:
    boost::asio::io_service ioService;
    boost::thread_group threadpool;
    boost::asio::io_service::work * work;
    std::atomic<int> p_size;

    int *xsize;
    int pool_sz, cycles;
    boost::mutex io_mutex; // with boost lock
};

void ThreadPool::AssignPool()
{
    for (int i = 0; i<pool_sz; ++i)
        ioService.post(boost::bind(&ThreadPool::doSometask, this, &xsize[i]));
}

void ThreadPool::AssignPoolSync()
{
    for (int i=0; i<pool_sz; ++i)
        ioService.post(boost::bind(&ThreadPool::doSometaskSync, this, &xsize[i]));
}

void ThreadPool::inc(volatile int* x)
{
    *x = *x + 1;
}

void ThreadPool::doSometask(volatile int* x)
{
    for (int i=0; i<cycles; ++i)
    {
        inc(x);
        if ((i & 255) == 0)   // note the parentheses: == binds tighter than &
            p_size++; // access shared data every 256 cycles
    }
}

void ThreadPool::doSometaskSync(volatile int* x)
{
    boost::mutex::scoped_lock lock(io_mutex);
    doSometask(x);
}

ThreadPool::ThreadPool(int num, int cycles)
{
    pool_sz = num;
    p_size = 0;
    this->cycles = cycles;
    xsize = new int[num];
    memset(xsize, 0, num * sizeof(int));
    work = new  boost::asio::io_service::work(ioService);
    for (int i=0; i<pool_sz; ++i)
        threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
}

ThreadPool::~ThreadPool()
{
    delete work;
    ioService.stop();
    threadpool.join_all();
    delete[] xsize;
}

int main(int argc, const char** argv)
{
    const int C = argc>1 ? std::stoi(argv[1]) : 10000; // number of cycles
    const int T = argc>2 ? std::stoi(argv[2]) : 100; // number of threads
    const int N = argc>3 ? std::stoi(argv[3]) : 50; // number of times to time execution

    long long t_min[2] = {0};
    for (int i = 0; i<N*2; ++i)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // settle between runs
            ThreadPool pool(T, C);
            if (i&1)
                pool.AssignPoolSync();
            else
                pool.AssignPool();
        }
        auto t1 = std::chrono::high_resolution_clock::now();
        t_min[i&1] = std::min(i>1 ? t_min[i&1] : (t1-t0).count(), (t1-t0).count());
    }
    printf("timeSync / time: %f\n", (t_min[1] + 0.0) / (t_min[0] + 0.0));
}

This test better simulates real work: the jobs the threads run are mostly independent and only occasionally access shared data. You can also run it with different parameters to change the number of loop cycles each thread runs and the number of threads.

These are the sample results I get when running it on a 4-core PC:

test> test.exe 10000 100
timeSync / time: 1.027782

test> test.exe 500000 100
timeSync / time: 3.531433

In other words, when each thread does only 10000 cycles, the synchronized version is almost as fast as the non-synchronized one, but when I increase the number of cycles to 500000, the synchronized version is 3.5 times slower.

answered by Pavel P


I am neither a computer scientist nor an OS expert, but whenever I compare the performance of two similar functions, I run each one many times and compare the averages instead of timing a single execution. (This approach may be flawed, but it works for me most of the time; I am open to input/comments from experts on this.) My reasoning is that under an OS, resources (mainly the processor) are never fully allocated to the application under observation; they are shared with many other processes running at the same time.
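For illustration, a minimal sketch of that idea; run_once() is a hypothetical stand-in for the code under test:

#include <chrono>

// Time run_once() n times and report the average duration in
// milliseconds, to smooth out scheduling noise from other processes.
template <typename F>
double average_ms(F run_once, int n = 1000)
{
    using clock = std::chrono::high_resolution_clock;
    std::chrono::duration<double, std::milli> total(0);
    for (int i = 0; i < n; ++i)
    {
        auto t0 = clock::now();
        run_once();
        total += clock::now() - t0;
    }
    return total.count() / n;
}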

I did the same with your application, executing it 1000 times, and got the results below.

nomutex: 11.97 user | 5.76 system | 0:20.55 elapsed | 86% CPU

withmutex: 30.78 user | 8.78 system | 0:43.67 elapsed | 90% CPU

Nowadays most devices have multicore CPUs, so I used the approach in the link below to force the OS to use only a single core (it pins the process to one core, e.g. with taskset): https://unix.stackexchange.com/a/23109
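If you prefer doing it from inside the program instead of from the shell, here is a minimal Linux-only sketch using sched_setaffinity, which has the same effect as the taskset approach described in that link:

#include <sched.h>   // sched_setaffinity and the CPU_* macros (Linux only)

// Pin the calling process to core 0; returns 0 on success, -1 on error.
int pin_to_core0()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                // allow core 0 only
    return sched_setaffinity(0, sizeof(set), &set);  // pid 0 = this process
}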

Hope this will help you.

answered by Manthan Tilva