I executed the program below, in which I create 100 threads that run concurrently. Please note this is just a sample program; I understand multiple threads are not required for it, but my intention was to test the mutex.
#include <iostream>
#include <boost/asio/io_service.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

class ThreadPool
{
public:
    ThreadPool(int num = 10);
    ~ThreadPool();
    void AssignPool();
    void doSometask();
    void inc();
private:
    boost::asio::io_service ioService;
    boost::thread_group threadpool;
    boost::asio::io_service::work* work;
    volatile int p_size;
    int pool_sz;
    boost::mutex io_mutex; // with boost lock
};

void ThreadPool::AssignPool()
{
    std::cout << std::endl << "pool_sz=" << pool_sz << std::endl;
    for (int i = 0; i < pool_sz; i++)
    {
        ioService.post(boost::bind(&ThreadPool::doSometask, this));
    }
}

void ThreadPool::inc()
{
    p_size++;
}

void ThreadPool::doSometask()
{
    // boost::mutex::scoped_lock lock(io_mutex);
    for (int i = 0; i < 10000; i++) {
        inc();
    }
}

ThreadPool::ThreadPool(int num) : p_size(0)
{
    pool_sz = num;
    work = new boost::asio::io_service::work(ioService);
    for (int i = 0; i < num; i++)
    {
        threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
    }
}

ThreadPool::~ThreadPool()
{
    delete work;
    ioService.stop();
    threadpool.join_all();
}

int main()
{
    ThreadPool p1(100);
    p1.AssignPool();
}
Case 1: The program above was run with the line "boost::mutex::scoped_lock lock(io_mutex);" commented out, i.e. the "no mutex" case. The time taken by the program was:
real 0m1.386s
user 0m0.483s
sys 0m9.937s
Case 2: With mutex: however, when I run this program with the "boost::mutex::scoped_lock lock(io_mutex);" line enabled, it takes less time:
real 0m0.289s
user 0m0.067s
sys 0m0.230s
In my understanding, with the mutex the program should have taken much more time than without it. What went wrong here?
In your example you lock the mutex in doSometask(), and hence only one thread runs at any time, and it finishes its for loop before yielding to another task. The program therefore runs literally serially and no cache thrashing occurs.
Without the lock, all threads run whenever they get processor time, and assuming that the number of processors is significantly lower than 100, a lot of cache thrashing goes on at all levels (as Bo Persson wrote in the comments), and this increases the run time.
A better way to measure the impact of the lock on run time would be (a) to run only as many threads as your computer has cores, so that cache thrashing caused by context switches is minimized, and (b) to put the lock into the ThreadPool::inc() method, so that synchronization happens much more often. A sketch of variant (b) follows below.
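For illustration, a minimal sketch of variant (b), assuming the ThreadPool class from the question stays otherwise unchanged; only the lock moves from doSometask() into inc():

// The mutex now protects only the shared increment, so threads contend
// on every iteration instead of serializing whole tasks.
void ThreadPool::inc()
{
    boost::mutex::scoped_lock lock(io_mutex);
    p_size++;
}

void ThreadPool::doSometask()
{
    for (int i = 0; i < 10000; i++) {
        inc(); // each call acquires and releases the mutex
    }
}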
As a bonus, you could run the lock-free method correctly by declaring p_size as std::atomic<int> (C++11) and see the impact of mutex-based synchronization versus the use of atomics.
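A minimal, self-contained sketch of that comparison point, assuming C++11 (the thread count of 4 and the iteration count are arbitrary choices for this example):

#include <atomic>
#include <iostream>
#include <boost/thread.hpp>

std::atomic<int> p_size(0);

void inc()
{
    // Lock-free atomic increments; no mutex needed for correctness.
    for (int i = 0; i < 10000; ++i)
        p_size.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    boost::thread_group threads;
    for (int i = 0; i < 4; ++i)
        threads.create_thread(&inc);
    threads.join_all();
    std::cout << "p_size = " << p_size << std::endl; // always prints 40000
}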
As commented, "one at a time in an orderly fashion" works better than everyone rushing in at once, but that is not the only reason. Most likely the main one is that the amount of work you give each thread inside doSometask() is simply too small for modern CPUs.
Update your doSometask to do more work, and make the threads less dependent on colliding with each other through constant access to shared data:
#include <iostream>
#include <chrono>
#include <atomic>
#include <cstdio>    // printf
#include <cstring>   // memset
#include <string>    // std::stoi
#include <algorithm> // std::min
#include <thread>    // std::this_thread::sleep_for
#include <boost/asio/io_service.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

class ThreadPool
{
public:
    ThreadPool(int num = 10, int cycles = 10000);
    ~ThreadPool();
    void inc(volatile int* x);
    void AssignPool();
    void doSometask(volatile int* x);
    void AssignPoolSync();
    void doSometaskSync(volatile int* x);
private:
    boost::asio::io_service ioService;
    boost::thread_group threadpool;
    boost::asio::io_service::work* work;
    std::atomic<int> p_size;
    int* xsize;
    int pool_sz, cycles;
    boost::mutex io_mutex; // with boost lock
};

void ThreadPool::AssignPool()
{
    for (int i = 0; i < pool_sz; ++i)
        ioService.post(boost::bind(&ThreadPool::doSometask, this, &xsize[i]));
}

void ThreadPool::AssignPoolSync()
{
    for (int i = 0; i < pool_sz; ++i)
        ioService.post(boost::bind(&ThreadPool::doSometaskSync, this, &xsize[i]));
}

void ThreadPool::inc(volatile int* x)
{
    *x = *x + 1;
}

void ThreadPool::doSometask(volatile int* x)
{
    for (int i = 0; i < cycles; ++i)
    {
        inc(x);
        if ((i & 255) == 0)   // note the parentheses: == binds tighter than &
            p_size++;         // access shared data every 256 cycles
    }
}

void ThreadPool::doSometaskSync(volatile int* x)
{
    boost::mutex::scoped_lock lock(io_mutex);
    doSometask(x);
}

ThreadPool::ThreadPool(int num, int cycles)
{
    pool_sz = num;
    p_size = 0;
    this->cycles = cycles;
    xsize = new int[num];
    memset(xsize, 0, num * sizeof(int));
    work = new boost::asio::io_service::work(ioService);
    for (int i = 0; i < pool_sz; ++i)
        threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
}

ThreadPool::~ThreadPool()
{
    delete work;
    ioService.stop();
    threadpool.join_all();
    delete[] xsize;
}

int main(int argc, const char** argv)
{
    const int C = argc > 1 ? std::stoi(argv[1]) : 10000; // number of cycles
    const int T = argc > 2 ? std::stoi(argv[2]) : 100;   // number of threads
    const int N = argc > 3 ? std::stoi(argv[3]) : 50;    // number of times to time execution
    long long t_min[2] = {0};
    for (int i = 0; i < N * 2; ++i)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // brief pause before each run
            ThreadPool pool(T, C);
            if (i & 1)
                pool.AssignPoolSync();
            else
                pool.AssignPool();
        } // the pool destructor joins all threads here
        auto t1 = std::chrono::high_resolution_clock::now();
        // keep the minimum time per variant (even i: plain, odd i: synchronized)
        t_min[i & 1] = std::min(i > 1 ? t_min[i & 1] : (t1 - t0).count(), (t1 - t0).count());
    }
    printf("timeSync / time: %f\n", (t_min[1] + 0.0) / (t_min[0] + 0.0));
}
Using this test you can simulate real work better: the jobs the threads run are mostly independent and only occasionally touch shared data. You can also run it with different parameters to change the number of loop cycles each thread runs and the number of threads.
These are the sample results I get when running it on a 4-core PC:
test> test.exe 10000 100
timeSync / time: 1.027782
test> test.exe 500000 100
timeSync / time: 3.531433
In other words, when each thread does only 10,000 cycles the synchronized version is almost as fast as the non-synchronized one, but when I increase the number of cycles to 500,000 the synchronized version becomes 3.5 times slower.
I am neither a computer scientist nor an OS expert. But whenever I try to compare the performance of two similar functions, instead of comparing the time taken by a single execution I run each function multiple times and compare the averages. (This approach may be wrong, but it works for me most of the time; I am open to input and comments from experts on this.) My reasoning is that because I am running on an OS, resources (mainly the processor) are not fully allocated to the application under observation; they are shared with many other processes at the same time.
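For instance, a minimal sketch of that averaging approach (the workload function and the run count of 1000 are arbitrary placeholders of mine; substitute the code under observation):

#include <chrono>
#include <cstdio>

// Hypothetical workload to be measured.
void workload()
{
    volatile int sink = 0;
    for (int i = 0; i < 1000000; ++i)
        sink = sink + 1;
}

int main()
{
    const int runs = 1000; // many runs smooth out interference from other processes
    long long total_ns = 0;
    for (int r = 0; r < runs; ++r)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        workload();
        auto t1 = std::chrono::high_resolution_clock::now();
        total_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }
    std::printf("average: %lld ns per run\n", total_ns / runs);
}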
I tried the same with your application and got the results below from executing it 1000 times:
nomutex: 11.97 user | 5.76 system | 0:20.55 elapsed | 86% CPU
withmutex: 30.78 user | 8.78 system | 0:43.67 elapsed | 90% CPU
Also, nowadays most devices have a multicore CPU, so I used the following link to force the OS to use only a single core: https://unix.stackexchange.com/a/23109
Hope this helps.