A simple multithreaded C++11 program in which all threads lock the same mutex in a tight loop.
With 8 threads (the number of logical CPUs) it reaches about 5 million locks per second.
But add just one more thread and the performance drops to 200,000 per second!
Edit:
Under g++ 4.8.2 (Ubuntu x64) there is no performance degradation at all, even with 100 threads (and more than twice the throughput, but that's another story). So this does seem to be a problem specific to the VC++ mutex implementation.
I reproduced it with the following code (Windows 7 x64):
#include <chrono>
#include <thread>
#include <memory>
#include <mutex>
#include <atomic>
#include <sstream>
#include <iostream>
using namespace std::chrono;
void thread_loop(std::mutex* mutex, std::atomic_uint64_t* counter)
{
    while (true)
    {
        std::unique_lock<std::mutex> ul(*mutex);
        counter->operator++();
    }
}
int _tmain(int argc, _TCHAR* argv[])
{
    int threads = 9;
    std::mutex mutex;
    std::atomic_uint64_t counter = 0;
    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        new std::thread(&thread_loop, &mutex, &counter);
    std::cout << "Started " << threads << " threads.." << std::endl;
    while (1)
    {
        counter = 0;
        std::this_thread::sleep_for(seconds(1));
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}
The VS 2013 profiler tells me that most of the time (95.7%) is wasted in a tight loop (line 697 in rtlocks.cpp):
while (IsBlocked() && spinWait._SpinOnce())
{
    //_YieldProcessor is called inside _SpinOnce
}
What could be the cause? How can this be improved?
OS: Windows 7 x64
CPU: i7-3770, 4 cores (8 logical CPUs with Hyper-Threading)
As far as I know, if you create more threads than CPU cores, the scheduler manages their execution by deciding, for each thread, which core runs it and for how long. When that time slice ends, the thread is preempted and another thread runs on that core.
The case of creating too many threads: a job takes longer to finish if we spawn thousands of threads, because time is spent switching between their contexts. Use a thread pool to run the work rather than creating new threads manually, so that the number of running threads stays balanced.
Nothing in the C++ standard limits the number of threads. However, the OS will certainly have a hard limit. Having too many threads decreases the throughput of your application, so it's recommended that you use a thread pool.
The size of the thread pool and the host operating system impact performance and processor utilization. In general, Endeca recommends using one thread per processor or core for good performance in most cases.
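The thread-pool suggestion can be sketched roughly as follows. This is a minimal illustration, not a production design: `SimpleThreadPool` and `run_tasks_on_pool` are hypothetical names invented for this example, and the pool simply drains its queue when it is destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal fixed-size pool: a fixed set of workers pulls
// tasks from one shared queue instead of one thread being created per task.
class SimpleThreadPool {
public:
    explicit SimpleThreadPool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers finish the queued tasks, then joins them.
    ~SimpleThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    SimpleThreadPool(const SimpleThreadPool&) = delete;
    SimpleThreadPool& operator=(const SimpleThreadPool&) = delete;

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: started after the rest exists
};

// Hypothetical helper: runs `tasks` trivial jobs on `workers` threads
// and returns how many of them completed.
std::uint64_t run_tasks_on_pool(std::size_t workers, std::size_t tasks) {
    std::atomic<std::uint64_t> done(0);
    {
        SimpleThreadPool pool(workers);
        for (std::size_t i = 0; i < tasks; ++i)
            pool.submit([&done] { done.fetch_add(1, std::memory_order_relaxed); });
    }  // the pool destructor drains the queue and joins the workers here
    return done.load();
}
```

Calling `run_tasks_on_pool(8, 1000)` would run a thousand small jobs on eight threads, so the thread count stays fixed no matter how much work is queued.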
With 8 threads your code is spinning, but each thread gets the lock before it loses its time slice, so the CPU does not have to suspend it.
As you add more threads the contention level increases, and with it the chance that a thread will not be able to acquire the lock within its time slice. When this happens the thread is suspended and a context switch occurs to another thread, which is then examined to see whether it can be woken up.
All this switching, suspending, and waking up requires a transition from user mode to kernel mode, which is an expensive operation, so performance is significantly impacted.
To improve things, either reduce the number of threads contending for the lock or increase the number of cores available. In your example you're using a std::atomic number, so you don't need to lock in order to call ++ on it, since it's already thread safe.
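To check how throughput changes with the thread count on a particular machine, the question's loop can be bounded so that it stops and reports a total. The following is only a sketch: `count_locked_increments` is a hypothetical helper name, and the stop flag and join logic are additions that are not in the original code.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: runs `threads` workers that all lock one mutex in a
// tight loop for roughly `duration`, then returns the number of acquisitions.
std::uint64_t count_locked_increments(unsigned threads,
                                      std::chrono::milliseconds duration) {
    assert(threads > 0);
    std::mutex mutex;
    std::uint64_t counter = 0;      // guarded by `mutex`
    std::atomic<bool> stop(false);  // tells the workers when to quit

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i) {
        workers.emplace_back([&] {
            while (!stop.load(std::memory_order_relaxed)) {
                std::lock_guard<std::mutex> lock(mutex);
                ++counter;
            }
        });
    }

    std::this_thread::sleep_for(duration);
    stop.store(true, std::memory_order_relaxed);
    for (std::thread& t : workers) t.join();
    return counter;  // safe to read: every worker has been joined
}
```

Calling it in a loop for 1 to 16 threads with a one-second duration and printing each result would show where, if anywhere, the drop appears for a given compiler and OS.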
The mutex causes contention between the threads anyway. However, if you try to use more threads than you have cores, then even if they are all ready, not all of them can run at once, so they need to keep stopping and starting, which is known as context switching.
The only way you can "solve" this is to use fewer threads or get more cores.
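One way to apply the "fewer threads" advice is to cap the worker count at what the hardware reports. This is a small sketch: `capped_thread_count` is a hypothetical helper, and falling back to a single thread when the count is unknown is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: never start more workers than there are logical CPUs.
unsigned capped_thread_count(unsigned requested) {
    assert(requested > 0);
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;  // the standard allows 0 when the value is unknown; assumed fallback
    return std::min(requested, hw);
}
```

On the 8-logical-CPU machine from the question, `capped_thread_count(9)` would be expected to return 8, provided the runtime reports all eight.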
Your problem is that 8 threads are storing to a shared resource. (Loads alone would not be a problem: reading a shared resource that is never modified is safe, and no lock is needed.)
Writing lock-free algorithms is hard, but for your problem there is a simple way: use std::atomic<uint64_t> and delete the mutex. Incrementing an atomic number is atomic by default (no special memory ordering needs to be specified).
#include <chrono>
#include <thread>
#include <memory>
#include <atomic>
#include <sstream>
#include <iostream>
using namespace std::chrono;
void thread_loop(std::atomic<uint64_t>* counter)
{
    while (true)
    {
        (*counter)++;
    }
}
int main(int argc, char* argv[])
{
    int threads = 9;
    std::atomic<uint64_t> counter(0);
    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        new std::thread(&thread_loop, &counter);
    std::cout << "Started " << threads << " threads.." << std::endl;
    while (1)
    {
        std::this_thread::sleep_for(seconds(1));
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}
This may be faster. Enjoy ;-)
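Even without a mutex, every increment still writes to the same shared variable, so the threads keep contending for it. A further option, which is not part of the answer and is shown here only as a sketch, is to count in a thread-local variable and publish to the shared atomic in batches. The function name `batched_count` and the batch-size parameter are assumptions made for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: each worker counts `per_thread` events locally and
// adds them to the shared total once per `batch` events.
std::uint64_t batched_count(unsigned threads,
                            std::uint64_t per_thread,
                            std::uint64_t batch) {
    assert(threads > 0 && batch > 0);
    std::atomic<std::uint64_t> total(0);

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i) {
        workers.emplace_back([&total, per_thread, batch] {
            std::uint64_t local = 0;  // private to this thread, no contention
            for (std::uint64_t n = 0; n < per_thread; ++n) {
                ++local;
                if (local == batch) {
                    total.fetch_add(local, std::memory_order_relaxed);
                    local = 0;
                }
            }
            total.fetch_add(local, std::memory_order_relaxed);  // flush the remainder
        });
    }
    for (std::thread& t : workers) t.join();
    return total.load();
}
```

The trade-off is that a reader of the shared total sees a value that can lag by up to one batch per thread, which may or may not be acceptable depending on what the counter is used for.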