
VC++: Performance drop x20 when more threads than cpus but not under g++

A simple multithreaded C++11 program in which all threads lock the same mutex in a tight loop.

With 8 threads (the number of logical CPUs) it reaches about 5 million locks per second.

But add just one more thread and the throughput drops to about 200,000 locks per second!

Edit:

Under g++ 4.8.2 (Ubuntu x64) there is no performance degradation at all, even with 100 threads (and throughput is more than twice as high, but that's another story). So this does seem to be a problem specific to the VC++ mutex implementation.

I reproduced it with the following code (Windows 7 x64):

#include <chrono>
#include <cstdint>
#include <thread>
#include <memory>
#include <mutex>
#include <atomic>
#include <sstream>
#include <iostream>

using namespace std::chrono;

// Every worker repeatedly takes the same mutex and bumps the shared counter.
void thread_loop(std::mutex* mutex, std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        std::unique_lock<std::mutex> ul(*mutex);
        ++(*counter);
    }
}

int main()
{
    int threads = 9;
    std::mutex mutex;
    std::atomic<std::uint64_t> counter(0);

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        std::thread(&thread_loop, &mutex, &counter).detach();  // workers run until the process is killed

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        // Reset, wait one second, then report the locks taken in that second.
        counter = 0;
        std::this_thread::sleep_for(seconds(1));
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

The VS 2013 profiler tells me that most of the time (95.7%) is spent in a tight loop (line 697 in rtlocks.cpp):

while (IsBlocked() && spinWait._SpinOnce())
{
    // _YieldProcessor is called inside _SpinOnce
}

What could be the cause? How can this be improved?

OS: Windows 7 x64

CPU: i7 3770, 4 cores (8 logical CPUs with Hyper-Threading)
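
For comparing thread counts within a single run, a bounded variant of the repro may be handy. The following is only a sketch: `locks_per_second` is a hypothetical helper, not part of the original post, and the figure it returns is approximate because the workers start slightly before the timer does.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): runs `threads` workers
// that all lock one mutex in a tight loop for roughly `duration`, then
// returns the approximate number of lock acquisitions per second.
double locks_per_second(int threads, std::chrono::milliseconds duration)
{
    std::mutex mutex;
    std::uint64_t counter = 0;           // protected by `mutex`
    std::atomic<bool> stop(false);

    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([&] {
            while (!stop.load(std::memory_order_relaxed))
            {
                std::lock_guard<std::mutex> lock(mutex);
                ++counter;
            }
        });

    const auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(duration);
    stop.store(true);
    for (auto& w : workers)
        w.join();                        // after join, reading `counter` is safe
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return static_cast<double>(counter) / elapsed.count();
}
```

Calling it with 8 and then 9 threads (for example `locks_per_second(9, std::chrono::milliseconds(1000))`) should make the drop described above visible on an affected toolchain; the actual numbers depend on the machine and the standard library.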

GabiMe asked Jan 21 '14 10:01



3 Answers

With 8 threads your code is spinning, but each thread gets the lock before the scheduler has to suspend it, that is, before it loses its time slice.

As you add more threads the contention level increases, and with it the chance that a thread will not be able to acquire the lock within its time slice. When this happens the thread is suspended and a context switch occurs to another thread, which the scheduler will examine to see whether it can be woken up.

All this switching, suspending and waking up requires a transition from user mode to kernel mode, which is an expensive operation, so performance is significantly impacted.

To improve things, either reduce the number of threads contending for the lock or increase the number of cores available. In your example you're using a std::atomic number, so you don't need a lock in order to call ++ on it, since it's already thread safe.
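
When a mutex really is required, one way to reduce contention is to touch it less often. The sketch below assumes the work can be accumulated privately and published in batches; `count_in_batches` and its parameters are hypothetical names introduced for illustration only.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: each worker counts into a private variable and takes
// the shared mutex only once per `batch` increments, instead of on every one.
std::uint64_t count_in_batches(int threads, std::uint64_t per_thread,
                               std::uint64_t batch)
{
    std::mutex mutex;
    std::uint64_t total = 0;             // protected by `mutex`

    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([&] {
            std::uint64_t local = 0;     // private to this worker, no lock needed
            for (std::uint64_t n = 0; n < per_thread; ++n)
            {
                if (++local == batch)
                {
                    std::lock_guard<std::mutex> lock(mutex);
                    total += local;      // publish a whole batch at once
                    local = 0;
                }
            }
            if (local != 0)              // publish the remainder
            {
                std::lock_guard<std::mutex> lock(mutex);
                total += local;
            }
        });

    for (auto& w : workers)
        w.join();
    return total;
}
```

With a batch size of 128, each worker acquires the lock roughly 128 times less often than in the original loop, so far fewer acquisitions can collide.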

Sean answered Oct 22 '22 06:10


The mutex causes contention between the threads anyway. However, if you use more threads than you have cores, then even if they are all ready, not all of them can run at once, so they have to keep stopping and starting. This is known as context switching.

The only way you can "solve" this is to use fewer threads or get more cores.
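
Using fewer threads can be as simple as capping the worker count at what the hardware reports. This is a minimal sketch; `capped_thread_count` is a hypothetical helper, and the fallback to a single worker when the core count is unknown is an assumption.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: caps a requested worker count at the number of
// hardware threads, always returning at least 1.
unsigned capped_thread_count(unsigned requested)
{
    unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
    if (hw == 0)
        hw = 1;                                         // assumed fallback
    return std::max(1u, std::min(requested, hw));
}
```

On the 8-logical-CPU machine from the question, `capped_thread_count(9)` would be expected to yield 8, provided the runtime reports all eight logical CPUs.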

doctorlove answered Oct 22 '22 07:10


Your problem is that multiple threads store to a shared resource. (Loads are different: loading a shared resource that is never modified is safe, and no lock is needed.)

  1. More threads than cores means
    • not every thread can have a CPU to itself
    • more task scheduling takes place
  2. The mutex
    • a thread that can't acquire the mutex will sleep and be put on a wait queue. (It seems the Windows mutex implementation spins briefly, then puts the thread on the wait queue if it still has not acquired the mutex?)
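
The spin-then-wait behaviour guessed at above can be illustrated with a toy lock. This is not the VC++ or Windows implementation: `SpinThenYieldLock`, the spin limit of 100, and the use of `std::this_thread::yield()` as a stand-in for a real wait queue are all assumptions made for the sketch.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Toy lock for illustration only (NOT the VC++ implementation): spin a
// bounded number of times, then give up the time slice and try again.
class SpinThenYieldLock
{
public:
    void lock()
    {
        const int spin_limit = 100;      // arbitrary, assumed value
        int spins = 0;
        while (locked_.exchange(true, std::memory_order_acquire))
        {
            if (++spins >= spin_limit)
            {
                std::this_thread::yield();   // stand-in for sleeping on a wait queue
                spins = 0;
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical driver: `threads` workers increment one counter under the lock.
std::uint64_t count_with_spin_lock(int threads, std::uint64_t per_thread)
{
    SpinThenYieldLock lock;
    std::uint64_t total = 0;             // protected by `lock`

    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([&] {
            for (std::uint64_t n = 0; n < per_thread; ++n)
            {
                std::lock_guard<SpinThenYieldLock> guard(lock);
                ++total;
            }
        });

    for (auto& w : workers)
        w.join();
    return total;
}
```

A lock that only spins burns a whole time slice whenever the current owner has been descheduled, which is the situation that becomes likely once there are more runnable threads than cores.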

Writing lock-free algorithms is hard, but for your problem there is a way.

  1. If you can get more cores, get them.
  2. Use std::atomic<uint64_t> and delete the mutex; incrementing an atomic is atomic by default (no explicit memory ordering needed).
  3. If the thread count is not fixed, set it to the number of cores and bind each thread to a core.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using namespace std::chrono;

// No mutex: the atomic increment is already thread safe.
void thread_loop(std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        ++(*counter);
    }
}

int main()
{
    int threads = 9;
    std::atomic<std::uint64_t> counter(0);

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        std::thread(&thread_loop, &counter).detach();  // workers run until the process is killed

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        // The counter is never reset here, so this prints a running total.
        std::this_thread::sleep_for(seconds(1));
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

This may be faster. Enjoy ;-)
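
The listing above runs forever, which makes it awkward to check. A bounded variant of the same lock-free idea might look like the sketch below; `count_lock_free` is a hypothetical helper, and the use of `std::memory_order_relaxed` is an assumption that holds only because nothing else is ordered against the counter.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each add 1 to a shared atomic
// `per_thread` times, with no mutex involved.
std::uint64_t count_lock_free(int threads, std::uint64_t per_thread)
{
    std::atomic<std::uint64_t> counter(0);

    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([&] {
            for (std::uint64_t n = 0; n < per_thread; ++n)
                counter.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
        });

    for (auto& w : workers)
        w.join();                        // join makes all increments visible here
    return counter.load();
}
```

The increments still contend for the same cache line, so this does not scale linearly with cores, but no thread is ever put to sleep waiting for another.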

大宝剑 answered Oct 22 '22 08:10