VC++: Performance drop x20 when more threads than cpus but not under g++

Tags: c++, multithreading, c++11, visual-studio-2013, visual-c++

A simple multithreaded C++11 program in which all threads lock the same mutex in a tight loop.

With 8 threads (the number of logical CPUs) it reaches 5 million locks/second.

But add just one more thread and the performance drops to 200,000/sec!

Edit:

Under g++ 4.8.2 (Ubuntu x64): no performance degradation at all, even with 100 threads! (And more than twice the performance, but that's another story.) So this does seem to be a problem specific to the VC++ mutex implementation.

I reproduced it with the following code (Windows 7 x64):

#include <chrono>
#include <thread>
#include <mutex>
#include <atomic>
#include <cstdint>
#include <iostream>

using namespace std::chrono;

// Every worker runs this loop: take the shared mutex, bump the counter, release.
void thread_loop(std::mutex* mutex, std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        std::unique_lock<std::mutex> ul(*mutex);
        ++(*counter);
    }
}

int main()
{
    int threads = 9;    // one more than the 8 logical CPUs
    std::mutex mutex;
    std::atomic<std::uint64_t> counter(0);

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        std::thread(&thread_loop, &mutex, &counter).detach();   // workers never exit

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        counter = 0;
        std::this_thread::sleep_for(seconds(1));
        // Number of locks taken during the last second
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

The VS 2013 profiler tells me that most of the time (95.7%) is wasted in a tight loop (line 697 in rtlocks.cpp):

while (IsBlocked() && spinWait._SpinOnce())
{
    // _YieldProcessor is called inside _SpinOnce
}

What could be the cause? How can this be improved?

OS: Windows 7 x64

CPU: i7 3770, 4 cores (x2 hyper-threading)

asked Jan 21 '14 10:01

GabiMe

3 Answers

With 8 threads your code is spinning, but each thread gets the lock before it loses its time slice, so it never has to be suspended.

As you add more threads, the contention level increases, and with it the chance that a thread will not be able to acquire the lock within its time slice. When that happens the thread is suspended and a context switch occurs to another thread, which the scheduler examines to see whether it can be woken up.

All this switching, suspending and waking up requires a transition from user mode to kernel mode, which is an expensive operation, so performance is significantly impacted.

To improve things, either reduce the number of threads contending for the lock or increase the number of cores available. In your example you're using a std::atomic counter, so you don't need a lock in order to call ++ on it, since it's already thread-safe.
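
To compare thread counts directly, a bounded version of the benchmark can help. The following is only a sketch: count_locks is a hypothetical helper, not something from the question, and it uses a plain counter guarded by the mutex plus a stop flag so the workers can be joined. It makes no claim about what numbers you will see on any particular compiler or OS.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: runs `threads` workers that all lock one mutex in a
// tight loop for roughly `duration`, then returns how many lock
// acquisitions completed in total.
std::uint64_t count_locks(int threads, std::chrono::milliseconds duration)
{
    assert(threads > 0);

    std::mutex mutex;
    std::uint64_t counter = 0;      // only touched while holding `mutex`
    std::atomic<bool> stop(false);  // tells the workers to finish

    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
    {
        workers.emplace_back([&] {
            while (!stop.load(std::memory_order_relaxed))
            {
                std::lock_guard<std::mutex> lock(mutex);
                ++counter;
            }
        });
    }

    std::this_thread::sleep_for(duration);
    stop.store(true);
    for (auto& w : workers)
        w.join();

    // All workers have been joined, so reading without the lock is safe.
    return counter;
}
```

Calling it once with the number of logical CPUs and once with one more thread, using the same duration, gives two totals that can be compared on the machine in question.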

answered Oct 22 '22 06:10

Sean

The mutex causes contention between the threads anyway. However, if you try to use more threads than you have cores, then even if they are all ready, they can't all run at once, so they will need to keep stopping and starting. This is known as context switching.

The only way you can "solve" this is to use fewer threads or get more cores.
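
One way to "use fewer threads" without hard-coding a number is to cap the worker count at what the standard library reports. This is a minimal sketch; choose_thread_count is a hypothetical name, and falling back to the requested value when no hint is available is an assumption of this example.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: clamp the requested number of workers to the number
// of logical CPUs reported by the standard library.
unsigned choose_thread_count(unsigned requested)
{
    assert(requested > 0);

    // hardware_concurrency() is only a hint and may return 0 if unknown.
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        return requested;   // assumption: no hint, so trust the caller

    return requested < hw ? requested : hw;
}
```

With this, a request for 9 threads on a machine that reports 8 logical CPUs would be reduced to 8.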

answered Oct 22 '22 07:10

doctorlove

Your problem is that the threads all store to a shared resource. (Loads alone would not be a problem: reading a shared resource that is never modified is safe and needs no lock.)

Having more threads than cores means:
- not every thread can have a CPU to itself
- the scheduler has to switch between tasks more often
The mutex means:
- a thread that can't acquire the mutex goes to sleep and is put on a wait queue. (It seems the Windows mutex implementation spins briefly and then queues the thread if it still hasn't acquired the mutex?)

Writing lock-free algorithms is hard, but for your problem there is a way.

- If you can get more cores, get them.
- Use std::atomic<uint64_t> and delete the mutex. Incrementing an atomic is atomic by default (no special memory order is needed).
- If the thread count doesn't have to be fixed, set it to the number of cores and bind each thread to a core.

#include <chrono>
#include <thread>
#include <atomic>
#include <cstdint>
#include <iostream>

using namespace std::chrono;

void thread_loop(std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        ++(*counter);   // atomic read-modify-write, no mutex needed
    }
}

int main()
{
    int threads = 9;
    std::atomic<std::uint64_t> counter(0);

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        std::thread(&thread_loop, &counter).detach();   // workers never exit

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        std::this_thread::sleep_for(seconds(1));
        // The counter is never reset here, so this prints the running total.
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

This may be faster. Enjoy ;-)
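
To check that no increments are lost without the mutex, a finite variant can be run and its total compared against the expected count. This is a sketch only: count_without_mutex is a hypothetical helper, and the use of std::memory_order_relaxed is an assumption of this example (it is enough for a pure counter that is read only after the workers have been joined; the default ordering works as well).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` workers increments a shared atomic
// counter `increments_per_thread` times with no mutex; returns the total.
std::uint64_t count_without_mutex(int threads, std::uint64_t increments_per_thread)
{
    assert(threads > 0);

    std::atomic<std::uint64_t> counter(0);
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
    {
        workers.emplace_back([&counter, increments_per_thread] {
            for (std::uint64_t n = 0; n < increments_per_thread; ++n)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }

    // join() synchronizes with each worker, so the final load sees every increment.
    for (auto& w : workers)
        w.join();

    return counter.load();
}
```

The returned total should equal threads multiplied by increments_per_thread, whatever the thread count.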

answered Oct 22 '22 08:10

大宝剑
