Why is std::mutex faster than std::atomic?

Tags:

I want to put objects in std::vector in multi-threaded mode. So I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?

I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.

Click to copy

#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while(sync.exchange(true)) {
                std::this_thread::yield();
            };
            first_result[counter] = counter;
            sync.store(false) ;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex; 
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);       
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    return 0;
}

339

asked Apr 09 '15 08:04

Sergey Malashenko

1 Answers

I don't think your question can be answered referring only to the standard- mutexes are as platform-dependent as they can be. However, there is one thing, that should be mentioned.

Mutexes are not slow. You may have seen some articles, that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right approach - these are not interchangeable.

Spin locks are considerably fast, when they are locked (acquired) for a relatively short amount of time - acquiring them is very cheap, but other threads, that are also trying to lock, are active for whole this time (running constantly in loop).

Custom spin-lock could be implemented this way:

Click to copy

class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
    : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while(_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear();
    }
};

Mutex is a primitive, that is much more complicated. In particular, on Windows, we have two such primitives - Critical Section, that works in per-process basis and Mutex, which doesn't have such limitation.

Locking mutex (or critical section) is much more expensive, but OS has the ability to really put other waiting threads to "sleep", which improves performance and helps tasks scheduler in efficient resources management.

Why I write this? Because modern mutexes are often so-called "hybrid mutexes". When such mutex is locked, it behaves like a normal spin-lock - other waiting threads perform some number of "spins" and then heavy mutex is locked to prevent from wasting resources.

In your case, mutex is locked in each loop iteration to perform this instruction:

Click to copy

second_result[counter] = omp_get_thread_num();

It looks like a fast one, so "real" mutex may never be locked. That means, that in this case your "mutex" can be as fast as atomic-based solution (because it becomes an atomic-based solution itself).

Also, in the first solution you used some kind of spin-lock-like behaviour, but I am not sure if this behaviour is predictable in multi-threaded environment. I am pretty sure, that "locking" should have acquire semantics, while unlocking is a release op. Relaxed memory ordering may be too weak for this use case.

I edited the code to be more compact and correct. It uses the std::atomic_flag, which is the only type (unlike std::atomic<> specializations), that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).

Also, referring to the comment below about "not yielding": it is a matter of specific case and requirements. Spin locks are very important part of multi-threaded programming and their performance can often be improved by slightly modifying its behavior. For example, Boost library implements spinlock::lock() as follows:

Click to copy

void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}

source: boost/smart_ptr/detail/spinlock_std_atomic.hpp

Where detail::yield() is (Win32 version):

Click to copy

inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}

[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]

First, thread spins for some fixed number of times (4 in this case). If mutex is still locked, pause instruction is used (if available) or Sleep(0) is called, which basically causes context-switch and allows scheduler to give another blocked thread a chance to do something useful. Then, Sleep(1) is called to perform actual (short) sleep. Very nice!

Also, this statement:

The purpose of a spinlock is busy waiting

is not entirely true. The purpose of spinlock is to serve as a fast, easy-to-implement lock primitive - but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):

In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain.

So, implementations like void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); } may not be as good as it seems.

answered Oct 06 '22 09:10

Mateusz Grzejek

Related questions
                            
                                Are const and & ignored in std::function?
                            
                                delete a NULL pointer does not call overloaded delete when destructor is written
                            
                                Checking the sizeof an integer type in the preprocessor
                            
                                how to print out all elements in a std::stack or std::queue conveniently
                            
                                aggregate 'std::stringstream out' has incomplete type and cannot be defined [C++] [closed]
                            
                                Detect operator support with decltype/SFINAE
                            
                                Not able to compile C/C++ code that's using ncurses [duplicate]
                            
                                C++ - must friend functions be defined in the header file?
                            
                                Double (two-dimensional) array using std::unique_ptr
                            
                                How to change text and background color?
                            
                                Why do functions/objects inside anonymous namespace have external linkage?
                            
                                What is the optimal algorithm for generating an unbiased random integer within a range?
                            
                                Calling overloaded member functions using std::thread
                            
                                Error when compiling Rcpp code in an R package using RStudio
                            
                                function which is able to return different types?
                            
                                How can I combine multiple char's to make a string?
                            
                                Passing `-std=c++11` to CMakeLists?
                            
                                Deleting array of pointers
                            
                                OpenMP C++ Matrix Multiplication run slower in parallel
                            
                                C++ Exception - Throw a String

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is std::mutex faster than std::atomic?

Tags:

c++

atomic

mutex

Sergey Malashenko

People also ask

1 Answers

Mateusz Grzejek

Recent Activity

Donate For Us