I've created a simple test to check whether std::memory_order_relaxed
is faster than std::memory_order_seq_cst
for an atomic<int> increment. However, the performance was the same in both cases.
My compiler: gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
Build arguments: g++ -m64 -O3 main.cpp -std=c++17 -lpthread
CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz, 4 cores, 2 threads per core
Test code:
#include <vector>
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <functional>

std::atomic<int> cnt = {0};

void run_test_order_relaxed()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int n = 0; n < 30000000; ++n) {
                cnt.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    std::cout << "rel: " << cnt.load(std::memory_order_relaxed);
    for (auto& t : v)
        t.join();
}

void run_test_order_cst()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int n = 0; n < 30000000; ++n) {
                cnt.fetch_add(1, std::memory_order_seq_cst);
            }
        });
    }
    std::cout << "cst: " << cnt.load(std::memory_order_seq_cst);
    for (auto& t : v)
        t.join();
}

void measure_duration(const std::function<void()>& func)
{
    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    func();
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(t2 - t1).count();
    std::cout << " duration: " << duration << "ms" << std::endl;
}

int main()
{
    measure_duration(&run_test_order_relaxed);
    measure_duration(&run_test_order_cst);
    return 0;
}
Why do std::memory_order_relaxed
and std::memory_order_seq_cst
always produce almost the same results?
Result:
rel: 2411 duration: 4440ms
cst: 120000164 duration: 4443ms
The default is std::memory_order_seq_cst which establishes a single total ordering over all atomic operations tagged with this tag: all threads see the same order of such atomic operations and no memory_order_seq_cst atomic operations can be reordered.
memory_order_acquire: syncs reads of this atomic variable AND makes sure relaxed writes done before it are synced as well (does this mean all atomic variables on all threads are synced?). memory_order_release: pushes the atomic store out to other threads (but only if they read the variable with consume/acquire).
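For context, the usual release/acquire pairing looks something like the minimal sketch below (a separate illustration, assuming a simple producer/consumer flag). It synchronizes only the two threads involved, through the one atomic variable they share; it does not sync all atomic variables on all threads.

#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag used for synchronization

void producer()
{
    payload = 42;                                  // A: non-atomic write
    ready.store(true, std::memory_order_release);  // B: release store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // C: acquire load
        ;                                          // spin until B is observed
    // Because C (acquire) read the value written by B (release),
    // everything sequenced before B -- including A -- is visible here.
    assert(payload == 42);
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}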
Regardless of the memory order setting, you are requiring an atomic operation in both loops. It turns out that on x86 processors, which are inherently strongly ordered in most situations, this results in the same assembly being emitted for each fetch_add: lock xadd. This atomic read-modify-write is always sequentially consistent on x86, so there are no optimization opportunities here when specifying relaxed memory order.
Using relaxed memory order allows further optimizations of surrounding operations, but your code doesn't provide any further optimization opportunities, so the emitted code is the same. Note that the results may have been different with a weakly-ordered processor (e.g., ARM) or with more data manipulation within the loop (which could offer more reordering opportunities).
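To see a case where the orders do diverge on x86, compare plain stores rather than read-modify-write operations: with GCC or Clang at -O2, a seq_cst store is typically emitted as xchg (or mov plus mfence), while a relaxed store is an ordinary mov. A minimal sketch to inspect in a tool such as Compiler Explorer (typical codegen noted in the comments; exact output depends on compiler and version):

#include <atomic>

std::atomic<int> flag{0};

// Typically compiles to a plain `mov` on x86-64.
void store_relaxed()
{
    flag.store(1, std::memory_order_relaxed);
}

// Typically compiles to `xchg` (or `mov` + `mfence`) on x86-64,
// which is considerably more expensive than a plain `mov`.
void store_seq_cst()
{
    flag.store(1, std::memory_order_seq_cst);
}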
From cppreference (my italics):
std::memory_order specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation.
The paper Memory Models for C/C++ Programmers provides much greater detail on this.
As a side note, repeatedly running atomic benchmarks, or running them on different x86 processors (even from the same manufacturer), may produce dramatically different results, as the threads might not be distributed equally across all the cores, and cache latencies depend on whether the data is on the local core, another core on the same chip, or another chip entirely. The results are also affected by how the particular processor handles potential consistency conflicts. Furthermore, the L1, L2 and L3 caches behave differently, as does RAM, so the total size of the data set also has significant effects. See Evaluating the Cost of Atomic Operations on Modern Architectures.
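If you want to reduce that run-to-run variance, one option is to pin each benchmark thread to a fixed core. The sketch below is Linux/glibc-specific (pthread_setaffinity_np via std::thread::native_handle) and is only an illustration, with error handling omitted:

#include <atomic>
#include <thread>
#include <vector>
#include <pthread.h>   // pthread_setaffinity_np (glibc extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET

std::atomic<int> counter{0};

int main()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int i = 0; i < 30000000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });

        // Pin thread n to core n so the work distribution is the same
        // on every run.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(n, &set);
        pthread_setaffinity_np(v.back().native_handle(), sizeof(set), &set);
    }
    for (auto& t : v)
        t.join();
}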