
Why is memory_order_relaxed performance the same as memory_order_seq_cst?

Tags: c++, c++11

I've created a simple test to check how much faster std::memory_order_relaxed is than std::memory_order_seq_cst for incrementing an atomic<int>. However, the performance was the same in both cases.
My compiler: gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
Build arguments: g++ -m64 -O3 main.cpp -std=c++17 -lpthread
CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz, 4 cores, 2 threads per core
Test code:

#include <vector>
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <functional>

std::atomic<int> cnt = {0};

void run_test_order_relaxed()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int n = 0; n < 30000000; ++n) {
                cnt.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    std::cout << "rel: " << cnt.load(std::memory_order_relaxed);
    for (auto& t : v)
        t.join();
}

void run_test_order_cst()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int n = 0; n < 30000000; ++n) {
                cnt.fetch_add(1, std::memory_order_seq_cst);
            }
        });
    }
    std::cout << "cst: " << cnt.load(std::memory_order_seq_cst);
    for (auto& t : v)
        t.join();
}

void measure_duration(const std::function<void()>& func)
{
    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    func();
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>( t2 - t1 ).count();
    std::cout << " duration: " << duration << "ms" << std::endl;
}

int main()
{
    measure_duration(&run_test_order_relaxed);
    measure_duration(&run_test_order_cst); 
    return 0;
}

Why do std::memory_order_relaxed and std::memory_order_seq_cst always produce almost the same results?
Result:
rel: 2411 duration: 4440ms
cst: 120000164 duration: 4443ms

asked Dec 16 '18 18:12 by Kiryl




1 Answer

Regardless of the memory order you specify, both loops require an atomic read-modify-write operation. On x86 processors, which are inherently strongly ordered in most situations, the compiler ends up emitting the same instruction for each fetch_add: lock xadd. A lock-prefixed instruction on x86 is always sequentially consistent, so specifying relaxed memory order leaves nothing to optimize here.
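One way to check this yourself is to put the two increments into separate functions and compare the generated assembly. The following is only a minimal sketch; the names counter, add_relaxed and add_seq_cst are hypothetical and do not come from the question. Compiled with something like g++ -O3 -S on x86-64, I would expect both function bodies to contain the same lock xadd instruction.

```cpp
#include <atomic>

// Hypothetical counter, used only for this illustration.
std::atomic<int> counter{0};

// Relaxed increment: requests atomicity only, no ordering guarantees.
int add_relaxed()
{
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Sequentially consistent increment: atomicity plus a single total order.
int add_seq_cst()
{
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```

Both functions return the previous value, which is why xadd is needed. When the return value is discarded, as in the loops in the question, the compiler may choose lock add instead, but that is still a locked read-modify-write with the same ordering behaviour.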

Using relaxed memory order allows further optimizations of surrounding operations, but your code doesn't provide any further optimization opportunities, so the emitted code is the same. Note that the results may have been different with a weakly-ordered processor (e.g., ARM) or with more data manipulation within the loop (which could offer more reordering opportunities).
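For contrast, a plain store is an operation where the memory order usually does change the emitted code, even on x86. This is a sketch under the assumption of GCC or Clang targeting x86-64; flag, store_relaxed and store_seq_cst are hypothetical names. The relaxed store should compile to an ordinary mov, while the seq_cst store needs a full barrier (typically xchg, or mov followed by mfence).

```cpp
#include <atomic>

// Hypothetical flag, used only for this illustration.
std::atomic<int> flag{0};

// Relaxed store: on x86-64 this is expected to be a plain mov.
void store_relaxed(int value)
{
    flag.store(value, std::memory_order_relaxed);
}

// Sequentially consistent store: on x86-64 this is expected to need
// a full barrier, e.g. xchg or mov + mfence.
void store_seq_cst(int value)
{
    flag.store(value, std::memory_order_seq_cst);
}
```

A benchmark built around stores rather than fetch_add would therefore be more likely to show a difference between the two orders on this hardware.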

From cppreference (my italics):

std::memory_order specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation.
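To make that sentence concrete, here is a small message-passing sketch. The names payload, ready, producer, consumer and run_message_passing are made up for this example. The release store and the acquire load order the non-atomic access to payload around the atomic flag; with memory_order_relaxed on both sides, reading payload in the consumer would be a data race.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // regular, non-atomic data
std::atomic<bool> ready{false};  // atomic flag that publishes the payload

void producer()
{
    payload = 42;                                  // non-atomic write
    ready.store(true, std::memory_order_release);  // publish: the write above
                                                   // cannot move below this
}

int consumer()
{
    // Acquire load: once it returns true, the producer's earlier
    // write to payload is visible to this thread.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_message_passing()
{
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

The loop in the question has no such surrounding accesses, so there is nothing for a weaker order to relax.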

The paper Memory Models for C/C++ Programmers provides much greater detail on this.

As a side note, repeatedly running atomic benchmarks, or running them on different x86 processors (even from the same manufacturer), may produce dramatically different results, because the threads might not be distributed equally across the cores, and cache latency depends on whether the line is held by the local core, another core on the same chip, or a core on another chip. The results also depend on how the particular processor handles potential consistency conflicts. Furthermore, level 1, 2 and 3 caches behave differently, as does RAM, so the total size of the data set also has a significant effect. See Evaluating the Cost of Atomic Operations on Modern Architectures.
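If you want to repeat the measurement, a slightly more careful harness may help. In the question's output, "rel: 2411" comes from reading the counter before the threads are joined, and "cst: 120000164" still includes the increments from the first run. The sketch below is one possible arrangement, not a definitive benchmark; RunResult and run_increment_test are hypothetical names. It uses a fresh counter per run, reads it only after join, and takes the memory order as a template parameter so it stays a compile-time constant.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical result type: final counter value and elapsed wall time.
struct RunResult {
    long long count;
    double millis;
};

// Runs `threads` workers that each perform `iterations` increments
// with the given memory order, and times the whole run.
template <std::memory_order Order>
RunResult run_increment_test(int threads, int iterations)
{
    std::atomic<long long> counter{0};  // fresh counter for every run
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.fetch_add(1, Order);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // wait for all increments before reading the counter
    }

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return {counter.load(), elapsed.count()};
}
```

Calling run_increment_test<std::memory_order_relaxed>(4, 30000000) and the seq_cst variant several times each, and comparing the spread of the timings, gives a better picture than a single run of each.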

answered Jan 04 '23 12:01 by rsjaffe