Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?

Question

This program will sometimes print 00, but if I comment out a.store and b.store and uncomment a.fetch_add and b.fetch_add which does the exact same thing i.e both set the value of a=1,b=1 , I never get 00. (Tested on an x86-64 Intel i3, with g++ -O2)

Am i missing something, or can "00" never occur by the standard?

This is the version with plain stores, which can print 00.

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        //a.fetch_add(1,memory_order_relaxed);
        a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        //b.fetch_add(1,memory_order_relaxed);
        b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d
",reta,retb);
        return 0;
}

The below never prints 00

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        a.fetch_add(1,memory_order_relaxed);
        //a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        b.fetch_add(1,memory_order_relaxed);
        //b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d
",reta,retb);
        return 0;
}

Look at this as well Multithreading atomics a b printing 00 for memory_order_relaxed

Peter Cordes · Accepted Answer

The standard allows 00, but you'll never get it on x86 (without compile-time reordering). The only way to implement atomic RMW on x86 involves a lock prefix, which is a "full barrier", which is strong enough for seq_cst.

In C++ terms, atomic RMWs are effectively promoted to seq_cst when compiling for x86. (Only after possible compile-time ordering is nailed down - e.g. non-atomic loads / stores can reorder / combine across a relaxed fetch_add, and so can other relaxed operations, and one-way reordering with acquire or release operations. Although compilers are less likely to reorder atomic ops with each other since they don't combine them, and doing so is one of the main reasons for compile-time reordering.)

In fact, most compilers implement a.store(1, mo_seq_cst) by using xchg (which has an implicit lock prefix), because it's faster than mov + mfence on modern CPUs, and turning 0 into 1 with lock add as the only write to each object is exactly identical. Fun fact: with just store and load, your code will compile to the same asm as https://preshing.com/20120515/memory-reordering-caught-in-the-act/, so the discussion there applies.

ISO C++ allows the whole relaxed RMW to reorder with the relaxed load, but normal compilers won't do that at compile-time for no reason. (A DeathStation 9000 C++ implementation could/would). So you've finally found a case where it would be useful to test on a different ISA. The ways in which an atomic RMW (or even parts of it) can reorder at run-time depend a lot on the ISA.

An LL/SC machine that needs a retry loop to implement fetch_add (for example ARM, or AArch64 before ARMv8.1) may be able to truly implement a relaxed RMW that can reorder at run-time because anything stronger than relaxed would require barriers. (Or acquire / release versions of the instructions like AArch64 ldaxr / stlxr vs. ldxr/stxr). So if there's an asm difference between relaxed and acq and/or rel (and sometimes seq_cst is also different), it's likely that difference is necessary and preventing some run-time reordering.

Even a single-instruction atomic operation might be able to be truly relaxed on AArch64; I haven't investigated. Most weakly-ordered ISAs have traditionally used LL/SC atomics, so I might just be conflating those.

In an LL/SC machine, the store side of an LL/SC RMW can even reorder with later loads separately from the load, unless they're both seq_cst. For purposes of ordering, is atomic read-modify-write one operation or two?

To actually see 00, both loads would have to happen before the store part of the RMW was visible in the other thread. And yes, the HW reordering mechanism in an LL/SC machine would I think be pretty similar to reordering a plain store.

Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?

Tags:

c++

cpu-architecture

multithreading

stdatomic

memory-barriers

nvn

1 Answers

Peter Cordes

Recent Activity

Donate For Us

Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?

Tags:

c++

cpu-architecture

multithreading

stdatomic

memory-barriers

nvn

1 Answers

Peter Cordes

Related questions

Recent Activity

Donate For Us