Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

Similar to my previous question, consider this code

-- Initially --
std::atomic<int> x{0};
std::atomic<int> y{0};

-- Thread 1 --
x.store(1, std::memory_order_release);

-- Thread 2 --
y.store(2, std::memory_order_release);

-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);   // x first
int r2 = y.load(std::memory_order_acquire);

-- Thread 4 --
int r3 = y.load(std::memory_order_acquire);   // y first
int r4 = x.load(std::memory_order_acquire);

Is the weird outcome r1==1, r2==0 and r3==2, r4==0 possible in this case under the C++11 memory model? What if I were to replace all std::memory_order_acq_rel by std::memory_order_relaxed?

On x86 such an outcome seems to be forbidden, see this SO question but I am asking about the C++11 memory-model in general.

Bonus question:

We all agree, that with std::memory_order_seq_cst the weird outcome would not be allowed in C++11. Now, Herb Sutter said in his famous atomic<>-weapons talk @ 42:30 that std::memory_order_seq_cst is just like std::memory_order_acq_rel but std::memory_order_acquire-loads may not move before std::memory_order_release-writes. I cannot see how this additional constraint in the above example would prevent the weird outcome. Can anyone explain?

like image 561
Toby Brull Avatar asked Jan 06 '15 21:01

Toby Brull


4 Answers

This kind of reordering test is called IRIW (Independent Readers, Independent Writers), where we're checking if two readers can see the same pair of stores appear in different orders. Related, maybe a duplicate: Acquire/release semantics with 4 threads


The very weak C++11 memory model does not require that all threads agree on a global order for stores, as @MWid's answer says.

This answer will explain one possible hardware mechanism that can lead to threads disagreeing about the global order of stores, which may be relevant when setting up tests for lockless code. And just because it's interesting if you like cpu-architecture1.

See A Tutorial Introduction to the ARM and POWER Relaxed Memory Models for an abstract model of what those ISAs: Neither ARM nor POWER guarantee of a consistent global store order seen by all threads. Actually observing this is possible in practice on POWER chips, and maybe possible in theory on ARM but maybe not on any actual implementations.

(Other weakly-ordered ISAs like Alpha also allow this reordering, I think. ARM used to allow it on-paper, but probably no real implementations did this reordering. ARMv8 even strengthened their on-paper model to disallow this even for future hardware.)

In computer science, the term for a machine where stores become visible to all other threads at the same time (and thus there is a single global order of stores) is "multiple-copy atomic" or "multi-copy atomic". x86 and SPARC's TSO memory models have that property, but ARM and POWER don't require it.


Current SMP machines use MESI to maintain a single coherent cache domain so that all cores have the same view of memory. Stores become globally visible when they commit from the store buffer into L1d cache. At that point a load from any other core will see that store. There is a single order of all stores committing to cache, because MESI maintains a single coherency domain. With sufficient barriers to stop local reordering, sequential consistency can be recovered.

A store can become visible to some but not all other cores before it becomes globally visible.

POWER CPUs use Simultaneous MultiThreading (SMT) (the generic term for hyperthreading) to run multiple logical cores on one physical core. The memory-ordering rules we care about are for logical cores that threads run on, not physical cores.

We normally think of loads as taking their value from L1d, but that's not the case when reloading a recent store from the same core and data is forwarded directly from the store buffer. (Store-to-load forwarding, or SLF). It's even possible for a load to get a value that was never present in L1d and never will be, even on strongly-ordered x86, with partial SLF. (See my answer on Globally Invisible load instructions).

The store buffer tracks speculative stores before the store instruction has retired, but also buffers non-speculative stores after they retire from the out-of-order-execution part of the core (the ROB / ReOrder Buffer).

The logical cores on the same physical core share a store buffer. Speculative (not-yet-retired) stores must stay private to each logical core. (Otherwise that would couple their speculation together and require both to roll-back if a mis-speculation were detected. That would defeat part of the purpose of SMT, of keeping the core busy while one thread is stalled or recovering from a branch mispredict).

But we can let other logical cores snoop the store buffer for non-speculative stores that will definitely commit to L1d cache eventually. Until they do, threads on other physical cores can't see them, but logical cores sharing the same physical core can.

(I'm not sure this is exactly the HW mechanism that allows this weirdness on POWER, but it's plausible).

This mechanism makes stores visible to SMT sibling cores before they're globally visible to all cores. But it's still local within the core, so this reordering can be cheaply avoided with barriers that just affect the store buffer, without actually forcing any cache interactions between cores.

(The abstract memory model proposed in the ARM/POWER paper models this as each core having its own cached view of memory, with links between caches that let them sync. But in typical physical modern hardware, I think the only mechanism is between SMT siblings, not between separate cores.)


Note that x86 can't allow other logical cores to snoop the store buffer at all because that would violate x86's TSO memory model (by allowing this weird reordering). As my answer on What will be used for data exchange between threads are executing on one Core with HT? explains, Intel CPUs with SMT (which Intel calls Hyperthreading) statically partition the store buffer between logical cores.


Footnote 1: An abstract model for C++, or for asm on a particular ISA, is all you really need to know to reason about memory ordering.

Understanding the hardware details isn't necessary (and can lead you into a trap of thinking something's impossible just because you can't imagine a mechanism for it).

like image 138
Peter Cordes Avatar answered Oct 05 '22 13:10

Peter Cordes


The updated1 code in the question (with loads of x and y swapped in Thread 4) does actually test that all threads agree on a global store order.

Under the C++11 memory model, the outcome r1==1, r2==0, r3==2, r4==0 is allowed and in fact observable on POWER.

On x86 this outcome is not possible, because there "stores are seen in a consistent order by other processors". This outcome is also not allowed in a sequential consistent execution.


Footnote 1: The question originally had both readers read x then y. A sequentially consistent execution of that is:

-- Initially --
std::atomic<int> x{0};
std::atomic<int> y{0};

-- Thread 4 --
int r3 = x.load(std::memory_order_acquire);

-- Thread 1 --
x.store(1, std::memory_order_release);

-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);
int r2 = y.load(std::memory_order_acquire);

-- Thread 2 --
y.store(2, std::memory_order_release);

-- Thread 4 --
int r4 = y.load(std::memory_order_acquire);

This results in r1==1, r2==0, r3==0, r4==2. Hence, this is not a weird outcome at all.

To be able to say that each reader saw a different store order, we need them to read in opposite orders to rule out the last store simply being delayed.

like image 32
MWid Avatar answered Oct 01 '22 13:10

MWid


The short answer is no. The standard doesn't say they must be, and therefore they don't have to be. It doesn't matter whether you can or can't imagine a specific way for this to happen.

like image 38
David Schwartz Avatar answered Oct 04 '22 13:10

David Schwartz


Is the weird outcome r1==1, r2==0 and r3==0, r4==2 possible in this case under the C++11 memory model?

Yes. C++ memory model allows such weird outcome.

What if I were to replace all std::memory_order_acq_rel by std::memory_order_relaxed?

If you replace all memory_order_acquire and memory_order_release by memory_order_relaxed, nothing changed for your code.

std::memory_order_seq_cst is just like std::memory_order_acq_rel but std::memory_order_acquire-loads may not move before std::memory_order_release-writes. I cannot see how this additional constraint in the above example would prevent the weird outcome.

"acquire-loads may not move before release-writes." shows one aspect of constraints of sequential consistency (memory_order_seq_cst).

In C++ memory model, it only guarantees that seq_cst has acq_rel semantics and all seq_cst atomic access has some "total order" no more and no less. When such "total order" is exist, we can't get weird outcome because all seq_cst atomic access are executed as if in any interleaved order on single thread.

Your previous question treats "coherency" of single atomic variable, and this question asks "consistency" of all atomic variables. C++ memory model guarantees intuitive coherency for single atomic variable even weakest ordering (relaxed), and "sequential consistency" for different atomic variables as long as default ordering (seq_cst). When you use explicitly non-seq_cst ordering atomic access, it's may be weird outcome as you pointed out.

like image 20
yohjp Avatar answered Oct 01 '22 13:10

yohjp