
How do memory_order_seq_cst and memory_order_acq_rel differ?

Stores are release operations and loads are acquire operations for both. I know that memory_order_seq_cst is meant to impose an additional total ordering on all operations, but I'm failing to build an example whose behavior changes when every memory_order_seq_cst is replaced by memory_order_acq_rel.

Am I missing something, or is the difference just a matter of documentation, i.e. one should use memory_order_seq_cst if one does not intend to play with a more relaxed model, and memory_order_acq_rel when constraining the relaxed model?

AProgrammer asked Sep 09 '12 16:09



2 Answers

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

    std::atomic<bool> x{false};
    std::atomic<bool> y{false};
    std::atomic<int> z{0};

    void a() { x = true; }
    void b() { y = true; }
    void c() { while (!x); if (y) z++; }
    void d() { while (!y); if (x) z++; }

    // kick off a, b, c, d, join all threads
    assert(z != 0);

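Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented: with only acquire-release ordering, c may see x set before y while d sees y set before x, so both if conditions can fail.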
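For anyone who wants to run this, here is a minimal sketch of the same four-thread example with the memory order spelled out on every operation. The function name run_four_threads is made up for illustration; the rest follows the example above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the four-thread example once and returns z.
int run_four_threads() {
    std::atomic<bool> x{false};
    std::atomic<bool> y{false};
    std::atomic<int> z{0};

    // Threads a and b: independent writers.
    std::thread a([&] { x.store(true, std::memory_order_seq_cst); });
    std::thread b([&] { y.store(true, std::memory_order_seq_cst); });

    // Thread c: wait for x, then check y.
    std::thread c([&] {
        while (!x.load(std::memory_order_seq_cst)) {}
        if (y.load(std::memory_order_seq_cst)) ++z;
    });
    // Thread d: wait for y, then check x.
    std::thread d([&] {
        while (!y.load(std::memory_order_seq_cst)) {}
        if (x.load(std::memory_order_seq_cst)) ++z;
    });

    a.join();
    b.join();
    c.join();
    d.join();

    // Same check as in the example: guaranteed to hold because all
    // operations on x and y are seq_cst and share one total order.
    int result = z.load();
    assert(result != 0);
    return result;
}
```

Changing the stores to memory_order_release and the loads to memory_order_acquire makes z == 0 a permitted outcome under the C++ memory model. Do not expect to see it on x86, though, since that hardware does not let two readers disagree about the order of two independent stores.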

MSN answered Oct 06 '22 07:10



On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:

  • seq_cst stores require flushing the store buffer so this thread's later reads are delayed until after the store is globally visible.

  • acquire loads and release stores do not have to flush the store buffer. Normal x86 loads and stores have essentially acq and rel semantics. (The x86 memory model is essentially seq_cst plus a store buffer with store forwarding.)

    But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores. (But not in ways that would make the RMW appear non-atomic; see the question "For purposes of ordering, is atomic read-modify-write one operation or two?")
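To make the two bullets concrete, here is a small sketch of the source-level operations being compared. The variable and function names (ready_flag, hits, publish_release and so on) are hypothetical, and the instruction names in the comments are what x86-64 compilers typically emit, not something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> ready_flag{0};  // hypothetical shared variables
std::atomic<int> hits{0};

// Release store: typically a plain mov on x86. The store may sit in the
// store buffer while this thread's later loads execute.
void publish_release(int v) { ready_flag.store(v, std::memory_order_release); }

// seq_cst store: typically xchg, or mov followed by mfence. Later loads
// wait until the store is globally visible.
void publish_seq_cst(int v) { ready_flag.store(v, std::memory_order_seq_cst); }

// Atomic RMWs: on x86 both typically become a lock-prefixed instruction
// (e.g. lock xadd), so acq_rel is no cheaper than seq_cst there.
int bump_acq_rel() { return hits.fetch_add(1, std::memory_order_acq_rel); }
int bump_seq_cst() { return hits.fetch_add(1, std::memory_order_seq_cst); }

// Single-threaded usage check: the ordering argument never changes the
// value written, only what other threads may observe around it.
void usage_check() {
    publish_release(1);
    assert(ready_flag.load() == 1);
    publish_seq_cst(2);
    assert(ready_flag.load() == 2);
    int before = hits.load();
    assert(bump_acq_rel() == before);
    assert(bump_seq_cst() == before + 1);
}
```

Compiling this with optimization and reading the generated asm is the easiest way to check what a particular compiler version actually does.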


https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov+mfence)
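A rough C++ rendering of that store-buffering test is sketched below; it is an illustration written for this answer, not the code from the linked article. The names both_loads_saw_zero, count_both_zero and check_seq_cst are hypothetical. Each thread stores to its own variable and then loads the other one. With seq_cst on every operation, the outcome r1 == 0 && r2 == 0 is forbidden.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of the test. Returns true if both threads read 0, i.e. each
// load ran before the other thread's store became visible.
bool both_loads_saw_zero(std::memory_order store_order,
                         std::memory_order load_order) {
    std::atomic<int> X{0};
    std::atomic<int> Y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        X.store(1, store_order);
        r1 = Y.load(load_order);
    });
    std::thread t2([&] {
        Y.store(1, store_order);
        r2 = X.load(load_order);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Counts how many of `rounds` runs ended with both loads reading 0.
int count_both_zero(int rounds, std::memory_order store_order,
                    std::memory_order load_order) {
    int count = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_loads_saw_zero(store_order, load_order)) ++count;
    }
    return count;
}

// With seq_cst stores and loads the count must be exactly zero.
void check_seq_cst(int rounds) {
    assert(count_both_zero(rounds, std::memory_order_seq_cst,
                           std::memory_order_seq_cst) == 0);
}
```

Passing memory_order_release and memory_order_acquire instead makes a nonzero count legal. This naive harness starts fresh threads every round, so the two threads rarely overlap tightly enough to show the reordering; a harness with long-lived threads and better-aligned start times would be more likely to catch it.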


Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR).

With STLR / LDAR you get sequential consistency, but only having to drain the store buffer before the next LDAR, not right away after each seq_cst store before other operations. I think real AArch64 HW does implement it this way, rather than simply draining the store buffer before committing an STLR.

Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.
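As a sketch of the mapping described above (names such as a64_x and store_x_then_load_y are hypothetical, and the instruction choices in the comments are the ones named in this answer, which a given compiler or target option set may not follow exactly):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a64_x{0};  // hypothetical shared variables
std::atomic<int> a64_y{0};

// Acquire load: LDAR, or LDAPR when the ARMv8.3 instruction is available.
int load_acquire() { return a64_x.load(std::memory_order_acquire); }

// seq_cst load: LDAR, which has the special interaction with STLR.
int load_seq_cst() { return a64_x.load(std::memory_order_seq_cst); }

// Release and seq_cst stores: both STLR.
void store_release(int v) { a64_x.store(v, std::memory_order_release); }
void store_seq_cst(int v) { a64_x.store(v, std::memory_order_seq_cst); }

// The expensive pattern: a seq_cst store followed by a seq_cst load of a
// different variable. The LDAR has to wait for the earlier STLR to become
// visible, so the store buffer is drained at this point.
int store_x_then_load_y(int v) {
    a64_x.store(v, std::memory_order_seq_cst);
    return a64_y.load(std::memory_order_seq_cst);
}

// Single-threaded usage check of the helpers above.
void aarch64_mapping_check() {
    store_release(5);
    assert(load_acquire() == 5);
    store_seq_cst(7);
    assert(load_seq_cst() == 7);
    assert(store_x_then_load_y(9) == 0);
    assert(load_seq_cst() == 9);
}
```

Code that only does seq_cst stores, or only seq_cst loads, in a stretch avoids the drain in the last function.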

Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64; seq-cst stores need a full barrier.

So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction after. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. basically settled on seq_cst being the default for lockless atomic operations, so making them efficient was important. And after CPU architects had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.

Peter Cordes answered Oct 06 '22 06:10
