I am now learning C++11 memory order model and would like to understand the difference between memory_order_relaxed and memory_order_consume.
To be specific, I am looking for a simple example where one cannot replace memory_order_consume with memory_order_relaxed.
There is an excellent post which elaborates on a simple yet very illustrative example where memory_order_consume can be applied. Below is literal copy-paste.
Example:
atomic<int*> Guard(nullptr);
int Payload = 0;
Producer:
Payload = 42;
Guard.store(&Payload, memory_order_release);
Consumer:
g = Guard.load(memory_order_consume);
if (g != nullptr)
    p = *g;
My question consists of two parts:
memory_order_consume with memory_order_relaxed in the example above?memory_order_consume cannot be replaced with memory_order_relaxed?No.memory_order_relaxed imposes no memory order at all:
Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation.
While memory_order_consume imposes memory ordering on data dependent reads (on the current thread)
A load operation with this memory order performs a consume operation on the affected memory location: no reads in the current thread dependent on the value currently loaded can be reordered before this load.
Edit
In general memory_order_seq_cst is stronger memory_order_acq_rel is stronger memory_ordering_relaxed.
This is like having a Elevator A that can lift 800 Kg Elevator C that lifts 100Kg.
Now if you had the power to magically change Elevator A into Elevator C, what would happen if the former was filled with 10 average-weighting people?
That would be bad.
To see what could go wrong with the code exactly, consider the example on your question:
Thread A                                   Thread B
Payload = 42;                              g = Guard.load(memory_order_consume);
Guard.store(1, memory_order_release);      if (g != 0)
                                               p = Payload;
This snippet are intended to be looped, there is no synchronization, only ordering, between the two threads.
With memory_order_relaxed, and assuming that a natural word load/store is atomic, the code would be equivalent to
Thread A                                   Thread B
Payload = 42;                              g = Guard
Guard = 1                                  if (g != 0)
                                               p = Payload;
From a CPU point of view on Thread A there are two stores to two separate addresses, so if Guard is "closer" to the CPU (meaning the store will complete faster) from another processor it seems that Thread A is perfoming
Thread A
Guard = 1
Payload = 42
And this order of execution is possible
Thread A   Guard = 1
Thread B   g = Guard
Thread B   if (g != nullptr) p = Payload
Thread A   Payload = 42
And that's bad, since Thread B read a non updated value of Payload.
It could seems however that in Thread B the synchronization would be useless since the CPU won't do a reorder like
Thread B
if (g != 0) p = Payload;
g = Guard
But it actually will.
From its perspective there are two unrelated load, it is true that one is on a dependent data path but the CPU can still speculatively do the load:
Thread B
hidden_tmp = Payload;
g = Guard
if (g != 0) p = hidden_tmp
That may generate the sequence
Thread B   hidden_tmp = Payload;
Thread A   Payload = 42;
Thread A   Guard = 1;
Thread B   g = Guard
Thread B   if (g != 0) p = hidden_tmp
Whoops.
In general that can never be done.
You can replace memory_order_acquire with memory_order_consume when you are going to generate an address dependency between the loaded value and the value(s) whose access need to be ordered.
To understand memory_order_relaxed we can take the ARM architecture as a reference.
The ARM Architecture mandates only a weak memory ordering meaning that in general the loads and stores of a program can be executed in any order.  
str r0, [r2]
str r0, [r3]
In the snippet above the store to [r3] can be observed, externally, before the store to [r2]1.
However the CPU doesn't go as far as the Alpha CPU and imposes two kinds of dependencies: address dependency, when a value load from memory is used to compute the address of another load/store, and control dependency, when a value load from memory is used to compute the control flags of another load/store.
In the presence of such dependency the ordering of two memory operations is guaranteed to be visible in program order:
If there is an address dependency then the two memory accesses are observed in program order.
So, while a memory_order_acquire would generate a memory barrier, with memory_order_consume you are telling the compiler that the way you'll use the loaded value will generate an address dependency and so it can, if relevant to the architecture, exploit this fact and omit a memory barrier.
1 If r2 is the address of a synchronization object, that's bad.
Can one replace
memory_order_consumewithmemory_order_relaxedin the example above?
Safely in ISO C++: no.
In practice on most implementations for most ISAs, often yes.  It will normally compile to asm with a data dependency between the first load result and the 2nd load's address, and most ISAs do guarantee that ordering.  (This is the HW feature consume was intended to expose).
But since C++11's design for consume was impractical for compilers to implement, they all just gave up and strengthen it to acquire, requiring a memory barrier on most weakly-ordered ISAs.  (e.g. POWER or ARM, but not x86).
So in real life, to get that juicy performance for reading things that nearly never change, some real code (like RCU) does actually use relaxed carefully, in ways that we hope won't get optimized into something unsafe.  See Paul E. McKenney's CppCon 2016 talk: C++ Atomics: The Sad Story of memory_order_consume: A Happy Ending At Last? about how Linux uses this to make the read of RCU side very very cheap, with no barriers.  (In kernel they just use volatile instead of _Atomic with memory_order_relaxed, but those compile essentially the same for pure-load or pure-store.)
By being careful about how you use consume, and knowing how compilers normally compile code, it's possible to get known compilers like gcc and clang to fairly reliably emit safe/correct and efficient asm for known targets like x86, ARM, and POWER that are known to do dependency ordering in hardware.
(x86 does acquire in hardware for you so if you only cared about x86 you'd be gaining nothing from using relaxed over consume or acquire.)
Can one suggest a similar example where
memory_order_consumecannot be replaced withmemory_order_relaxed?
DEC Alpha AXP doesn't guarantee dependency ordering in HW, and a few Alpha microarchitectures really could violate causality by loading a *g value older than g.  See Dependent loads reordering in CPU and also Memory order consume usage in C11 for a quote from Linus Torvalds about how only a few Alpha machines could actually do this.
Or for any ISA, it can break at compile time if the compiler breaks the data dependency with a control dependency.  e.g. if the compiler has some reason to think that g will have a certain value, it's allowed to transform to p = *g into code like
    if (g == expected_address)
        p = *expected_address;
    else
        p = *g;
Real CPUs use branch prediction so instructions after a branch can execute even if the g.load() hasn't finished yet.  So p = *expected_address can execute with no data dependency on g.
Weakly-ordered ISAs that do document their dependency ordering guarantees (POWER, ARM, etc.) don't guarantee it across branches, only true data dependencies.  (It would be fine if both sides of the branch used *g.)
This might not be something compilers are likely to do, but C++ consume guarantees that even array[foo.load(consume) & 1] is dependency-ordered after the load.  With only 2 possible values, it's more plausible that the compiler would branch.
(Or in your example, if atomic<int*> Guard(nullptr); is static and its address doesn't escape the compilation unit, then the compiler might see that the only 2 values it can ever have are nullptr or &Payload, and thus if it's non-null then it must be Payload.  So yes this optimization actually is plausible in your case, for mo_relaxed.  I think current gcc / clang probably won't ever make any assumptions about a value loaded from an atomic (like they treat volatile) so you're probably safe in practice.  This might change once C++ gets a way to make it safe for compilers to optimize atomics.  Can and does the compiler optimize out two atomic loads?)
In fact, ISO C++ consume even guarantees dependency ordering for int dep = foo.load(consume); dep -= dep;  p = array[dep];  You can use this to get dependency ordering after branching on a flag, for example, even after reducing the dependency to a value that's known at compile time1.  In this case zero.
But compilers look for cases where a variable is reduced to only 1 possible value, and will turn that p = array[dep] into p = array[0], removing the dependency on the load.  (This is the kind of dependency tracking to figure out when it was or wasn't safe to do normal optimizations that made consume nearly impossible to implement safely without gimping the compiler everywhere.  The carries_dependency and kill_dependency stuff might have limited this to function boundaries, but it still ended up being too hard.)
Footnote 1: This is why ISAs like ARM aren't even allowed to special case eor r0, r0 as a dependency-breaking zeroing idiom the way x86 does for xor eax,eax.  The asm rules do guarantee it's safe to do something like this in asm.  (And fixed-instruction-width ISAs have no use for xor-zeroing anyway; mov r0, #0 is the same size.)  The problem is getting compilers to emit asm with a dependency that's only required by consume, without doing any of their usual transformations that avoid data dependencies and create instruction-level parallelism for out-of-order execution to find and exploit.
See also P0371R1: Temporarily discourage memory_order_consume and other C++ wg21 documents linked from that about why consume is discouraged.
The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the
kill_dependencycall, and from the frequent need for[[carries_dependency]]annotations. Details can be found in e.g. P0098R0.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With