What does memory_order_consume really do?

From the linked question: What is the difference between load/store relaxed atomic and normal variable?

I was deeply impressed by this answer:

Using an atomic variable solves the problem - by using atomics, all threads are guaranteed to read the latest written value, even if the memory order is relaxed.

Today, I read the link below: https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

atomic<int*> Guard(nullptr);
int Payload = 0;

thread1:

    Payload = 42;
    Guard.store(&Payload, memory_order_release);

thread2:

    int* g = Guard.load(memory_order_consume);
    int p = 0;
    if (g != nullptr)
        p = *g;


QUESTION: I learned that a data dependency prevents the related instructions from being reordered. But I think that is obviously required for the correctness of the execution result, whether or not consume/release semantics exist. So I wonder what consume/release really does. Maybe it uses data dependencies to prevent reordering of instructions while also ensuring the visibility of Payload?

So:

Is it possible to get the same correct result using memory_order_relaxed, if I (1) prevent the instructions from being reordered and (2) ensure the visibility of the non-atomic variable Payload:

atomic<volatile int*> Guard(nullptr);  // note: element type adjusted so &Payload (a volatile int*) converts
volatile int Payload = 0;              // 1. Payload is volatile now

// 2. The Payload assignment and Guard.store stay in order because of the data dependency
Payload = 42;
Guard.store(&Payload, memory_order_release);

// 3. The data dependency keeps the write/read of g and p in order
volatile int* g = Guard.load(memory_order_relaxed);
int p = 0;
if (g != nullptr)
    p = *g;      // 4. Given 1, 2, 3 there is no reordering, and here the volatile Payload makes the value 42 visible.

Additional content (because of Sneftel's answer):

1. Payload = 42; volatile forces the write/read of Payload to go to/from main memory rather than the cache, so 42 will be written to memory.

2. Guard.store(&Payload, any memory-order flag usable for writing); Guard is non-volatile, as you said, but it is atomic.

Using an atomic variable solves the problem - by using atomics, all threads are guaranteed to read the latest written value, even if the memory order is relaxed.

In fact, atomics are always thread-safe, regardless of the memory order! The memory order is not for the atomics; it's for non-atomic data.

So after Guard.store is performed, Guard.load (with any memory-order flag usable for reading) can get the address of Payload correctly, and then get the 42 from memory correctly.

For the code above:

1. no reordering, thanks to the data dependency;

2. no cache effect, because Payload is volatile;

3. no thread-safety problem, because Guard is atomic.

Can I get the correct value, 42?

Back to the main question

When you use consume semantics, you’re basically trying to make the compiler exploit data dependencies on all those processor families. That’s why, in general, it’s not enough to simply change memory_order_acquire to memory_order_consume. You must also make sure there are data dependency chains at the C++ source code level.


" You must also make sure there are data dependency chains at the C++ source code level."

I think the data dependency chains at the C++ source code level prevent the instructions from being reordered naturally. So what does memory_order_consume really do?

And can I use memory_order_relaxed to achieve the same result as the code above?

Additional content end

asked Dec 17 '20 by breaker00


2 Answers

First of all, memory_order_consume is temporarily discouraged by the ISO C++ committee until they come up with something compilers can actually implement. For a few years now, compilers have treated consume as a synonym for acquire. See the section at the bottom of this answer.

Hardware still provides the data dependency, so it's interesting to talk about that, despite there currently being no safely portable ISO C++ way to take advantage of it. (Only hacks with mo_relaxed or hand-rolled atomics, and careful coding based on an understanding of compiler optimizations and asm, kind of like you're trying to do with relaxed. But you don't need volatile.)

Oh, maybe it uses data dependencies to prevent reordering of instructions while ensuring the visibility of Payload?

Not exactly "reordering of instructions", but memory reordering. As you say, sanity and causality are enough in this case if the hardware provides dependency ordering. C++ is portable to machines that don't (e.g. DEC Alpha).

The normal way to get visibility for Payload is via a release-store in the writer and an acquire-load in the reader that sees the value from that release-store. https://preshing.com/20120913/acquire-and-release-semantics/. (So of course repeatedly storing the same value to a "ready_flag" or pointer doesn't let the reader figure out whether it's seeing a new or old store.)

Release / acquire creates a happens-before synchronization relationship between the threads, which guarantees visibility of everything the writer did before the release-store. (consume doesn't, that's why only the dependent loads are ordered.)

(consume is an optimization on this: avoiding a memory barrier in the reader by letting the compiler take advantage of hardware guarantees as long as you follow some dependency rules.)
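
To make that concrete, here's a minimal sketch of the normal release/acquire pattern, using the question's Guard and Payload names (not the question's exact code, just an illustration of the pattern):

    #include <atomic>

    std::atomic<int*> Guard{nullptr};
    int Payload = 0;

    void writer() {
        Payload = 42;                                      // plain non-atomic write...
        Guard.store(&Payload, std::memory_order_release);  // ...published by the release-store
    }

    void reader() {
        int* g = Guard.load(std::memory_order_acquire);    // syncs with the release-store
        if (g != nullptr) {
            int p = *g;  // guaranteed to see 42: release/acquire covers ALL earlier writes
            (void)p;
        }
    }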


You have some misconceptions about what CPU cache is, and about what volatile does, which I commented about under the question. A release-store makes sure earlier non-atomic assignments are visible in memory.

(Also, cache is coherent; it provides all CPUs with a shared view of memory that they can agree on. Registers are thread-private and not coherent; that's what people mean when they say a value is "cached". Registers are not CPU cache, but software can use them to hold a copy of something from memory. When to use volatile with multi threading? Never, but it does have some effects on real CPUs because they have coherent cache. It's a bad way to roll your own mo_relaxed. See also https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/)

In practice on real CPUs, memory reordering happens locally within each core; cache itself is coherent and never gets "out of sync". (Other copies are invalidated before a store can become globally visible.) So release just has to make sure the local CPU's stores become globally visible (commit to L1d cache) in the right order. ISO C++ doesn't specify any of that level of detail, and an implementation that worked very differently is hypothetically possible.

Making the writer's store volatile is irrelevant in practice because a non-atomic assignment followed by a release-store already has to make everything visible to other threads that might do an acquire-load and sync with that release store. It's irrelevant on paper in pure ISO C++ because it doesn't avoid data-race UB.

(Of course, it's theoretically possible for whole-program optimization to see that there are no acquire or consume loads that would ever load this store, and optimize away the release property. But compilers currently don't optimize atomics in general even locally, and never try to do that kind of whole-program analysis. So code-gen for writer functions will assume that there might be a reader that syncs with any given store of release or seq_cst ordering.)


What does memory_order_consume really do?

One thing mo_consume does is make sure the compiler uses a barrier instruction on implementations where the underlying hardware doesn't provide dependency ordering naturally / for free. In practice that means only DEC Alpha. See Dependent loads reordering in CPU and Memory order consume usage in C11.

Your question is a near duplicate of C++11: the difference between memory_order_relaxed and memory_order_consume - see the answers there for the body of your question about misguided attempts to do stuff with volatile and relaxed. (I'm mostly answering because of the title question.)

It also ensures that the compiler uses a barrier at some point before execution passes into code that doesn't know about the data dependency this value carries. (i.e. no [[carries_dependency]] tag on the function arg in the declaration). Such code might replace x-x with a constant 0 and optimize away, losing the data dependency. But code that knows about the dependency would have to use something like a sub r1, r1, r1 instruction to get a zero with a data dependency.

That can't happen for your use-case (where relaxed will work in practice on ISAs other than Alpha), but the on-paper design of mo_consume allowed all kinds of stuff that would require different code-gen from what compilers would normally do. This is part of what made it so hard to implement efficiently that compilers just promote it to mo_acquire.

The other part of the problem is that it requires code to be littered with kill_dependency and/or [[carries_dependency]] all over the place, or you'll end up with a barrier at function boundaries anyway. These problems led the ISO C++ committee to temporarily discourage consume.

  • C++11: the difference between memory_order_relaxed and memory_order_consume
  • P0371R1: Temporarily discourage memory_order_consume, plus other C++ WG21 documents linked from it about why consume is discouraged.
  • Memory order consume usage in C11 - more about the hardware mechanism / guarantee that consume is intended to expose to software. Out-of-order exec can only reorder independent work anyway, not start a load before the load address is known, so on most CPUs enforcing dependency ordering happens for free anyway: only a few models of DEC Alpha could violate causality and effectively load data from before it had the pointer that gave it the address.
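
For illustration, a rough sketch of how kill_dependency and [[carries_dependency]] were meant to be used. This is purely historical, since current compilers promote consume to acquire, and the function name read_payload is made up for the example:

    #include <atomic>

    std::atomic<int*> Guard{nullptr};

    // [[carries_dependency]] says the dependency chain continues through
    // this parameter, so the compiler needn't emit a barrier at the call.
    int read_payload(int* p [[carries_dependency]]) {
        return *p;           // dependent load, ordered after the consume load
    }

    int reader() {
        int* g = Guard.load(std::memory_order_consume);
        if (g == nullptr) return -1;
        int v = read_payload(g);
        // kill_dependency explicitly ends the chain: from here on the
        // compiler may treat the value as carrying no dependency.
        return std::kill_dependency(v);
    }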

And BTW:

The example code is safe with release + consume regardless of volatile. It's safe on most compilers and most ISAs in practice with a release store + relaxed load, although of course ISO C++ has nothing to say about the correctness of that code. But with the current state of compilers, that's a hack that some code relies on (like the Linux kernel's RCU).

If you need that level of read-side scaling, you'll have to work outside of what ISO C++ guarantees. That means your code will have to make assumptions about how compilers work (and that you're running on a "normal" ISA that isn't DEC Alpha), which means you need to support some set of compilers (and maybe ISAs, although there aren't many multi-core ISAs around). The Linux kernel only cares about a few compilers (mostly recent GCC, also clang I think), and the ISAs that they have kernel code for.
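
A sketch of that out-of-spec pattern, assuming a non-Alpha ISA and a compiler that doesn't break the dependency (it mirrors the question's code, minus the volatile):

    #include <atomic>

    std::atomic<int*> Guard{nullptr};
    int Payload = 0;    // plain int; volatile is not needed

    void writer() {
        Payload = 42;
        Guard.store(&Payload, std::memory_order_release);  // release is still needed on the write side
    }

    int reader() {
        // Out-of-spec hack: relaxed gives no ordering guarantee in ISO C++,
        // but *g depends on g, and hardware dependency ordering (everywhere
        // except DEC Alpha) makes this work in practice, as long as the
        // compiler doesn't break the dependency chain.
        int* g = Guard.load(std::memory_order_relaxed);
        return g ? *g : 0;
    }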

answered Oct 21 '22 by Peter Cordes


  1. volatile has nothing to do with multithreading in C/C++; its visibility side effect applies only within a single-threaded program, and it is usually used only to tell the compiler not to optimize out accesses to the value. It is DIFFERENT from Java/C#.

  2. release/consume is all about data dependency, and it may build a dependency chain (which can be broken by kill_dependency to avoid unnecessary barriers later).

  3. release/acquire forms a pair-wise synchronizes-with / inter-thread happens-before relationship.

For your case, release/acquire would form the expected happens-before relationship. release/consume will also work because *g is dependent on g.

But note that with current compilers, consume is treated as a synonym for acquire, because it proved too hard to implement efficiently; see the other answer.
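
To make points 2 and 3 concrete, a small sketch of the two reader variants (Guard as in the question; remember that current compilers compile consume as acquire):

    #include <atomic>

    extern std::atomic<int*> Guard;   // as in the question

    int reader_acquire() {
        int* g = Guard.load(std::memory_order_acquire);  // orders ALL later loads/stores
        return g ? *g : 0;
    }

    int reader_consume() {
        int* g = Guard.load(std::memory_order_consume);  // orders only loads that depend on g
        return g ? *g : 0;   // *g depends on g, so it sees the published 42
    }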

answered Oct 21 '22 by Harold