
How to achieve a StoreLoad barrier in C++11?

I want to write portable code (Intel, ARM, PowerPC...) which solves a variant of a classic problem:

Initially: X=Y=0

Thread A:
  X=1
  if(!Y){ do something }
Thread B:
  Y=1
  if(!X){ do something }

in which the goal is to avoid a situation in which both threads are doing something. (It's fine if neither thing runs; this isn't a run-exactly-once mechanism.) Please correct me if you see some flaws in my reasoning below.

I am aware that I can achieve the goal with memory_order_seq_cst atomic stores and loads, as follows:

std::atomic<int> x{0},y{0};
void thread_a(){
  x.store(1);
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  if(!x.load()) bar();
}

which achieves the goal, because there must be some single total order on the
{x.store(1), y.store(1), y.load(), x.load()} events, which must agree with program order "edges":

  • x.store(1) "in TO is before" y.load()
  • y.store(1) "in TO is before" x.load()

and if foo() was called, then we have an additional edge:

  • y.load() "reads value before" y.store(1)

and if bar() was called, then we have an additional edge:

  • x.load() "reads value before" x.store(1)

and all these edges combined together would form a cycle:

x.store(1) "in TO is before" y.load() "reads value before" y.store(1) "in TO is before" x.load() "reads value before" x.store(1)

which violates the fact that orders have no cycles.
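The seq_cst version above can be exercised with a small harness. This is only a sketch: the function name count_both_ran_seq_cst is made up for illustration, and foo() and bar() are modelled as setting a flag. A count of zero is what the cycle argument predicts, but a passing stress test is of course not a proof.

```cpp
#include <atomic>
#include <cassert>   // only needed for the check shown in the usage note
#include <thread>

// Runs the seq_cst version of the two threads `trials` times and returns
// how many trials ended with BOTH foo() and bar() having run.
// foo()/bar() are modelled as setting a flag (an assumption for testing).
int count_both_ran_seq_cst(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> x{0}, y{0};
        bool ran_foo = false, ran_bar = false;

        std::thread a([&] {
            x.store(1);                      // seq_cst by default
            if (!y.load()) ran_foo = true;   // stands in for foo()
        });
        std::thread b([&] {
            y.store(1);
            if (!x.load()) ran_bar = true;   // stands in for bar()
        });
        a.join();
        b.join();   // join() synchronizes, so reading the flags is race-free

        if (ran_foo && ran_bar) ++both;
    }
    return both;
}
```

Usage: assert(count_both_ran_seq_cst(2000) == 0); should hold on a conforming implementation.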

I intentionally use the non-standard terms "in TO is before" and "reads value before" as opposed to standard terms like happens-before, because I want to solicit feedback about the correctness of my assumption that these edges indeed imply a happens-before relation, can be combined together in a single graph, and that a cycle in such a combined graph is forbidden. I am not sure about that. What I do know is that this code produces correct barriers with gcc and clang on Intel and with gcc on ARM.


Now, my real problem is a bit more complicated, because I have no control over "X": it's hidden behind some macros, templates etc. and might be weaker than seq_cst.

I don't even know if "X" is a single variable, or some other concept (e.g. a light-weight semaphore or mutex). All I know is that I have two macros set() and check() such that check() returns true "after" another thread has called set(). (It is also known that set and check are thread-safe and can't create data-race UB.)

So conceptually set() is somewhat like "X=1" and check() is like "X", but I have no direct access to atomics involved, if any.

void thread_a(){
  set();
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  if(!check()) bar();
}

I'm worried that set() might be internally implemented as x.store(1,std::memory_order_release) and/or check() might be x.load(std::memory_order_acquire). Or hypothetically a std::mutex that one thread is unlocking and another is try_locking; in the ISO standard std::mutex is only guaranteed to have acquire and release ordering, not seq_cst.

If this is the case, then check() and the body of its if can be "reordered" before y.store(1) (see Alex's answer, which demonstrates that this happens on PowerPC).
This would be really bad, as now this sequence of events is possible:

  • thread_b() first loads the old value of x (0)
  • thread_a() executes everything including foo()
  • thread_b() executes everything including bar()

So, both foo() and bar() got called, which I had to avoid. What are my options to prevent that?
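To make the feared scenario concrete, here is a sketch of the unprotected code with one hypothetical implementation of set() and check(): a release store and an acquire load on a hidden flag. Everything in namespace weak_demo is an assumption made for illustration; the real macros are opaque. The harness only counts how often both branches ran; nothing in ISO C++ forces that count to be zero, and seeing zero in a run proves nothing.

```cpp
#include <atomic>
#include <cassert>   // only needed for the checks shown in the usage note
#include <thread>

namespace weak_demo {
// HYPOTHETICAL stand-ins for the opaque set()/check() macros.
std::atomic<int> x{0};
void reset() { x.store(0, std::memory_order_relaxed); }
void set()   { x.store(1, std::memory_order_release); }
bool check() { return x.load(std::memory_order_acquire) != 0; }

// Returns how many of `trials` runs ended with both foo() and bar()
// having run (each modelled as setting a flag).
int count_both_ran(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        reset();
        std::atomic<int> y{0};
        bool ran_foo = false, ran_bar = false;

        std::thread a([&] {
            set();
            if (!y.load()) ran_foo = true;   // stands in for foo()
        });
        std::thread b([&] {
            y.store(1);
            if (!check()) ran_bar = true;    // stands in for bar()
        });
        a.join();
        b.join();

        if (ran_foo && ran_bar) ++both;
    }
    return both;
}
}  // namespace weak_demo
```

Usage: weak_demo::count_both_ran(500) returns a value between 0 and 500; any non-zero result is the failure described above.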


Option A

Try to force a StoreLoad barrier. In practice, this can be achieved with std::atomic_thread_fence(std::memory_order_seq_cst); as explained by Alex in a different answer, all tested compilers emitted a full fence:

  • x86_64: MFENCE
  • PowerPC: hwsync
  • Itanium: mf
  • ARMv7 / ARMv8: dmb ish
  • MIPS64: sync

The problem with this approach is that I could not find any guarantee in the C++ rules that std::atomic_thread_fence(std::memory_order_seq_cst) must translate to a full memory barrier. Actually, the concept of atomic_thread_fence in C++ seems to be at a different level of abstraction than the assembly concept of memory barriers, and deals more with things like "what atomic operation synchronizes with what". Is there any theoretical proof that the implementation below achieves the goal?

void thread_a(){
  set();
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if(!check()) bar();
}
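Option A can be tried out with a self-contained harness. This is a sketch under assumptions: set() and check() are modelled as a release store and an acquire load on a hidden flag (the real macros are unknown), foo() and bar() are modelled as flags, and all names in namespace option_a are invented for illustration. With a seq_cst fence in both threads the count is expected to be zero, though a stress test cannot stand in for a proof.

```cpp
#include <atomic>
#include <cassert>   // only needed for the check shown in the usage note
#include <thread>

namespace option_a {
// HYPOTHETICAL stand-ins for the opaque set()/check() macros.
std::atomic<int> x{0};
void reset() { x.store(0, std::memory_order_relaxed); }
void set()   { x.store(1, std::memory_order_release); }
bool check() { return x.load(std::memory_order_acquire) != 0; }

// Returns how many of `trials` runs ended with both branches taken.
int count_both_ran(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        reset();
        std::atomic<int> y{0};
        bool ran_foo = false, ran_bar = false;

        std::thread a([&] {
            set();
            std::atomic_thread_fence(std::memory_order_seq_cst);
            if (!y.load()) ran_foo = true;   // stands in for foo()
        });
        std::thread b([&] {
            y.store(1);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            if (!check()) ran_bar = true;    // stands in for bar()
        });
        a.join();
        b.join();

        if (ran_foo && ran_bar) ++both;
    }
    return both;
}
}  // namespace option_a
```

Usage: assert(option_a::count_both_ran(2000) == 0);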

Option B

Use the control we have over Y to achieve synchronization, by using read-modify-write memory_order_acq_rel operations on Y:

void thread_a(){
  set();
  if(!y.fetch_add(0,std::memory_order_acq_rel)) foo();
}
void thread_b(){
  y.exchange(1,std::memory_order_acq_rel);
  if(!check()) bar();
}

The idea here is that accesses to a single atomic (y) must form a single order on which all observers agree, so either fetch_add is before exchange or vice versa.

If fetch_add is before exchange then the "release" part of fetch_add synchronizes with the "acquire" part of exchange and thus all side effects of set() have to be visible to code executing check(), so bar() will not be called.

Otherwise, exchange is before fetch_add, so the fetch_add will see 1 and not call foo(). So it is impossible to call both foo() and bar(). Is this reasoning correct?
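The Option B reasoning can be exercised the same way. Again a sketch under assumptions: set() and check() are modelled as a release store and an acquire load on a hidden flag, foo() and bar() as flags, and the names in namespace option_b are invented for illustration. Whichever read-modify-write comes first in the modification order of y, at most one branch should run, so the count is expected to be zero.

```cpp
#include <atomic>
#include <cassert>   // only needed for the check shown in the usage note
#include <thread>

namespace option_b {
// HYPOTHETICAL stand-ins for the opaque set()/check() macros.
std::atomic<int> x{0};
void reset() { x.store(0, std::memory_order_relaxed); }
void set()   { x.store(1, std::memory_order_release); }
bool check() { return x.load(std::memory_order_acquire) != 0; }

// Returns how many of `trials` runs ended with both branches taken.
int count_both_ran(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        reset();
        std::atomic<int> y{0};
        bool ran_foo = false, ran_bar = false;

        std::thread a([&] {
            set();
            // RMW that leaves y unchanged but takes part in its modification order
            if (!y.fetch_add(0, std::memory_order_acq_rel)) ran_foo = true;
        });
        std::thread b([&] {
            y.exchange(1, std::memory_order_acq_rel);
            if (!check()) ran_bar = true;
        });
        a.join();
        b.join();

        if (ran_foo && ran_bar) ++both;
    }
    return both;
}
}  // namespace option_b
```

Usage: assert(option_b::count_both_ran(2000) == 0);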


Option C

Use dummy atomics to introduce "edges" which prevent disaster. Consider the following approach:

void thread_a(){
  std::atomic<int> dummy1{};
  set();
  dummy1.store(13);
  if(!y.load()) foo();
}
void thread_b(){
  std::atomic<int> dummy2{};
  y.store(1);
  dummy2.load();
  if(!check()) bar();
}

If you think the problem here is that the atomics are local, then imagine moving them to global scope; in the following reasoning it does not appear to matter to me. I intentionally wrote the code this way to expose how funny it is that dummy1 and dummy2 are completely separate.

Why on Earth might this work? Well, there must be some single total order of {dummy1.store(13), y.load(), y.store(1), dummy2.load()} which has to be consistent with program order "edges":

  • dummy1.store(13) "in TO is before" y.load()
  • y.store(1) "in TO is before" dummy2.load()

(A seq_cst store + load hopefully form the C++ equivalent of a full memory barrier including StoreLoad, like they do in asm on real ISAs including even AArch64 where no separate barrier instructions are required.)

Now, we have two cases to consider: either y.store(1) is before y.load() or after in the total order.

If y.store(1) is before y.load() then foo() will not be called and we are safe.

If y.load() is before y.store(1), then combining it with the two edges we already have in program order, we deduce that:

  • dummy1.store(13) "in TO is before" dummy2.load()

Now, the dummy1.store(13) is a release operation, which releases effects of set(), and dummy2.load() is an acquire operation, so check() should see the effects of set() and thus bar() will not be called and we are safe.

Is it correct here to think that check() will see the results of set()? Can I combine the "edges" of various kinds ("program order" aka Sequenced Before, "total order", "before release", "after acquire") like that? I have serious doubts about this: C++ rules seem to talk about "synchronizes-with" relations between store and load on same location - here there is no such situation.

Note that we're only worried about the case where dummy1.store is known (via other reasoning) to be before dummy2.load in the seq_cst total order. So if they had been accessing the same variable, the load would have seen the stored value and synchronized with it.

(The memory-barrier / reordering reasoning for implementations where atomic loads and stores compile to at least 1-way memory barriers (and seq_cst operations can't reorder: e.g. a seq_cst store can't pass a seq_cst load) is that any loads/stores after dummy2.load definitely become visible to other threads after y.store. And similarly for the other thread, ... before y.load.)


You can play with my implementation of Options A,B,C at https://godbolt.org/z/u3dTa8

qbolec asked Feb 04 '20 09:02


1 Answer

@mpoeter explained why Options A and B are safe.

In practice on real implementations, I think Option A only needs std::atomic_thread_fence(std::memory_order_seq_cst) in Thread A, not B.

seq-cst stores in practice include a full memory barrier, or on AArch64 at least can't reorder with later acquire or seq_cst loads (stlr sequential-release has to drain from the store buffer before ldar can read from cache).

C++ -> asm mappings have a choice of putting the cost of draining the store buffer on atomic stores or on atomic loads. The sane choice for real implementations is to make atomic loads cheap, so seq_cst stores include a full barrier (including StoreLoad), while seq_cst loads are the same as acquire loads on most ISAs.

(But not POWER; there even loads need heavy-weight sync = full barrier to stop store-forwarding from other SMT threads on the same core which could lead to IRIW reordering, because seq_cst requires all threads to be able to agree on the order of all seq_cst ops. See "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?")

(Of course for a formal guarantee of safety, we do need a fence in both to promote acquire/release set() -> check() into a seq_cst synchronizes-with. Would also work for a relaxed set, I think, but a relaxed check could reorder with bar from the POV of other threads.)


I think the real problem with Option C is that it depends on some hypothetical observer that could synchronize-with y and the dummy operations. And thus we expect the compiler to preserve that ordering when making asm for a barrier-based ISA, where there is a single coherent shared memory state and barriers order this core/thread's access to that shared state. See also "C11 Standalone memory barriers LoadLoad StoreStore LoadStore StoreLoad" for more about this model vs. the stdatomic synchronizes-with ordering model for barriers weaker than seq_cst.

This is going to be true in practice on real ISAs; both threads include a full barrier or equivalent and compilers don't (yet) optimize atomics. But of course "compiling to a barrier-based ISA" isn't part of the ISO C++ standard. Coherent shared cache is the hypothetical observer that exists for asm reasoning but not for ISO C++ reasoning.

For Option C to work, we need an ordering like dummy1.store(13); / y.load() / set(); (as seen by Thread B) to violate some ISO C++ rule.

The thread running these statements has to behave as if set() executed first (because of Sequenced Before). That's fine, runtime memory ordering and/or compile time reordering of operations could still do that.

The two seq_cst ops d1=13 and y are consistent with the Sequenced Before (program order). set() doesn't participate in the required-to-exist global order for seq_cst ops because it's not seq_cst.

Thread B doesn't synchronize-with dummy1.store so no happens-before requirement on set relative to d1=13 applies, even though that assignment is a release operation.

I don't see any other possible rule violations; I can't find anything here that is required to be consistent with the set Sequenced-Before d1=13.

The "dummy1.store releases set()" reasoning is the flaw. That ordering only applies for a real observer that synchronizes-with it, or in asm. As @mpoeter answered, the existence of the seq_cst total order doesn't create or imply happens-before relationships, and that's the only thing that formally guarantees ordering outside of seq_cst.

It doesn't seem plausible that this reordering could really happen at runtime on any kind of "normal" CPU with coherent shared cache. (But if a compiler could remove dummy1 and dummy2 then clearly we'd have a problem, and I think that's allowed by the standard.)

But since the C++ memory model isn't defined in terms of a store buffer, shared coherent cache, or litmus tests of allowed reordering, things required by sanity are not formally required by C++ rules. This is perhaps intentional to allow optimizing away even seq_cst variables that turn out to be thread private. (Current compilers don't do that, of course, or any other optimization of atomic objects.)

An implementation where one thread really could see set() last while another could see set() first sounds implausible. Not even POWER could do that; both seq_cst load and store include full barriers for POWER. (I had suggested in comments that IRIW reordering might be relevant here; C++'s acq/rel rules are weak enough to accommodate that, but the total lack of guarantees outside of synchronizes-with or other happens-before situations is much weaker than any HW.)

C++ doesn't guarantee anything for non-seq_cst unless there actually is an observer, and then only for that observer. Without one we're in Schroedinger's cat territory. Or, if two trees fall in the forest, did one fall before the other? (If it's a big forest, general relativity says it depends on the observer and that there's no universal concept of simultaneity.)


@mpoeter suggested a compiler could even remove the dummy load and store operations, even on seq_cst objects.

I think that may be correct when they can prove that nothing can synchronize with an operation. e.g. a compiler that can see that dummy2 doesn't escape the function can probably remove that seq_cst load.

This has at least one real-world consequence: if compiling for AArch64, that would allow an earlier seq_cst store to reorder in practice with later relaxed operations, which wouldn't have been possible with a seq_cst store + load draining the store buffer before any later loads could execute.

Of course current compilers don't optimize atomics at all, even though ISO C++ doesn't forbid it; that's an unsolved problem for the standards committee.

This is allowed I think because the C++ memory model doesn't have an implicit observer or a requirement that all threads agree on ordering. It does provide some guarantees based on coherent caches, but it doesn't require visibility to all threads to be simultaneous.

Peter Cordes answered Sep 19 '22 07:09