Why don't compilers merge redundant std::atomic writes?

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?

If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?

Here's the code in compiler explorer.

— asked by PeteC, Aug 30 '17


5 Answers

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

  y.store(1, order);
  y.store(2, order);
  y.store(3, order); // inlining + constant-folding could produce this in real code

The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on seeing it would have a race bug, but only the garden-variety kind of race bug, not the C++ Undefined Behaviour kind of data race (that's UB only with non-atomic variables). A program that expects to sometimes see the intermediate value is not necessarily even buggy. (See below re: progress bars.)
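
For illustration, here's a minimal sketch of such an observer (the writer/observer split is my own, not from the question); the standard permits the spin loop to never terminate:

#include <atomic>
std::atomic<int> y(0);

// Writer thread: the compiler may legally collapse these into y.store(3).
void writer() {
  y.store(1, std::memory_order_relaxed);
  y.store(2, std::memory_order_relaxed);
  y.store(3, std::memory_order_relaxed);
}

// Reader thread: spins hoping to catch the intermediate value. This is
// race-free (all accesses are atomic) but may loop forever, with or
// without the optimization, because no execution is required to expose 2.
void observer() {
  while (y.load(std::memory_order_relaxed) != 2) {
  }
}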

Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.

It doesn't depend on the target architecture or hardware; just as compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.


So why don't compilers do this optimization?

It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.

The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
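
A sketch of the pattern (the progress variable and do_chunk are hypothetical):

#include <atomic>

std::atomic<int> progress(0);
void do_chunk(int) { /* non-atomic work, no other atomics */ }

void work(int n) {
  for (int i = 0; i < n; ++i) {
    do_chunk(i);
    progress.store(i + 1, std::memory_order_relaxed);  // GUI thread polls this
  }
}

If a compiler sank the store out of the loop and coalesced the stores, a thread polling progress would read 0 for the whole run and then suddenly n.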

There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)

Compiler-writers have correctly noticed that programmers expect an atomic store to actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) That is, coalescing would violate the principle of least surprise.

However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
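
A hypothetical sketch of that cost (the by-value copy pays an atomic ref-count increment on entry to each iteration and a decrement at the end of its scope):

#include <memory>

int sum(const std::shared_ptr<int>& sp, int n) {
  int total = 0;
  for (int i = 0; i < n; ++i) {
    std::shared_ptr<int> local = sp;  // atomic ref-count ++ every iteration
    total += *local;
  }                                   // atomic ref-count -- every iteration
  return total;
}

A compiler that were allowed (and willing) to optimize atomic RMW operations could hoist the inc/dec pair out of the loop, or remove it entirely; today none will.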

Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.


Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:

  • http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
  • http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?

See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)


Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.

Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).

Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
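
A sketch of that workaround applied to the hypothetical progress bar above:

#include <atomic>

volatile std::atomic<int> progress(0);  // volatile: every source-level store
                                        // must appear in the generated code

void work(int n) {
  for (int i = 0; i < n; ++i) {
    progress.store(i + 1, std::memory_order_relaxed);
  }
}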

I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:

if(x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);  // release a lock before a long-running loop
    for() {...} // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.

Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.
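
In other words, the compiler could legally produce something like the following (my sketch of the transformed code, with foo, bar and the loop made concrete just for illustration):

#include <atomic>

std::atomic<int> y(1);
void foo() {}
void bar() {}

void g(bool x, long n) {
  if (x) {
    foo();
  } else {
    bar();
    for (long i = 0; i < n; ++i) {
      // long-running loop, no atomics or volatiles
    }
  }
  y.store(0);  // one store, sunk out of both branches and past the loop
}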

volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.


Other proposed reasons for not optimizing include: nobody has written the complicated compiler code that would perform these optimizations safely (without ever getting it wrong). That on its own is not a sufficient explanation, because N4455 says LLVM already implements, or could easily implement, several of the optimizations it mentions.

The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

— answered by Peter Cordes


You are referring to dead-store elimination.

It is not forbidden to eliminate an atomic dead store but it is harder to prove that an atomic store qualifies as such.

Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations.
In the case of dead store elimination it isn’t sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.

from N4455 No Sane Compiler Would Optimize Atomics
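
For contrast, compilers already perform this optimization on non-atomic stores without hesitation (a minimal sketch):

int y_plain = 0;

void g() {
  y_plain = 1;  // dead store: eliminated
  y_plain = 2;  // dead store: eliminated
  y_plain = 3;  // only this store survives; a concurrent reader would be
}               // a data race (UB), so the compiler may assume none exists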

The problem with atomic DSE, in the general case, is that it involves looking for synchronization points. In my understanding, this term means points in the code where there is a happens-before relationship between an instruction in thread A and an instruction in thread B.

Consider this code executed by a thread A:

y.store(1, std::memory_order_seq_cst);
y.store(2, std::memory_order_seq_cst);
y.store(3, std::memory_order_seq_cst);

Can it be optimised as y.store(3, std::memory_order_seq_cst)?

If a thread B is waiting to see y = 2 (e.g. with a CAS), it would never observe that value if the code gets optimised.
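
A sketch of such a waiting thread (the CAS loop is my illustration):

#include <atomic>
extern std::atomic<int> y;

// Thread B: spins until it catches y == 2, then swaps in 4. If the three
// stores are coalesced into y.store(3), this loop never exits.
void thread_b() {
  int expected = 2;
  while (!y.compare_exchange_weak(expected, 4, std::memory_order_seq_cst)) {
    expected = 2;  // compare_exchange overwrites expected on failure
  }
}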

However, in my understanding, having B loop CASsing on y = 2 is a race (in the colloquial sense; since both accesses are atomic, it is not a C++ data race), as there is no guarantee about how the two threads' instructions interleave.
An execution where all of A's stores happen before B's loop starts is observable (i.e. allowed), and thus the compiler can optimise to y.store(3, std::memory_order_seq_cst).

If threads A and B were synchronized, somehow, between the stores in thread A, then the optimisation would not be allowed (a partial order would be induced, which could require B to observe y = 2).
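
A sketch of what such synchronization could look like (the ready/ack handshake is my own illustration): thread A must not issue the final store until B acknowledges having read the intermediate value, so the stores can no longer be coalesced.

#include <atomic>

std::atomic<int> y(0);
std::atomic<bool> ready(false), ack(false);

void thread_a() {
  y.store(2, std::memory_order_relaxed);
  ready.store(true, std::memory_order_release);    // publish y == 2
  while (!ack.load(std::memory_order_acquire)) {}  // wait for B
  y.store(3, std::memory_order_relaxed);           // can't merge with the 2
}

void thread_b() {
  while (!ready.load(std::memory_order_acquire)) {}
  int seen = y.load(std::memory_order_relaxed);    // guaranteed to see 2
  ack.store(true, std::memory_order_release);
}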

Proving that no such synchronization exists is hard, as it involves considering a broader scope and taking into account all the quirks of an architecture.

As I understand it, because atomic operations are relatively young and because reasoning about memory ordering, visibility and synchronization is difficult, compilers won't perform all the possible optimisations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.

I believe your example is a simplification of the counting thread given above: since it has no other thread and no synchronization point, as far as I can see, the compiler could have optimised the three stores.

— answered by Margaret Bloom


While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on its value. The example you gave is so specific that compiler developers don't see it as worth optimizing. However, if one thread is setting, e.g., consecutive values for an atomic (0, 1, 2, etc.), another thread may be putting something in the slots indicated by the value of the atomic.
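
One way such a pattern could look (a hypothetical sketch; coalescing the counter stores into one would leave the consumer blind to every intermediate slot until the very end):

#include <array>
#include <atomic>

std::array<int, 16> slots;
std::atomic<int> count(0);

void producer() {
  for (int i = 0; i < 16; ++i) {
    slots[i] = i * i;                               // fill the slot
    count.store(i + 1, std::memory_order_release);  // publish slots[0..i]
  }
}

void consumer() {
  int seen = 0;
  while (seen < 16) {
    int avail = count.load(std::memory_order_acquire);
    for (; seen < avail; ++seen) {
      // process slots[seen]; safe thanks to release/acquire publication
    }
  }
}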

— answered by Serge Rogatch


NB: I was going to post this as a comment but it's a bit too wordy.

One interesting fact is that this behavior isn't, in C++ terms, a data race.

Note 21 on p.14 of N3690 is interesting (my emphasis): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf

The execution of a program contains a data race if it contains two conflicting actions in different threads, *at least one of which is not atomic*

Also, note 5 on p.11:

“Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

So a conflicting action on an atomic is never a data race - in terms of the C++ standard.

These operations are all atomic (and specifically relaxed), but no data race here, folks!

I agree there's no reliable/predictable difference between these two on any (reasonable) platform:

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

and

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
}

But within the definition provided by the C++ memory model, it isn't a data race.

I can't easily see why that definition was chosen, but it does hand the developer a few cards for engaging in haphazard communication between threads that they may know (on their platform) will statistically work.

For example, setting a value 3 times and then reading it back will show some degree of contention for that location. Such approaches aren't deterministic, but many effective concurrent algorithms aren't deterministic. For example, a timed-out try_lock_until() is always a race condition but remains a useful technique.
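
A sketch of that last point, using the standard std::timed_mutex API (the 50 ms deadline is arbitrary):

#include <chrono>
#include <mutex>

std::timed_mutex m;

bool try_work() {
  auto deadline = std::chrono::steady_clock::now()
                + std::chrono::milliseconds(50);
  if (m.try_lock_until(deadline)) {  // whether we win is inherently a race
    // ... critical section ...
    m.unlock();
    return true;
  }
  return false;  // timed out: some other thread held the lock too long
}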

It appears the C++ Standard provides you with certainty around 'data races' while permitting certain fun-and-games with race conditions, which are, on final analysis, different things.

In short, the standard appears to say that where other threads might see the 'hammering' effect of a value being set 3 times, those threads must be able to see that effect (even if they sometimes won't!). And on pretty much all modern platforms, another thread may indeed, under some circumstances, observe the hammering.

— answered by Persixty


In short, because the standard (for example, the paragraphs around and below 20 in [intro.multithread]) disallows it.

There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).

If your thread writes three values to memory (let's say 1, 2, and 3), one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the order they happen (whether by scheduling, coincidence, or whatever reason). That's a guarantee.
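
The ordering guarantee in play here is coherence of the modification order (my sketch of read-read coherence in action):

#include <atomic>
extern std::atomic<int> y;  // another thread stores 1, then 2, then 3

void reader() {
  int a = y.load(std::memory_order_relaxed);
  int b = y.load(std::memory_order_relaxed);
  // Read-read coherence: b cannot be earlier than a in y's modification
  // order. Having once seen 2, this thread can never read back 1.
  (void)a; (void)b;
}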

How is this possible if you only do half of the writes (or even only a single one)? It isn't.

What if your thread instead writes out 1, 1, 1 but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that just never appears because it's optimized out?

It is impossible to provide the guarantees that are given if stores (and loads, too) aren't performed as requested. All of them, and in the same order.

— answered by Damon