Consider an atomic read-modify-write operation such as x.exchange(..., std::memory_order_acq_rel). For purposes of ordering with respect to loads and stores to other objects, is this treated as:
1. a single operation with acquire-release semantics?
2. Or, as an acquire load followed by a release store, with the added guarantee that other loads and stores to x will observe both of them or neither?
If it's #2, then although no other operations in the same thread could be reordered before the load or after the store, it leaves open the possibility that they could be reordered in between the two.
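To spell out what I mean by #2, I am picturing the exchange informally as the following pair of operations (this is just a way of writing it down, not a real implementation, since written this way the two halves would obviously not be atomic with respect to other accesses to x):

int old = x.load(std::memory_order_acquire);  // the "read" half: acquire
x.store(1, std::memory_order_release);        // the "write" half: release
// Under interpretation #2, other loads and stores in this thread could still
// be reordered into the window between these two halves.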
As a concrete example, consider:
std::atomic<int> x, y;

void thread_A() {
    x.exchange(1, std::memory_order_acq_rel);
    y.store(1, std::memory_order_relaxed);
}

void thread_B() {
    // These two loads cannot be reordered
    int yy = y.load(std::memory_order_acquire);
    int xx = x.load(std::memory_order_acquire);
    std::cout << xx << ", " << yy << std::endl;
}
Is it possible for thread_B to output 0, 1?
If the x.exchange() were replaced by x.store(1, std::memory_order_release); then thread_B could certainly output 0, 1. Should the extra implicit load in exchange() rule that out?
cppreference makes it sound like #1 is the case and 0, 1 is forbidden:
A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store.
But I can't find anything explicit in the standard to support this. Actually the standard says very little about atomic read-modify-write operations at all, except 31.4 (10) in N4860 which is just the obvious property that the read has to read the last value written before the write. So although I hate to question cppreference, I'm wondering if this is actually correct.
I'm also looking at how it's implemented on ARM64. Both gcc and clang compile thread_A as essentially
ldaxr [x]
stlxr #1, [x]
str #1, [y]
(See on godbolt.) Based on my understanding of ARM64 semantics, and some tests (with a load of y instead of a store), I think that the str [y] can become visible before the stlxr [x] (though of course not before the ldaxr). This would make it possible for thread_B to observe 0, 1. So if #1 is true then it would seem that gcc and clang are both wrong, which I hesitate to believe.
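For what it's worth, a harness along the following lines could be used to look for the 0, 1 outcome (this is only a sketch; the iteration count and the bookkeeping counter are arbitrary choices of mine):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
std::atomic<int> observed_01{0};

void thread_A() {
    x.exchange(1, std::memory_order_acq_rel);
    y.store(1, std::memory_order_relaxed);
}

void thread_B() {
    int yy = y.load(std::memory_order_acquire);
    int xx = x.load(std::memory_order_acquire);
    if (xx == 0 && yy == 1)
        observed_01.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    for (int i = 0; i < 100000; i++) {
        x.store(0);
        y.store(0);
        std::thread a(thread_A), b(thread_B);
        a.join();
        b.join();
    }
    std::printf("0, 1 observed %d times\n", observed_01.load(std::memory_order_relaxed));
}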
Finally, as far as I can tell, replacing memory_order_acq_rel with seq_cst wouldn't change anything about this analysis, since it only adds semantics with respect to other seq_cst operations, and we don't have any here.
I found What exact rules in the C++ memory model prevent reordering before acquire operations? which, if I understand it correctly, seems to agree that #2 is correct, and that 0, 1 could be observed. I'd still appreciate confirmation, as well as a check on whether the cppreference quote is actually wrong or if I'm misunderstanding it.
Not an answer at the level of the language standard, but some evidence that in practice, the answer can be "two". And as I guessed in the question, this can happen even if the RMW is seq_cst.
I haven't been able to observe stores being reordered as in the original question, but here is an example that shows the store of an atomic seq_cst RMW being reordered with a following relaxed load.
The program below is an implementation of Peterson's algorithm adapted from LWimsey's example in What's are practical example where acquire release memory order differs from sequential consistency?. As explained there, the correct version of the algorithm involves
me.store(true, std::memory_order_seq_cst);
if (other.load(std::memory_order_seq_cst) == false)
    // lock taken
where it is essential that the load become visible after the store.
If RMW were a single operation for the purposes of ordering semantics, we would expect that it would be safe to do
me.exchange(true, std::memory_order_seq_cst);
if (other.load(std::memory_order_relaxed) == false) {
    // Ensure critical section doesn't start until we know we have the lock
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // lock taken
}
on the theory that since the exchange operation has acquire semantics, the load must become visible after the exchange has completed, and in particular after the store of true to me has become visible.
But in fact on ARMv8-a, using either gcc or clang, such code frequently fails. It appears that in fact, exchange does consist of an acquire-load and a release-store, and that other.load may become visible before the release-store. (Though not before the acquire-load of the exchange, but that is irrelevant here.)
clang generates code like the following:
    mov w11, #1
retry:
    ldaxrb wzr, [me]
    stlxrb w12, w11, [me]
    cbnz w12, retry
    ldrb w11, [other]
See https://godbolt.org/z/fhjjn7, lines 116-120 of the assembly output. (gcc is the same but buried inside a library function.) By ARM64 memory ordering semantics, the release-store stlxrb can be reordered with following loads and stores. The fact that it's exclusive doesn't change that.
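(As an aside, my understanding is that placing a full seq_cst fence between the RMW and the load would rule this reordering out, since the fence compiles to dmb ish on ARM64 and orders the earlier store against the later load. That variant, sketched here, is not what the test program below uses:)

me.exchange(true, std::memory_order_seq_cst);
// Full barrier: keeps the following load from becoming visible before the
// store half of the exchange (dmb ish on ARM64).
std::atomic_thread_fence(std::memory_order_seq_cst);
if (other.load(std::memory_order_relaxed) == false) {
    // lock taken
}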
To make the reordering happen more often, we arrange for the data being stored to depend on a previous load that missed cache, which we ensure by evicting that line with dc civac. We also need to put the two flags me and other on separate cache lines. Otherwise, as I understand it, even if thread A does its load before the store, then thread B has to wait to begin its RMW until after A's store completes, and in particular won't do its load until A's store is visible.
On a multi-core Cortex A72 (Raspberry Pi 4B), the assertion typically fails after a few thousand iterations, which is nearly instantaneous.
The code needs to be built with -O2. I suspect it will not work if built for ARMv8.2 or higher, where swpalb is available.
// Based on https://stackoverflow.com/a/41859912/634919 by LWimsey
#include <thread>
#include <atomic>
#include <cassert>
// size that's at least as big as a cache line
constexpr size_t cache_line_size = 256;
static void take_lock(std::atomic<bool> &me, std::atomic<bool> &other) {
    alignas(cache_line_size) bool uncached_true = true;
    for (;;) {
        // Evict uncached_true from cache.
        asm volatile("dc civac, %0" : : "r" (&uncached_true) : "memory");
        // So the release store to `me` may be delayed while
        // `uncached_true` is loaded. This should give the machine
        // time to proceed with the load of `other`, which is not
        // forbidden by the release semantics of the store to `me`.
        me.exchange(uncached_true, std::memory_order_seq_cst);
        if (other.load(std::memory_order_relaxed) == false) {
            // taken!
            std::atomic_thread_fence(std::memory_order_seq_cst);
            return;
        }
        // start over
        me.store(false, std::memory_order_seq_cst);
    }
}

static void drop_lock(std::atomic<bool> &me) {
    me.store(false, std::memory_order_seq_cst);
}

alignas(cache_line_size) std::atomic<int> counter{0};

static void critical_section(void) {
    // We should be the only thread inside here.
    int tmp = counter.fetch_add(1, std::memory_order_seq_cst);
    assert(tmp == 0);
    // Delay to give the other thread a chance to try the lock
    for (int i = 0; i < 100; i++)
        asm volatile("");
    tmp = counter.fetch_sub(1, std::memory_order_seq_cst);
    assert(tmp == 1);
}

static void busy(std::atomic<bool> *me, std::atomic<bool> *other)
{
    for (;;) {
        take_lock(*me, *other);
        std::atomic_thread_fence(std::memory_order_seq_cst); // paranoia
        critical_section();
        std::atomic_thread_fence(std::memory_order_seq_cst); // paranoia
        drop_lock(*me);
    }
}

// The two flags need to be on separate cache lines.
alignas(cache_line_size) std::atomic<bool> flag1{false}, flag2{false};

int main()
{
    std::thread t1(busy, &flag1, &flag2);
    std::thread t2(busy, &flag2, &flag1);
    t1.join(); // will never happen
    t2.join();
    return 0;
}