
Why is acquire semantics only for reads, not writes? How can an LL/SC acquire CAS take a lock without the store reordering with the critical section?

To start with, consider release semantics. Suppose a data set is protected with a spinlock (mutex, etc.; it doesn't matter which exact implementation is used; for now, assume 0 means free and 1 means busy). After changing the data set, a thread stores 0 to the spinlock address. To force visibility of all previous actions before that store, the store is executed with release semantics, meaning all previous reads and writes must be made visible to other threads before it. Whether this is done with a full barrier or with a release mark on the single store operation is an implementation detail. That is (I hope) clear beyond any doubt.
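For illustration, here is a minimal sketch of such an unlock in C++ (the spinlock word and its 0/1 encoding are the ones assumed above; the names are mine):

    #include <atomic>

    std::atomic<int> spinlock_word{0};   // 0 = free, 1 = busy, as assumed above

    void unlock() {
        // Release: every read and write of the protected data set becomes
        // visible to other threads before they can observe the 0 here.
        spinlock_word.store(0, std::memory_order_release);
    }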

Then consider the moment when spinlock ownership is being taken. To protect against races, this is some kind of compare-and-set operation. With a single-instruction CAS implementation (x86, SPARC...), it is a combined read and write; the same goes for x86's atomic XCHG. With LL/SC (most RISCs), it breaks down into:

  1. Read (LL) the spinlock location until it shows the free state. (This can be optimized with a kind of CPU stall.)
  2. Write (SC) the value "occupied" (1 in our case). The CPU reports whether the operation was successful (condition flag, output register, etc.).
  3. Check the write (SC) result and, if it failed, go to step 1.

In all cases, the operation that must be visible to other threads to show that the spinlock is occupied is the write of 1 to its location, and a barrier must be committed between this write and the subsequent manipulations of the data set protected by the spinlock. Reading the spinlock contributes nothing to the protection scheme except enabling the CAS or LL/SC operation.

But all actually implemented schemes allow the acquire-semantics modifier on reads (or CAS), not on writes. As a result, the LL/SC scheme would seem to require an additional final read-with-acquire operation on the spinlock to commit the required barrier. But there is no such instruction in typical compiler output. For example, compiling this for ARM:

  // p is a std::atomic<int>* pointing at the spinlock word
  for(;;) {
    int e{0};
    int d{1};
    if (std::atomic_compare_exchange_weak_explicit(p, &e, d,
          std::memory_order_acquire,
          std::memory_order_relaxed)) {
      return;
    }
  }

its output contains first LDAXR == LL+acquire, then STXR == SC (with no barrier in it; so is there no guarantee that other threads will see it?). This is likely not an artifact of my setup; the same pattern is generated e.g. in glibc: pthread_spin_trylock calls __atomic_compare_exchange_weak_acquire (and no further barriers), which falls through to the GCC builtin __atomic_compare_exchange_n with acquire on the mutex read and no release on the mutex write.
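For reference, that glibc-style trylock can be sketched directly on the GCC builtin (a simplified sketch in the spirit of pthread_spin_trylock, not its actual source; my_spin_trylock is a hypothetical name):

    #include <errno.h>

    int my_spin_trylock(int *lock) {
        int expected = 0;
        // Weak CAS with acquire on success, relaxed on failure --
        // the same ordering as the builtin call observed above.
        if (__atomic_compare_exchange_n(lock, &expected, 1, /*weak=*/1,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return 0;
        return EBUSY;
    }

Note there is no trailing barrier after the CAS; the acquire on the read side is the only ordering, which is exactly what the question is about.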

It seems I've missed some fundamental detail in this reasoning. Would anybody correct it?

This can also be split into two sub-questions:

SQ1: In an instruction sequence like:

(1) load_linked+acquire mutex_address     ; found it is free
(2) store_conditional mutex_address       ; succeeded
(3) read or write of mutex-protected area

what prevents the CPU from reordering (2) and (3), with the result that other threads won't see that the mutex is locked?

SQ2: Is there a design factor that suggests having acquire semantics only on loads?

I've seen some examples of lock-free code, such as:

shared:

int var;
std::atomic<bool> flag{false};

thread 1:

var = value;
flag.store(true, std::memory_order_release);

thread 2:

if (flag.load(std::memory_order_acquire)) {
   // now we can safely access var
   value = var;
   ... do something with value ...
}

but this pattern should only be expected to work once the mutex-protected style is working reliably.

asked Oct 23 '25 by Netch


2 Answers

Its output contains first LDAXR == LL+acquire, then STXR == SC
(with no barrier in it; so is there no guarantee that other threads will see it?)

Huh? Stores always become visible to other threads; the store buffer always drains itself as fast as possible. The question is only whether to block later loads/stores in this thread until the store buffer is empty. (That's required for seq-cst pure stores, for example, although actually only seq_cst loads need to wait for seq_cst stores to drain from the store buffer).

STXR is exclusive and tied to the LL, so it and the load are indivisible in the global order of operations, as the load and store sides of an atomic RMW operation, just like x86 does in one instruction with lock cmpxchg. (This is actually an overstatement: see For purposes of ordering, is atomic read-modify-write one operation or two? - you can observe some effects of the store side of an atomic RMW reordering with later operations even when the load can't. I'm not sure I fully understand how this is still safe, but it is.)

The atomic RMW can move earlier (because acquire loads can do that, and so can relaxed stores). But it can't move later (because acquire-loads can't do that). Therefore the atomic RMW appears in the global order before any operations in the critical section, and is sufficient for taking a lock. It doesn't have to wait for earlier operations like cache-miss stores; it can let them move into the critical section. But that's not a problem.

However, if you had used an acq_rel CAS, it couldn't take the lock until after finishing all earlier loads/stores (because of the release semantics of the store side).
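As a sketch of that difference (a simple spinlock word; the names are mine, not from any library):

    #include <atomic>

    std::atomic<int> lock_word{0};

    void lock_acquire() {
        int expected = 0;
        // Success order acquire: the critical section can't hoist above the
        // RMW, but earlier stores may still sink into it. That's harmless.
        while (!lock_word.compare_exchange_weak(expected, 1,
                   std::memory_order_acquire, std::memory_order_relaxed))
            expected = 0;
    }

    void lock_acq_rel() {
        int expected = 0;
        // Success order acq_rel: additionally, all earlier loads/stores must
        // complete before the lock is visibly taken.
        while (!lock_word.compare_exchange_weak(expected, 1,
                   std::memory_order_acq_rel, std::memory_order_relaxed))
            expected = 0;
    }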

I'm not sure if there's any asm difference between acq_rel and seq_cst for an atomic RMW. IIRC, yes on PowerPC; not on x86, where all RMWs are seq_cst. AArch64 ARMv8.3 has ldapr, which allows acq_rel without seq_cst. Before that there was only ldar / stlr / stlxr: the ISA had only relaxed and sequential-release.


LDAR + STR would be like x86 cmpxchg without a lock prefix: an acquire load and a separate store. (Except that the store side of x86 cmpxchg is still a release store, though not a sequential-release, because of the x86 memory model.)


Other confirmation of my reasoning that mo_acquire for the "success" side of a CAS is sufficient for taking a lock:

  • https://en.cppreference.com/w/cpp/atomic/memory_order says "The lock() operation on a Mutex is also an acquire operation"
  • Glibc's pthread_spin_trylock uses the GCC builtin __atomic_compare_exchange_n on the mutex with only acquire, not acq_rel or seq_cst. We know many smart people have looked at glibc. And on platforms where it isn't effectively strengthened to seq-cst asm, bugs would probably have been noticed if there were any.

what prevents the CPU from reordering (2) and (3), with the result that other threads won't see that the mutex is locked?

That would require other threads to see the LL and SC as separate operations, not as one atomic RMW. The whole point of LL/SC is to prevent exactly that. Weaker ordering lets the pair move around as a unit, not split apart.

SQ2: Is there a design factor that suggests having acquire semantics only on loads?

Yes: consider pure loads and pure stores, not RMWs. See Jeff Preshing on acquire and release semantics.

The one-way barrier of a release store naturally works well with the store buffer on real CPUs. CPUs "want" to load early and store late. Jeff Preshing's article Memory Barriers Are Like Source Control Operations is perhaps a helpful analogy for how CPUs interact with coherent cache. (Also related: this answer mentions that, along with why compile-time reordering can help compilers optimize.)

A store that could only appear earlier, not later, would basically require flushing the store buffer: i.e. a relaxed store followed by a full barrier (like atomic_thread_fence(seq_cst), e.g. ARM dmb ish, or x86 mfence or any locked operation). That is what you get from a seq-cst store, so we more or less already have a name for it, and it's very expensive.
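A sketch of that equivalence in C++ (the same ordering intent written two ways; on a weakly-ordered target, both end up paying for a full barrier):

    #include <atomic>

    std::atomic<int> g{0};

    void seqcst_store() {
        // A seq-cst store: later seq_cst operations in this thread
        // must wait until it has drained from the store buffer.
        g.store(1, std::memory_order_seq_cst);
    }

    void by_hand() {
        g.store(1, std::memory_order_relaxed);               // plain store
        std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier: the expensive part
    }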

answered Oct 25 '25 by Peter Cordes


Yes, it is possible on real architectures for the store-conditional to reorder with the critical section. Nevertheless, implementing a mutex this way is safe and correct! You do not need an additional acquire-load or any other barriers.

The basic idea is the following. Regardless of any reordering with other accesses, the LL/SC pair is still exclusive: if the SC succeeds, you know that no stores to the lock took place between it and the LL. So suppose the LL returns zero (lock available), and the SC of 1 (lock taken) succeeds. You now own the lock.

But we can say more: for all intents and purposes, you have already owned it since the moment the LL returned zero, as nobody else could have taken it in between (or else the SC would have failed). Your ownership of the lock is effectively retroactive to the moment of the LL.

Therefore, it was safe to start speculatively executing the critical section as soon as the LL returned. It was not necessary to wait for the SC, provided that everything remained speculative until we were certain that the SC would succeed.

As such, it's sufficient to have an acquire barrier on the LL, since all we need to ensure is that the critical section is not reordered before the LL (which would obviously break everything). And of course, we need a release store when we unlock the mutex. It's this pairing between the acquire load on locking and the prior release store on unlocking that ensures the basic correctness property of a mutex: this instance of the critical section observes all the effects of the previous instance, and the previous instance observes none of the effects of this one.
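Putting both halves together, a minimal sketch of that pairing (my own illustration, not any library's code):

    #include <atomic>

    class SpinLock {
        std::atomic<int> word{0};   // 0 = free, 1 = taken
    public:
        void lock() {
            int expected = 0;
            // Acquire on success pairs with the release in unlock():
            // this critical section sees everything the previous one did.
            while (!word.compare_exchange_weak(expected, 1,
                       std::memory_order_acquire, std::memory_order_relaxed))
                expected = 0;
        }
        void unlock() {
            // Release: publishes this critical section before freeing the lock.
            word.store(0, std::memory_order_release);
        }
    };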

It's true that if some other core should simply test the lock, rather than actually trying to take it for themselves, they may observe the effects of the critical section on the protected objects before the lock appears as taken. But that isn't something that you would ever do when using a mutex properly. The other core should not be accessing the protected objects at all unless it actually owns the lock. Under the scenario discussed above, assuming our core succeeded in taking the lock, any other core that was trying must have failed, and therefore they should not even be looking at the protected objects which we may be accessing "early" before the SC.

Related to this topic, see also:

Why does this spinlock require memory_order_acquire_release instead of just acquire?

Is memory_order_acquire really sufficient for locking a spinlock?


For ARMv8/v9 in particular, the architecture's memory model does indeed allow stxr, or even stlxr, to reorder with a following memory access, with some caveats and exceptions.

In For purposes of ordering, is atomic read-modify-write one operation or two? you can see an example, reproducible on real hardware, where a later ldr (or even ldapr, if available) becomes visible before an earlier stlxr.

The question of whether a later store (str) could become visible before the stxr/stlxr is a little more delicate. We must only do the str if the stxr succeeded, with code such as:

retry:
    ldaxr   x0, [lock]              // LL with acquire; x0 = current lock value
    cbnz    x0, failed_to_take_lock // nonzero means someone else holds it
    mov     x3, #1                  // stxr needs a register source, not an immediate
    stxr    w1, x3, [lock]          // SC of 1; w1 = 0 on success
    cbnz    w1, retry               // SC failed: retry
    str     x2, [shared_variable]   // critical section
    // ...
    stlr    xzr, [lock]             // release store of 0: unlock

So there's an apparent control dependency between the stxr and the str, and normally ARMv8 doesn't allow stores to be reordered through a control dependency. I think this is implied in B2.3.7 of the Architecture Reference Manual, the third clause in the definition of "Dependency-ordered-before". After all, if it turns out that a store should not have been executed, we can't allow it to have been seen by any other core (unless we somehow had the ability to roll back the other core too, which ARMv8 certainly cannot do).

However, there is an exception: in B2.3.6, in the first clause of the definition of "Dependency ordered through registers and memory", they exclude a register write generated by a Store Exclusive instruction. So if I understand this correctly, there is not considered to be a true control dependency here, and the str is allowed to become visible before the stxr.

In practice, I suppose this can only happen if we somehow know in advance that the stxr will succeed, since if it fails, we might load nonzero on the next iteration, in which case we would bail out and the str should never have happened at all. Perhaps an implementation could take the cache line in exclusive state at the ldaxr, and promise to hold it for some reasonable number of clock cycles, denying any requests for ownership from other cores that might arrive before then. So if we can look ahead and see that the stxr will inevitably be executed within that time period, then we know it will succeed, and we could safely begin to execute past it in a non-speculative manner. I don't know if any real implementation can do this - I have not tried.

The stxr obviously can't be reordered with a later stlr (the release ordering of the latter forbids it). If the exclusive store is stlxr, then it can't be reordered with a later ldar either, since despite their names, stlr/ldar actually achieve sequentially consistent memory ordering: no StoreLoad reordering is allowed among them. But stlxr can still be reordered with a later ldapr, which has true acquire-only ordering.

answered Oct 25 '25 by Nate Eldredge


