This question is a follow-up/clarification to this:
Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?
This states the MOV
assembly instruction is sufficient to perform acquire-release semantics on x86. We do not need LOCK
, fences or xchg
etc. However, I am struggling to understand how this works.
Intel doc Vol 3A Chapter 8 states:
https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf
In a single-processor (core) system....
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes, with the following exceptions:
but this is for a single core. The multi-core section does not seem to mention how loads are enforced:
In a multiple-processor system, the following ordering principles apply:
- Individual processors use the same ordering principles as in a single-processor system.
- Writes by a single processor are observed in the same order by all processors.
- Writes from an individual processor are NOT ordered with respect to the writes from other processors.
- Memory ordering obeys causality (memory ordering respects transitive visibility).
- Any two stores are seen in a consistent order by processors other than those performing the stores
- Locked instructions have a total order.
So how can MOV
alone can facilitate acquire-release?
but this is for a single core. The multi-core section does not seem to mention how loads are enforced:
The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is ... when loading/storing from cache-coherent shared memory. i.e. multi-processor systems don't introduce new ways for reordering, they just mean the possible observers now include code on other cores instead of just DMA / IO devices.
The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.
The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and didn't become visible to any cores before that. (Except to the core doing the store, via store forwarding.) That's why only local barriers are sufficient to recover sequential consistency on top of a SC + store-buffer model. (For x86, just mo_seq_cst
just needs mfence
after SC stores, to drain the store buffer before any further loads can execute.
mfence
and lock
ed instructions (which are also full barriers) don't have to bother other cores, just make this one wait).
One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:
These multiprocessing mechanisms have the following characteristics:
- To maintain system memory coherency — When two or more processors are attempting simultaneously to access the same address in system memory, some communication mechanism or memory access protocol must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock a memory location.
- To maintain cache consistency — When one processor accesses data cached on another processor, it must not receive incorrect data. If it modifies data, all other processors that access that data must receive the modified data.
- To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed.
- [...]
The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.
(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)
The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model. But the section 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations shows that loads aren't reordered with loads. Another section also shows that LoadStore reordering is forbidden. Acq_rel is basically blocking all reordering except StoreLoad, and that's what x86 does. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)
Related:
In general, most weaker memory HW models also only allow local reordering so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition. (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from light-weight barriers for efficiency for stuff that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/)
A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel
in C++ allows IRIW reordering; only seq_cst
forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.
Refreshing the semantics of acquire and release (quoting cppreference rather than the standard, because it's what I have on hand - the standard is more...verbose, here):
memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread
memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable
This gives us four things to guarantee:
Reviewing the guarantees:
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes [..]
- Individual processors use the same ordering principles as in a single-processor system.
This is sufficient to satisfy the ordering guarantees.
For acquire ordering, consider a read of the atomic has occurred: for that thread, clearly any later read or write migrating before would violate the first or second bullet points, respectively.
For release ordering, consider a write of the atomic has occurred: for that thread, clearly any prior reads or write migrating after would violate the second or third bullet points, respectively.
The only thing left is to ensure that if a thread reads a released store, it will see all the other loads the writer thread had produced up to that point. This is where the other multi-processor guarantee is needed.
- Writes by a single processor are observed in the same order by all processors.
This is sufficient to satisfy acquire-release synchronization.
We've already established that when the release write occurs, all other writes prior to it will have also occurred. This bullet point then ensures that if another thread reads the released write, it will read all the writes the writer produced up to that point. (If it does not, then it would be observing that single processor's writes in a different order than the single processor, violating the bullet point.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With