Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is mov + mfence safe on NUMA?

I see that g++ generates a simple mov for x.load() and mov+mfence for x.store(y). Consider this classic example:

#include<atomic>
#include<thread>
std::atomic<bool> x,y;
bool r1;
bool r2;
void go1(){
    x.store(true);
}
void go2(){
    y.store(true);
}
bool go3(){
    bool a=x.load();
    bool b=y.load();
    r1 = a && !b;
}
bool go4(){
    bool b=y.load();
    bool a=x.load();
    r2= b && !a;
}





int main() {
    std::thread t1(go1);
    std::thread t2(go2);
    std::thread t3(go3);
    std::thread t4(go4);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    return r1*2 + r2;
}

in which according to https://godbolt.org/z/APS4ZY go1 and go2 are translated to

go1():
        mov     BYTE PTR x[rip], 1
        mfence
        ret
go2():
        mov     BYTE PTR y[rip], 1
        mfence
        ret

For this example, I ask if it is possible for threads t3 and t4 to disagree about the order in which writes done by t1 and t2 "trickle down" to their respective views of memory. In particular consider a NUMA architecture, in which t3 happens to live "closer" to t1 and t4 is "closer" to t2. Can it happen that a store buffer of t1 or t2 "flushes prematurely" even before reaching mfence and then t3 or t4 has a chance of observing the write sooner than planned?

like image 405
qbolec Avatar asked Mar 04 '23 12:03

qbolec


1 Answers

Yes, it's safe. There's no special compiler option you need to enable for NUMA-safe code, because the asm doesn't need to be different.

NUMA is not even relevant for this; a multi-core single-socket x86 system can already do as much memory reordering as the x86 memory model allows. (Maybe less often or with smaller time windows.)


TLDR.1: you seem to be misunderstanding what mfence does. It's a local barrier for the core that ran it (including StoreLoad, the only reordering x86 does allow without barriers for non-NT loads/stores). It's totally irrelevant for this, even if x86 was weakly ordered: We're looking at 1 store each from different cores, so ordering of a single core's operations wrt. each other can't matter.

(mfence just makes that core wait to do any loads until after its store is globally visible. Nothing special happens when the store commits while mfence is waiting for it. Does a memory barrier ensure that the cache coherence has been completed?.)


TL:DR.2: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? C++ allows different threads to disagree on the store order with relaxed or release stores (and acquire loads of course to rule out LoadLoad reordering), but not with seq_cst.

On architectures where it's possible, compilers need extra barriers on seq-cst stores to prevent it. On x86 it's not possible, full stop. Any x86-like system that allowed this reordering would not actually be x86, and would not be able to correctly run all x86 software.

All mainstream x86 systems you can buy are actually x86, with coherent caches and obey the x86 memory model.


x86's TSO memory model requires that all cores can agree on a Total Store Order

So the relevant rule is literally what the memory model is named after.

  • https://homes.cs.washington.edu/~bornholt/post/memory-models.html - some simple stuff about TSO vs. seq-cst. i.e. full ordering + a store buffer.
  • A better x86 memory model: x86-TSO (extended version) an attempt at a formal description of the x86 memory model. It doesn't mention NUMA because it's not relevant.

The TSO property follows directly from every core keeping its own stores private until they commit to L1d, and from having coherent caches.

The store buffer means that a core always sees its own stores before they become globally visible, unless it uses a StoreLoad barrier like mfence before reloading.

The only way for data to get between cores is by committing to L1d cache to make it globally visible; no sharing with some cores before others. (This is essential for TSO, regardless of NUMA).

The rest of the memory-ordering rules are mostly about internal reordering within a core: it makes sure its stores commit from the store buffer to L1d in program order, and after any earlier loads have already read their value. (And other internal rules to ensure LoadLoad ordering, including memory-order mis-speculation pipeline flushes if load-order speculation read a value that we lose the cache line for before we were "allowed" to have read the value.)

Data can only commit from a store buffer to a private L1d when that core has the relevant line in Modified state, which means every other core has it in Invalid state. This (along with the rest of the MESI rules) maintains coherency: there can't ever be conflicting copies of a cache line in different caches. So once a store has committed to cache, no other core can load a stale value. (What will be used for data exchange between threads are executing on one Core with HT?)

One common misconception is that stores have to percolate through the system before other CPUs stop loading stale values. That's 100% wrong in normal systems that use MESI to maintain coherent caches. It seems you're suffering from this misconception, too, when you talk about t3 being "closer" to t1. It can be true for DMA devices if you have non-coherent DMA, exactly because those DMA reads would not be coherent with the view of memory shared by CPUs participating in the MESI protocol. (But modern x86 has cache-coherent DMA, too.)

In fact, violating TSO requires some pretty funky behaviour, where a store becomes visible to some other cores before becoming visible to all. PowerPC does this in real life for logical threads on the same physical core snooping each other's retired stores that haven't yet committed to L1d cache. See my answer on Will two atomic writes to different locations in different threads always be seen in the same order by other threads? It's rareish even on weakly-ordered ISAs that allow it on paper.


Systems using x86 CPUs but with non-coherent shared memory are (or would be) very different beasts

(I'm not sure if any such beasts exist.)

That's more like tightly-coupled supercomputer clusters than single machines. If that's what you're thinking of, that's not just NUMA, it's fundamentally different and you can't run normal multi-threaded software across different coherency domains.

As Wikipedia says, essentially all NUMA systems are cache-coherent NUMA, aka ccNUMA.

Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model

Any non-coherent shared-memory system using x86 CPUs wouldn't be running a single kernel instance across different coherency domains. It would probably have a custom MPI library and/or other custom libraries to use the shared memory with explicit flushes / coherency to share data between coherency domains (systems).

Any threads you can start from a single process will definitely share a cache-coherent view of memory, and obey the x86 memory model, otherwise your system is broken / has hardware bugs. (I'm not aware of any such HW bugs existing and needing to be worked around in real hardware.)

A system with one or more Xeon Phi PCIe cards treats each Xeon Phi accelerator as a separate "system", because they're not coherent with main memory or each other, only internally coherent. See the bottom section of @Hadi's answer on How do data caches route the object in this example?. You might offload some work to a Xeon Phi accelerator, similar to how you'd offload work to a GPU, but that's done with something like message-passing. You would not have some threads running on the main Skylake (for example) CPU and other ordinary threads of the same process running on KNL cores on the Xeon Phi. If the Xeon Phi card was running an OS, it would be a separate instance of Linux or whatever from what's running on the host system.


x86 NUMA systems implement MESI by snooping the other sockets before loading from local DRAM, to maintain cache coherency.

And of course RFO (read-for-ownership) requests are broadcast to other socket(s).

New generations of Xeon have introduced more and more snoop settings to trade off different facets of performance. (e.g. more aggressive snooping costs more bandwidth on the link between sockets, but can reduce inter-core latency across sockets.)

  • https://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v4-product-family-technical-overview has a table of Broadwell snoop modes vs. LLC hit latency, and each of local and remote mem latency and bandwidth.
  • http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/

Chips that can work in Quad-socket and larger systems (E7 v1..4) have snoop filters; dual-socket E5 v1..4 just broadcasts snoops to the other socket using up a decent fraction of QPI bandwidth from what I've read. (This is for pre-Skylake-X Xeons, Broadwell and earlier. SKX uses a mesh network on chip, and might always have some kind of snoop filtering between sockets. I'm not sure what it does. BDW and earlier used the inclusive L3 cache as a snoop filter for local cores, but SKX has non-inclusive L3 and thus needs something else for snoop filtering even within a single socket.

AMD multi-socket chips used to use Hypertransport. Zen uses Infinity Fabric between clusters of 4 cores within one socket; I assume it uses that between sockets as well.

(Fun fact: multi-socket AMD K10 Opteron's Hypertransport could create tearing at 8-byte boundaries, while within a single socket 16-byte SIMD loads/stores were in practice atomic. SSE instructions: which CPUs can do atomic 16B memory operations? and Atomicity on x86. If you count that as reordering, that's one case where multi-socket can do more memory weirdness than single socket. But that's independent of NUMA per-se; you'd have the same thing with all the memory attached to one socket for a UMA setup.)


Related:

See also the duplicate links in What is the difference in logic and performance between LOCK XCHG and MOV+MFENCE? for xchg vs. mov+mfence. On modern CPUs, especially Skylake, mov+mfence is definitely slower for some ways of testing than xchg, and both are equivalent ways of doing a seq_cst store.

A release or relaxed store just needs a plain mov, and still has the same TSO ordering guarantees.

I think even weakly-ordered NT stores are still seen by all cores in an order they can agree upon. The "weakness" is in the order the become globally visible wrt. other loads+stores from the core doing them.

like image 196
Peter Cordes Avatar answered Mar 21 '23 08:03

Peter Cordes