Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does memory fencing blocks threads in multi-core CPUs?

I was reading the Intel instruction set guide 64-ia-32 guide to get an idea on memory fences. My question is that for an example with SFENCE, in order to make sure that all store operations are globally visible, does the multi-core CPU parks all the threads even running on other cores till the cache coherence achieved ?

like image 439
psaw.mora Avatar asked Aug 12 '18 13:08

psaw.mora


1 Answers

Barriers don't make other threads/cores wait. They make some operations in the current thread wait, depending on what kind of barrier it is. Out-of-order execution of non-memory instructions isn't necessarily blocked.

Barriers don't even make your loads/stores visible to other threads any faster; CPU cores already commit (retired) stores from the store buffer to L1d cache as fast as possible. (After all the necessary MESI coherency rules have been followed, and x86's strong memory model only allows stores to commit in program order even without barriers).

Barriers don't necessarily order instruction execution, they order global visibility, i.e. what comes out the far end of the store buffer.


mfence (or a locked operation like lock add or xchg [mem], reg) makes all later loads/stores in the current thread wait until all previous loads and stores are completed and globally visible (i.e. the store buffer is flushed).

mfence on Skylake is implemented in a way that stalls the whole core until the store buffer drains. See my answer on Are loads and stores the only instructions that gets reordered? for details; this extra slowdown was to fix an erratum. But locked operations and xchg aren't like that on Skylake; they're full memory barriers but they still allow out-of-order execution of imul eax, edx, so we have proof that they don't stall the whole core.

With hyperthreading, I think this stalling happens per logical thread, not the whole core.

But note that the mfence manual entry doesn't say anything about stalling the core, so future x86 implementations are free to make it more efficient (like a lock or dword [rsp], 0), and only prevent later loads from reading L1d cache without blocking later non-load instructions.


sfence only does anything if there are any NT stores in flight. It doesn't order loads at all, so it doesn't have to stop later instructions from executing. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.

It just places a barrier in the store buffer that stops NT stores from reordering across it, and forces earlier NT stores to be globally visible before the sfence barrier can leave the store buffer. (i.e. write-combining buffers have to flush). But it can already have retired from the out-of-order execution part of the core (the ROB, or ReOrder Buffer) before it reaches the end of the store buffer.)

See also Does a memory barrier ensure that the cache coherence has been completed?


lfence as a memory barrier is nearly useless: it only prevents movntdqa loads from WC memory from reordering with later loads/stores. You almost never need that.

The actual use-cases for lfence mostly involve its Intel (but not AMD) behaviour that it doesn't allow later instructions to execute until it itself has retired. (so lfence; rdtsc on Intel CPUs lets you avoid having rdtsc read the clock too soon, as a cheaper alternative to cpuid; rdtsc)

Another important recent use-case for lfence is to block speculative execution (e.g. before a conditional or indirect branch), for Spectre mitigation. This is completely based on its Intel-guaranteed side effect of being partially serializing, and has nothing to do with its LoadLoad + LoadStore barrier effect.

lfence does not have to wait for the store buffer to drain before it can retire from the ROB, so no combination of LFENCE + SFENCE is as strong as MFENCE. Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?


Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (when writing in C++ instead of asm).

Note that the C++ intrinsics like _mm_sfence also block compile-time memory ordering. This is often necessary even when the asm instruction itself isn't, because C++ compile-time reordering happens based on C++'s very weak memory model, not the strong x86 memory model which applies to the compiler-generated asm.

So _mm_sfence may make your code work, but unless you're using NT stores it's overkill. A more efficient option would be std::atomic_thread_fence(std::memory_order_release) (which turns into zero instructions, just a compiler barrier.) See http://preshing.com/20120625/memory-ordering-at-compile-time/.

like image 103
Peter Cordes Avatar answered Dec 05 '22 06:12

Peter Cordes