
How many memory barrier instructions does an x86 CPU have?

I have found that an x86 CPU has the following memory barrier instructions: mfence, lfence, and sfence.

Does an x86 CPU have only these three memory barrier instructions, or are there more?

Steve asked Jan 29 '23 03:01



1 Answer

sfence (SSE1) and mfence / lfence (SSE2) are the only instructions that are named for their memory fence/barrier functionality. Unless you're using NT stores and/or WC memory (and NT loads from WC), only mfence is needed for memory ordering.
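As a rough sketch of the one case where sfence matters, here is an NT store followed by a flag store. The helper name publish_with_nt_store is made up for illustration, and the non-x86 branch is only a fallback so the snippet compiles elsewhere; it assumes the compiler provides <immintrin.h>.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_stream_si32 (movnti), _mm_sfence (sfence)
#define HAVE_X86_FENCES 1
#endif

// Hypothetical helper: write `value` with a non-temporal (NT) store,
// then publish a "ready" flag that another thread could poll.
inline void publish_with_nt_store(int* payload, std::atomic<int>& ready, int value) {
#ifdef HAVE_X86_FENCES
    _mm_stream_si32(payload, value);  // NT store: weakly ordered vs. later stores
    _mm_sfence();                     // order the NT store before the flag store
#else
    *payload = value;                 // fallback for non-x86 builds (assumption)
#endif
    ready.store(1, std::memory_order_release);  // plain mov on x86
}
```

Without the sfence, a reader that sees ready == 1 would not be guaranteed to see the NT-stored payload, because the release store by itself compiles to an ordinary mov.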

(Note that lfence on Intel CPUs is also a barrier for out-of-order execution, so it can serialize rdtsc, and is useful for Spectre mitigation to prevent speculative execution. On AMD, there's an MSR that has to be set, otherwise lfence is basically a nop (4/cycle throughput). That MSR was introduced with Spectre-mitigation microcode updates, and is normally set by updated kernels.)
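A minimal sketch of the rdtsc use case, assuming GCC or Clang on x86-64 (for <x86intrin.h>). The function name fenced_timestamp is hypothetical, and the std::chrono branch is only there so the snippet builds on other targets.

```cpp
#include <cstdint>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>  // __rdtsc, _mm_lfence
// Hypothetical helper: read the TSC with lfence on both sides.
inline std::uint64_t fenced_timestamp() {
    _mm_lfence();                  // earlier instructions finish before rdtsc runs
    std::uint64_t t = __rdtsc();
    _mm_lfence();                  // later instructions don't start before rdtsc
    return t;
}
#else
#include <chrono>
// Fallback (assumption): a monotonic clock tick count instead of the TSC.
inline std::uint64_t fenced_timestamp() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif
```

Calling it before and after a region gives two timestamps to subtract. On AMD this relies on the MSR mentioned above being set, otherwise the lfence calls may not serialize anything.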


Locked instructions like lock add [mem], eax are also full memory barriers; see "Does lock xchg have the same behavior as mfence?". (They are possibly not as strong as mfence for ordering NT loads from WC memory, though; see "Do locked instructions provide a barrier between weakly-ordered accesses?".) xchg [mem], reg has an implicit lock prefix, so it's also a barrier.

In my testing on Skylake, locked instructions do block reordering of NT stores with regular stores, using this code: https://godbolt.org/g/7Q9xgz.

xchg seems to be a good way to do a seq-cst store, especially on Intel hardware like Skylake where mfence also blocks out-of-order execution of pure ALU instructions, like lfence: See the bottom of this answer.
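In C++ you normally don't write the xchg yourself; a sketch of the two usual spellings is below. The function names are hypothetical, and which instruction you actually get depends on the compiler and version (xchg, or mov followed by mfence, are both common), so check the generated asm rather than trusting the comments.

```cpp
#include <atomic>

// Hypothetical helper: a seq-cst store. On x86, compilers commonly emit
// xchg [mem], reg for this, or mov + mfence on some versions.
inline void store_seq_cst(std::atomic<int>& slot, int v) {
    slot.store(v, std::memory_order_seq_cst);
}

// Hypothetical helper: an atomic exchange, which maps naturally to xchg
// (implicit lock prefix) and also returns the previous value.
inline int swap_seq_cst(std::atomic<int>& slot, int v) {
    return slot.exchange(v, std::memory_order_seq_cst);
}
```

If the old value isn't needed, the plain store is the clearer way to express intent; the exchange form is mainly useful when you want to be sure a locked instruction is used.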

AMD also recommends using xchg or other locked instructions instead of mfence. (mfence is documented in the AMD manuals as serializing on AMD, so it will always have the penalty of blocking OoO exec).


For sequentially-consistent stores or full barriers on 32-bit targets without SSE, compilers typically use lock or [esp], 0 or another no-op locked instruction just for the memory-barrier effect. That's what g++7.3 -O3 -m32 -mno-sse does for std::atomic_thread_fence(std::memory_order_seq_cst);.
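For reference, here is a sketch of where such a fence matters: the store-then-load (StoreLoad) pattern. The helper name store_then_load is made up for illustration; the fence is whatever the compiler picks (mfence or a dummy locked instruction).

```cpp
#include <atomic>

// Hypothetical helper: store to our own flag, run a full barrier, then read
// the other side's flag. If two threads each call this with the arguments
// swapped, the fence rules out the outcome where both calls return 0.
inline int store_then_load(std::atomic<int>& mine, const std::atomic<int>& other) {
    mine.store(1, std::memory_order_relaxed);             // plain mov
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence or a no-op locked insn
    return other.load(std::memory_order_relaxed);         // plain mov
}
```

Without the fence, x86's store buffer allows the load to take its value before the store becomes globally visible, which is the one reordering x86 permits for normal (WB) memory.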

But anyway, neither mfence nor locked insns are architecturally defined as serializing on Intel, regardless of implementation details on some CPUs.


Fully serializing instructions like cpuid are also full memory barriers, draining the store buffer as well as flushing the pipeline. "Does lock xchg have the same behavior as mfence?" has relevant quotes from Intel's manual.

On Intel processors, the following are architecturally serializing instructions (From: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-273.html):

  • Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, and WRMSR.

    Exceptions: MOV CR8 isn't serializing. WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) are not serializing.

  • Non-privileged serializing instructions — CPUID, IRET (1), and RSM

On AMD processors, the following are architecturally serializing instructions:

  • Privileged serializing instructions — INVD, INVLPG, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, WRMSR, and SWAPGS.

  • Non-privileged serializing instructions — MFENCE, CPUID, IRET, and RSM

The term "[fully] serializing instruction" on Intel processors means the same exact thing as on AMD processors except for one difference: a cache line flushing operation from CLFLUSH (but not CLFLUSHOPT) is ordered with respect to later instructions by only MFENCE on AMD processors.
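CPUID is the only instruction in the lists above that is easy to run from user space, so here is a sketch of executing it from C++. It assumes GCC or Clang, which ship <cpuid.h> with __get_cpuid; the function name serialize_and_get_vendor is hypothetical, and the non-x86 branch just returns an empty string.

```cpp
#include <string>
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>  // __get_cpuid (GCC/Clang helper header)
// Hypothetical helper: execute CPUID leaf 0 (a serializing instruction)
// and return the 12-byte vendor string it reports in EBX, EDX, ECX.
inline std::string serialize_and_get_vendor() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) return "";
    std::string vendor;
    vendor.append(reinterpret_cast<const char*>(&ebx), 4);
    vendor.append(reinterpret_cast<const char*>(&edx), 4);
    vendor.append(reinterpret_cast<const char*>(&ecx), 4);
    return vendor;
}
#else
// Fallback (assumption): no CPUID available on this target.
inline std::string serialize_and_get_vendor() { return ""; }
#endif
```

Using cpuid purely as a barrier is slow (and traps to the hypervisor inside a VM), so this is more of a demonstration than something to put in a hot path.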


in / out (and their string-copy versions ins and outs) are full memory barriers, and also partially serializing (like lfence). The docs say they delay execution of the next instruction until after "the data phase" of the I/O transaction.


Footnotes:

(1) According to BJ137 (Sandy Bridge), HSD152 (Haswell), BDM103 (Broadwell):

Problem: An IRET instruction that results in a task switch by returning from a nested task does not serialize the processor (contrary to the Software Developer’s Manual Vol. 3 section titled "Serializing Instructions").

Implication: Software which depends on the serialization property of IRET during task switching may not behave as expected. Intel has not observed this erratum to impact the operation of any commercially available software.

Workaround: None identified. Software can execute an MFENCE instruction immediately prior to the IRET instruction if serialization is needed.

Peter Cordes answered Feb 02 '23 02:02
