I have already seen this answer and this answer, but neither appears to be clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.
The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
The mfence documentation claims the following.
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?
Assuming you're not writing a device driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes, xchg is as strong as mfence.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem] weaker than mfence (on Pentium 4? Probably also on Sandybridge-family).
mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (See Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg / lock; it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly ordered (and on current implementations, it also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
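For reference, here is a minimal C++ sketch of how the two store shapes are usually spelled in portable code. The names flag, store_seq_cst, store_via_exchange, store_then_fence and demo are made up for this illustration, and the instruction choices mentioned in the comments are what compilers typically emit on x86-64, not something the C++ standard promises; check your own compiler's asm output.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variable, used only for this illustration.
std::atomic<int> flag{0};

// Plain seq_cst store. On x86-64 a compiler typically emits either
// xchg or mov + mfence here, depending on compiler and version.
void store_seq_cst(int v) {
    flag.store(v, std::memory_order_seq_cst);
}

// Exchange with the old value discarded. On x86 this is normally an
// xchg with a memory operand, which carries an implicit lock prefix.
void store_via_exchange(int v) {
    (void)flag.exchange(v, std::memory_order_seq_cst);
}

// Relaxed store followed by a standalone full fence: roughly the
// mov + fence shape. (Not formally identical to a seq_cst store in
// the C++ abstract model, but a full barrier after the store on x86.)
void store_then_fence(int v) {
    flag.store(v, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Single-threaded sanity check: every spelling publishes its value.
int demo() {
    store_seq_cst(1);
    assert(flag.load() == 1);
    store_via_exchange(2);
    assert(flag.load() == 2);
    store_then_fence(3);
    assert(flag.load() == 3);
    return flag.load();
}
```

In normal code you would just write the plain seq_cst store and let the compiler pick the instruction; the other two spellings are only there to make the xchg and mov + fence shapes visible in the generated asm.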
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there were some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume, but haven't checked, that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.