Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?

As we know from a previous answer to Does it make any sense instruction LFENCE in processors x86/x86_64? that we can not use SFENCE instead of MFENCE for Sequential Consistency.

An answer there suggests that MFENCE = SFENCE+LFENCE, i.e. that LFENCE does something without which we can not provide Sequential Consistency.

LFENCE makes impossible to reordering:

SFENCE
LFENCE
MOV reg, [addr]

-- To -->

MOV reg, [addr]
SFENCE
LFENCE

For example reordering of MOV [addr], reg LFENCE --> LFENCE MOV [addr], reg provided by mechanism - Store Buffer, which reorders Store - Loads for performance increase, and beacause LFENCE does not prevent to it. And SFENCE disables this mechanism.

What mechanism disables the LFENCE to make impossible reordering (x86 have not mechanism - Invalidate-Queue)?

And is reordering of SFENCE MOV reg, [addr] --> MOV reg, [addr] SFENCE possible only in theory or perhaps in reality? And if possible, in reality, what mechanisms, how does it work?

like image 882
Alex Avatar asked Dec 23 '14 21:12

Alex


2 Answers

x86 fence instructions can be briefly described as follows:

  • MFENCE prevents any later loads or stores from becoming globally observable before any earlier loads or stores. It drains the store buffer before later loads1 can execute.

  • LFENCE blocks instruction dispatch (Intel's terminology) until all earlier instructions retire. This is currently implemented by draining the ROB (ReOrder Buffer) before later instructions can issue into the back-end.

  • SFENCE only orders stores against other stores, i.e. prevents NT stores from committing from the store buffer ahead of SFENCE itself. But otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it like putting a divider on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not necessarily force the store buffer to be drained before it retires from the ROB, so putting LFENCE after it doesn't add up to MFENCE.

  • A "serializing instruction" like CPUID (and IRET, etc) drains everything (ROB, store buffer) before later instructions can issue into the back-end. MFENCE + LFENCE would also do that, but true serializing instructions might also have other effects, I don't know.

These descriptions are a little ambiguous in terms of exactly what kind of operations are ordered and there are some differences across vendors (e.g. SFENCE is stronger on AMD) and even processors from the same vendor. Refer to the Intel's manual and specification updates and AMD's manual and revision guides for more information. There are also a lot of other discussions on these instructions on SO other other places. But read the official sources first. The descriptions above are I think the minimum specified on-paper behaviour across vendors.

Footnote 1: OoO exec of later stores don't need to be blocked by MFENCE; executing them just writes data into the store buffer. In-order commit already orders them after earlier stores, and commit after retirement orders wrt. loads (because x86 requires loads to complete, not just to start, before they can retire, as part of ensuring load ordering). Remember that x86 hardware is built to disallow reordering other than StoreLoad.

The Intel manual Volume 2 number 325383-072US describes SFENCE as an instructions that "ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible." Volume 3 Section 11.10 says that the store buffer is drained when using the SFENCE. The correct interpretation of this statement is exactly the earlier statement from Volume 2. So SFENCE can be said to drain the store buffer in that sense. There is no guarantee at what point during SFENCE's lifetime earlier stores achieve GO. For any earlier store, it could happen before, at, or after retirement of SFENCE. Regarding what the point of GO is, it depends on serveral factors. This is beyond the scope of the question. See: Why “movnti” followed by an “sfence” guarantees persistent ordering?.

MFENCE does have to prevent NT stores from reordering with other stores, so it has to include whatever SFENCE does, as well as draining the store buffer. And also reordering of weakly-ordered SSE4.1 NT loads from WC memory, which is harder because the normal rules that get load ordering for free no longer apply to those. Guaranteeing this is why a Skylake microcode update strengthened (and slowed) MFENCE to also drain the ROB like LFENCE. It might still be possible for MFENCE to be lighter weight than that with HW support for optionally enforcing ordering of NT loads in the pipeline.


The main reason why SFENCE + LFENCE is not equal to MFENCE is because SFENCE + LFENCE doesn't block StoreLoad reordering, so it's not sufficient for sequential consistency. Only mfence (or a locked operation, or a real serializing instruction like cpuid) will do that. See Jeff Preshing's Memory Reordering Caught in the Act for a case where only a full barrier is sufficient.


From Intel's instruction-set reference manual entry for sfence:

The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.

but

It is not ordered with respect to memory loads or the LFENCE instruction.


LFENCE forces earlier instructions to "complete locally" (i.e. retire from the out-of-order part of the core), but for a store or SFENCE that just means putting data or a marker in the memory-order buffer, not flushing it so the store becomes globally visible. i.e. SFENCE "completion" (retirement from the ROB) doesn't include flushing the store buffer.

This is like Preshing describes in Memory Barriers Are Like Source Control Operations, where StoreStore barriers aren't "instant". Later in that that article, he explains why a #StoreStore + #LoadLoad + a #LoadStore barrier doesn't add up to a #StoreLoad barrier. (x86 LFENCE has some extra serialization of the instruction stream, but since it doesn't flush the store buffer the reasoning still holds).

LFENCE is not fully serializing like cpuid (which is as strong a memory barrier as mfence or a locked instruction). It's just LoadLoad + LoadStore barrier, plus some execution serialization stuff which maybe started as an implementation detail but is now enshrined as a guarantee, at least on Intel CPUs. It's useful with rdtsc, and for avoiding branch speculation to mitigate Spectre.


BTW, SFENCE is a no-op for WB (normal) stores.

It orders WC stores (such as movnt, or stores to video RAM) with respect to any stores, but not with respect to loads or LFENCE. Only on a CPU that's normally weakly-ordered does a store-store barrier do anything for normal stores. You don't need SFENCE unless you're using NT stores or memory regions mapped WC. If it did guarantee draining the store buffer before it could retire, you could build MFENCE out of SFENCE+LFENCE, but that isn't the case for Intel.


The real concern is StoreLoad reordering between a store and a load, not between a store and barriers, so you should look at a case with a store, then a barrier, then a load.

mov  [var1], eax
sfence
lfence
mov   eax, [var2]

can become globally visible (i.e. commit to L1d cache) in this order:

lfence
mov   eax, [var2]     ; load stays after LFENCE

mov  [var1], eax      ; store becomes globally visible before SFENCE
sfence                ; can reorder with LFENCE
like image 74
Peter Cordes Avatar answered Nov 17 '22 00:11

Peter Cordes


In general MFENCE != SFENCE + LFENCE. For example the code below, when compiled with -DBROKEN, fails on some Westmere and Sandy Bridge systems but appears to work on Ryzen. In fact on AMD systems just an SFENCE seems to be sufficient.

#include <atomic>
#include <thread>
#include <vector>
#include <iostream>
using namespace std;

#define ITERATIONS (10000000)
class minircu {
        public:
                minircu() : rv_(0), wv_(0) {}
                class lock_guard {
                        minircu& _r;
                        const std::size_t _id;
                        public:
                        lock_guard(minircu& r, std::size_t id) : _r(r), _id(id) { _r.rlock(_id); }
                        ~lock_guard() { _r.runlock(_id); }
                };
                void synchronize() {
                        wv_.store(-1, std::memory_order_seq_cst);
                        while(rv_.load(std::memory_order_relaxed) & wv_.load(std::memory_order_acquire));
                }
        private:
                void rlock(std::size_t id) {
                        rab_[id].store(1, std::memory_order_relaxed);
#ifndef BROKEN
                        __asm__ __volatile__ ("mfence;" : : : "memory");
#else
                        __asm__ __volatile__ ("sfence; lfence;" : : : "memory");
#endif
                }
                void runlock(std::size_t id) {
                        rab_[id].store(0, std::memory_order_release);
                        wab_[id].store(0, std::memory_order_release);
                }
                union alignas(64) {
                        std::atomic<uint64_t>           rv_;
                        std::atomic<unsigned char>      rab_[8];
                };
                union alignas(8) {
                        std::atomic<uint64_t>           wv_;
                        std::atomic<unsigned char>      wab_[8];
                };
};

minircu r;

std::atomic<int> shared_values[2];
std::atomic<std::atomic<int>*> pvalue(shared_values);
std::atomic<uint64_t> total(0);

void r_thread(std::size_t id) {
    uint64_t subtotal = 0;
    for(size_t i = 0; i < ITERATIONS; ++i) {
                minircu::lock_guard l(r, id);
                subtotal += (*pvalue).load(memory_order_acquire);
    }
    total += subtotal;
}

void wr_thread() {
    for (size_t i = 1; i < (ITERATIONS/10); ++i) {
                std::atomic<int>* o = pvalue.load(memory_order_relaxed);
                std::atomic<int>* p = shared_values + i % 2;
                p->store(1, memory_order_release);
                pvalue.store(p, memory_order_release);

                r.synchronize();
                o->store(0, memory_order_relaxed); // should not be visible to readers
    }
}

int main(int argc, char* argv[]) {
    std::vector<std::thread> vec_thread;
    shared_values[0] = shared_values[1] = 1;
    std::size_t readers = (argc > 1) ? ::atoi(argv[1]) : 8;
    if (readers > 8) {
        std::cout << "maximum number of readers is " << 8 << std::endl; return 0;
    } else
        std::cout << readers << " readers" << std::endl;

    vec_thread.emplace_back( [=]() { wr_thread(); } );
    for(size_t i = 0; i < readers; ++i)
        vec_thread.emplace_back( [=]() { r_thread(i); } );
    for(auto &i: vec_thread) i.join();

    std::cout << "total = " << total << ", expecting " << readers * ITERATIONS << std::endl;
    return 0;
}
like image 5
Alexander Korobka Avatar answered Nov 16 '22 22:11

Alexander Korobka