
Does the memory fence involve the kernel

After asking this question, I've understood that the atomic instruction, such as test-and-set, would not involve the kernel. Only if a process needs to be put to sleep (to wait to acquire the lock) or woken (because it couldn't acquire the lock but now can), then the kernel has to be involved to perform the scheduling operations.
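For illustration only (hypothetical names), a minimal spinlock built on test-and-set looks something like this; the fast path is pure user-space, and the kernel would only become involved if a real lock implementation chose to sleep (e.g. via futex) instead of spinning:

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock() {
    // atomic RMW in user space; no system call happens here
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // busy-wait until the holder clears the flag
    }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);
}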

If so, does it mean that the memory fence, such as std::atomic_thread_fence in c++11, won't also involve the kernel?

Yves asked Mar 02 '23 at 20:03


1 Answer

std::atomic doesn't involve the kernel¹

On almost all normal CPUs (the kind we program for in real life), memory-barrier instructions are unprivileged and are emitted directly by the compiler, the same way compilers know how to emit instructions like x86 lock add [rdi], eax for fetch_add (or lock xadd if you use the return value). On other ISAs, a fence compiles to literally the same barrier instructions the compiler already places before/after loads, stores, and RMWs to give them the required ordering. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
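For example (a sketch you can paste into Godbolt; the function names are just for illustration), on x86-64 GCC and clang compile the first function below to lock add dword ptr [rdi], 1 and the second, which uses the return value, to lock xadd. No kernel involvement in either case; they're ordinary unprivileged instructions:

#include <atomic>

void increment(std::atomic<int>& counter) {
    counter.fetch_add(1);               // result unused: lock add
}

int increment_and_get(std::atomic<int>& counter) {
    return counter.fetch_add(1) + 1;    // result used: lock xadd
}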

On some arbitrary hypothetical hardware and/or compiler, anything is of course possible, even if it would be catastrophically bad for performance.

In asm, a barrier just makes this core wait until some previous (program-order) operations are visible to other cores. It's a purely local operation. (At least, this is how real-world CPUs are designed, so that sequential consistency is recoverable with only local barriers to control local ordering of load and/or store operations. All cores share a coherent view of cache, maintained via a protocol like MESI. Non-coherent shared-memory systems exist, but implementations don't run C++ std::thread across them, and they typically don't run a single-system-image kernel.)

Footnote 1: (Even non-lock-free atomics usually use light-weight locking).
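A quick way to check whether a given std::atomic specialization is lock-free on your target (a sketch; the struct name is just for illustration, and even the non-lock-free fallback locking is user-space on the uncontended path):

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; };   // too large for a single atomic instruction

int main() {
    std::atomic<int> small{0};
    std::atomic<Big> big{Big{}};
    std::printf("int lock-free: %d, Big lock-free: %d\n",
                (int)small.is_lock_free(), (int)big.is_lock_free());
}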

Also, ARM before ARMv7 apparently didn't have proper memory barrier instructions. On ARMv6, GCC uses mcr p15, 0, r0, c7, c10, 5 as a barrier.
Before that (g++ -march=armv5 and earlier), GCC doesn't know what to do and calls __sync_synchronize (a GCC library helper function) which hopefully is implemented somehow for whatever machine the code is actually running on. This may involve a system call on a hypothetical ARMv5 multi-core system, but more likely the binary will be running on an ARMv7 or v8 system where the library function can run a dmb ish. Or if it's a single-core system then it could be a no-op, I think. (C++ memory ordering cares about other C++ threads, not about memory order as seen by possible hardware devices / DMA. Normally implementations assume a multi-core system, but this library function might be a case where a single-core-only implementation could be used.)


On x86 for example, std::atomic_thread_fence(std::memory_order_seq_cst) compiles to mfence. Weaker barriers like std::atomic_thread_fence(std::memory_order_release) only have to block compile-time reordering; x86's runtime hardware memory model is already acq/rel (seq-cst + a store buffer). So there aren't any asm instructions corresponding to the barrier. (One possible implementation for a C++ library would be GNU C asm("" ::: "memory");, but GCC/clang do have barrier builtins.)
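A sketch of that (hypothetical names), assuming GCC or clang targeting x86-64: the release fence costs no instructions at run time; it only stops the compiler from sinking the data store below the flag store:

#include <atomic>

int data;
std::atomic<bool> ready{false};

void publish(int v) {
    data = v;                                             // plain store
    std::atomic_thread_fence(std::memory_order_release);  // no instruction on x86; compile-time barrier only
    ready.store(true, std::memory_order_relaxed);         // compiles to a plain mov
}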

std::atomic_signal_fence only ever has to block compile-time reordering, even on weakly-ordered ISAs, because all real-world ISAs guarantee that execution within a single thread sees its own operations as happening in program order. (Hardware implements this by having loads snoop the store buffer of the current core.) Even VLIW and IA-64 EPIC, or other explicit-parallelism ISA mechanisms (like Mill with its delayed-visibility loads), still make it possible for the compiler to generate code that respects any C++ ordering guarantees involving the fence if an asynchronous signal (or interrupt, for kernel code) arrives after any instruction.
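For example (hypothetical names), ordering against a signal handler that runs in the same thread needs only atomic_signal_fence, which never emits a barrier instruction on any ISA; it is purely a compiler barrier:

#include <atomic>
#include <csignal>

int payload;
volatile std::sig_atomic_t payload_ready = 0;

void prepare(int v) {                   // called from normal code
    payload = v;
    std::atomic_signal_fence(std::memory_order_release);  // compiler barrier only
    payload_ready = 1;
}

void on_signal(int) {                   // installed with std::signal
    if (payload_ready) {
        std::atomic_signal_fence(std::memory_order_acquire);
        // payload is now safe to read
    }
}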


You can look at code-gen yourself on the Godbolt compiler explorer:

#include <atomic>
void barrier_sc(void) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

x86: mfence.
POWER: sync.
AArch64: dmb ish (full barrier on "inner shareable" coherence domain).
ARM with gcc -mcpu=cortex-a15 (or -march=armv7): dmb ish
RISC-V: fence iorw,iorw

void barrier_acq_rel(void) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
}

x86: nothing
POWER: lwsync (light-weight sync).
AArch64: still dmb ish
ARM: still dmb ish
RISC-V: still fence iorw,iorw

void barrier_acq(void) {
    std::atomic_thread_fence(std::memory_order_acquire);
}

x86: nothing
POWER: lwsync (light-weight sync).
AArch64: dmb ishld (load barrier, doesn't have to drain the store buffer)
ARM: still dmb ish, even with -mcpu=cortex-a53 (an ARMv8) :/
RISC-V: still fence iorw,iorw

Peter Cordes answered Mar 05 '23 at 17:03