How are barriers/fences and acquire/release semantics implemented microarchitecturally?

A lot of questions on SO and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, and Preshing's articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and the rest of his series, talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is: how are these barriers and memory-ordering semantics implemented microarchitecturally on x86 and ARM?

For store-store barriers, it seems that on x86 the store buffer maintains program order of stores and commits them to L1D in that order (hence making them globally visible in the same order). If the store buffer is not ordered, i.e. does not maintain stores in program order, how is a store-store barrier implemented? Is it just "marking" the store buffer in such a way that stores before the barrier commit to the cache-coherent domain before stores after it? Or does the memory barrier actually flush the store buffer and stall all instructions until the flush is complete? Could it be implemented both ways?

For load-load barriers, how is load-load reordering prevented? It is hard to believe that x86 executes all loads in order! I assume loads can execute out of order but commit/retire in order. If so, when a CPU executes two loads to two different locations, how does it ensure that the first load got a value from, say, time T100 and the next one got its value at or after T100? What if the first load misses in the cache and is waiting for data while the second load hits and gets its value? When load 1 finally gets its value, how does the CPU ensure that this value is not from a newer store than load 2's value? If loads can execute out of order, how are violations of memory ordering detected?

Similarly, how are load-store barriers (implicit in all loads on x86) implemented, and how are store-load barriers (such as mfence) implemented? I.e. what do the dmb ld/st and plain dmb instructions do microarchitecturally on ARM, and what do every load, every store, and the mfence instruction do microarchitecturally on x86 to ensure memory ordering?

asked Sep 23 '19 21:09

Raghu

1 Answer

Much of this has been covered in other Q&As (especially the later "C++ How is release-and-acquire achieved on x86 only using MOV?"), but I'll give a summary here. Still, good question: it's useful to collect this all in one place.

On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".
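To make the "every load is an acquire-load" point concrete at the source level, here is a minimal message-passing sketch in C++. The names (`payload`, `ready`, `run_message_passing`) are made up for illustration and are not from the linked Q&As. With mainstream compilers targeting x86, the release store and the acquire load below typically compile to plain `mov` instructions, because the hardware ordering already covers them; on a weakly-ordered ISA the same source needs special load/store instructions or barriers.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, chosen only for this illustration.
int payload = 0;                  // plain (non-atomic) data
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // x86: typically a plain mov store
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {  // x86: typically a plain mov load
        // spin until the flag is published
    }
    return payload;  // must observe 42 once the acquire load has seen true
}

// Runs one producer and one consumer and returns what the consumer read.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(seen != -1);  // the consumer always writes a result before exiting
    return seen;
}
```

Calling `run_message_passing()` from `main` (and linking with thread support) is enough to try it; the interesting part is looking at the generated asm for `producer` and `consumer`.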

Weakly-ordered ISAs don't have to speculate; they can just load in any order.

x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).
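As a rough software analogy of that guess (not a description of any real design), here is a toy circular store buffer in C++. Everything in it is an assumption for illustration: the entry fields, the size of 8, and the method names. It only shows the idea that entries are allocated at the tail in program order, filled in any order when the store-address and store-data uops execute, and committed strictly from the head.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: fields, size and names are invented for this sketch.
struct StoreBufferEntry {
    std::uint64_t addr = 0;
    std::uint64_t data = 0;
    bool addr_ready = false;  // store-address uop has executed
    bool data_ready = false;  // store-data uop has executed
    bool retired = false;     // the store instruction has retired
};

class ToyStoreBuffer {
public:
    static constexpr std::size_t kSize = 8;

    // Allocate at the tail, in program order. Returns the entry index, or -1 if full.
    int allocate() {
        if (count_ == kSize) return -1;
        int idx = static_cast<int>(tail_);
        entries_[tail_] = StoreBufferEntry{};
        tail_ = (tail_ + 1) % kSize;
        ++count_;
        return idx;
    }

    // Execution just fills an already-allocated entry; any order is fine.
    void write_addr(int idx, std::uint64_t a) { at(idx).addr = a; at(idx).addr_ready = true; }
    void write_data(int idx, std::uint64_t d) { at(idx).data = d; at(idx).data_ready = true; }
    void retire(int idx) { at(idx).retired = true; }

    // Commit only from the head, so stores become visible in program order.
    // Appends the committed address to `log` and returns true on success.
    bool commit_one(std::vector<std::uint64_t>& log) {
        if (count_ == 0) return false;
        const StoreBufferEntry& e = entries_[head_];
        if (!(e.retired && e.addr_ready && e.data_ready)) return false;
        log.push_back(e.addr);
        head_ = (head_ + 1) % kSize;
        --count_;
        return true;
    }

private:
    StoreBufferEntry& at(int idx) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < kSize);
        return entries_[static_cast<std::size_t>(idx)];
    }

    std::array<StoreBufferEntry, kSize> entries_{};
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
};
```

In this model a younger store whose address and data are ready still cannot commit while an older store sits unfinished at the head, which is the StoreStore ordering property described above.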

Avoiding LoadStore is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads as well as tracking them in the ROB: let them retire once they're known to be non-faulting, even if the data hasn't arrived.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)
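The memcpy remark can be sketched in C++ as a copy loop that issues several independent loads before touching any of their results. This is only an illustration of the source-level shape: the function name and the unroll factor of 4 are arbitrary, it is a simplified load-grouping form rather than full software pipelining across iterations, and an optimizing compiler is free to reschedule it, so the generated asm is what matters on a real in-order core.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch; name and unroll factor are arbitrary choices.
void copy_pipelined(std::uint64_t* dst, const std::uint64_t* src, std::size_t n) {
    assert(n == 0 || (dst != nullptr && src != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Issue four independent loads into separate registers first...
        std::uint64_t a = src[i + 0];
        std::uint64_t b = src[i + 1];
        std::uint64_t c = src[i + 2];
        std::uint64_t d = src[i + 3];
        // ...and only then consume them, so a cache miss on one load
        // does not block the other loads from being started.
        dst[i + 0] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < n; ++i) {
        dst[i] = src[i];  // leftover elements
    }
}
```

The contrast is a loop that does `dst[i] = src[i]` one element at a time: on an in-order core that stalls when a not-yet-arrived load result is read, each store would wait for its own load before the next load could even start.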

"How is load->store reordering possible with in-order commit?" goes into this more deeply, for in-order vs. out-of-order.

Barrier instructions

The only barrier instruction that does anything for regular stores is mfence, which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. "Are loads and stores the only instructions that gets reordered?" covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.
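The case where x86 actually needs a full barrier is StoreLoad, and the classic way to see it from software is the "store buffering" litmus test. Below is a minimal C++ sketch with made-up names (`sb_x`, `sb_y`, `both_zero_observed`). With `seq_cst` on all four operations, the outcome where both threads read 0 is forbidden by the C++ memory model; on x86, compilers typically get that by making the store an `xchg` (which has an implicit lock prefix) or a `mov` followed by `mfence`, either of which makes the later load wait for the store buffer to drain.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for the two shared locations.
std::atomic<int> sb_x{0};
std::atomic<int> sb_y{0};

// Runs the store-buffering litmus test `iterations` times.
// Returns true if any run ended with both threads having read 0.
bool both_zero_observed(int iterations) {
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        sb_x.store(0, std::memory_order_seq_cst);
        sb_y.store(0, std::memory_order_seq_cst);
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            sb_x.store(1, std::memory_order_seq_cst);  // store, then full barrier
            r1 = sb_y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            sb_y.store(1, std::memory_order_seq_cst);
            r2 = sb_x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) return true;  // forbidden with seq_cst
    }
    return false;
}
```

If the stores were weakened to `memory_order_release` and the loads to `memory_order_acquire`, the both-zero outcome would be allowed, and x86 hardware can produce it because each load may execute while the thread's own store is still sitting in the store buffer. Spawning threads per iteration, as this sketch does, is a poor way to actually catch that in practice; it is kept this way only for brevity.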

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

C++ How is release-and-acquire achieved on x86 only using MOV?
How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?
How many memory barriers instructions does an x86 CPU have?
How can I experience "LFENCE or SFENCE can not pass earlier read/write"
Does lock xchg have the same behavior as mfence?
Does the Intel Memory Model make SFENCE and LFENCE redundant?
"Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
"When should I use _mm_sfence _mm_lfence and _mm_mfence": high-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven't used any NT stores just makes your code slower, for no benefit, than atomic_thread_fence(mo_release).
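For that last point, here is a minimal sketch of publishing data with `std::atomic_thread_fence` instead of an intrinsic like `_mm_sfence()`. The names (`data_a`, `data_b`, `flag`, `run_fence_example`) are invented for the example. When targeting x86, mainstream compilers typically emit no instruction at all for the release and acquire fences here; the fences only stop the compiler from reordering the surrounding memory accesses.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, used only in this sketch.
int data_a = 0;
int data_b = 0;
std::atomic<int> flag{0};

void publish() {
    data_a = 1;
    data_b = 2;
    // Orders the plain stores above before the flag store below.
    // x86: typically no asm instruction, just a compiler barrier.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

int consume() {
    while (flag.load(std::memory_order_relaxed) == 0) {
        // spin until the flag is set
    }
    // Orders the flag load above before the plain loads below.
    std::atomic_thread_fence(std::memory_order_acquire);
    return data_a + data_b;
}

// Runs one publisher and one consumer and returns the consumer's sum.
int run_fence_example() {
    data_a = 0;
    data_b = 0;
    flag.store(0, std::memory_order_relaxed);
    int sum = -1;
    std::thread c([&] { sum = consume(); });
    std::thread p(publish);
    p.join();
    c.join();
    assert(sum != -1);  // the consumer always produces a result
    return sum;
}
```

Swapping the release fence for `_mm_sfence()` would add a real instruction to the x86 asm without adding any ordering that these ordinary (non-NT) stores need.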

answered Nov 06 '22 22:11

Peter Cordes

Tags: cpu-architecture, x86, x86-64, memory-barriers, micro-architecture
