Say I have two threads that manipulate the global variable <code>x</code>. Each thread (or each core I suppose) will have a cached copy of <code>x</code>. Now say that <code>Thread A</code> executes the following instructions: <pre class="prettyprint"><code>set x to 5 some other instruction </code></pre> Now when <code>set x to 5</code> is executed, the cached value of <code>x</code> will be set to <code>5</code>, this will cause the cache coherence protocol to act and update the caches of the other cores with the new value of <code>x</code>. Now my question is: when <code>x</code> is actually set to <code>5</code> in <code>Thread A</code>'s cache, do the caches of the other cores get updated before <code>some other instruction</code> is executed? Or should a memory barrier be used to ensure that?: <pre class="prettyprint"><code>set x to 5 memory barrier some other instruction </code></pre> Note: Assume that the instructions were executed in order, also assume that when <code>set x to 5</code> is executed, <code>5</code> is immediately placed in Thread A`'s cache (so the instruction was not placed in a queue or something to be executed later).

The memory barriers present on the x86 architecture - but this is true in general - not only guarantee that all the previous1 loads, or stores, are completed before any subsequent load or store is executed - they also guarantee that the stores have became globally visible. By globally visible it is meant that other cache-aware agents - like other CPUs - can see the store. Other agents non aware of the caches - like a DMA capable device - will not usually see the store if the target memory has been marked with a cache type that doesn't enforce an immediate write into memory. This has nothing to do with the barrier it-self, it is a simple fact of the x86 architecture: caches are visible to the programmer and when dealing with hardware they are usually disabled. Intel is purposely generic on the description of the barriers because it doesn't want to tie her-self to a specific implementation. You need to think in abstract: globally visible implies that the hardware will take all the necessary steps to make the store globally visible. Period. To understand the barriers however it is worth taking a look at the current implementations. Note that Intel is free to turn the modern implementation up-side down at will, as long it keep the visible behaviour correct. A store in an x86 CPU is executed in the core, then placed in the store buffer. For example <code>mov DWORD [eax+ebx*2+4], ecx</code>, once decoded is stalled until <code>eax</code>, <code>ebx</code> and <code>ecx</code> are ready2 then it is dispatched to an execution unit capable of computing its address. When the execution is done the store has become a pair (address, value) that is moved into the store buffer. The store is said to be completed locally (in the core). The store buffer allows the OoO part of the CPU to forget about the store and consider it completed even if an attempt to write is has not even been made yet. Upon specific events, like a serialization event, an exception, the execution of a barrier or the exhaustion of the buffer, the CPU flushes the store buffer. The flush is always in order - First In, First written. From the store buffer the store enters the realm of the cache. It can be combined yet into another buffer called the Write Combining buffer (and later written into memory by-passing the caches) if the target address is marked with a WC cache type, it can be written into the L1D cache, the L2, the L3 or the LLC if it is not one of the previous if the cache type is WB or WT. It can also be written directly in memory if the cache type is UC or WT. <hr> As today that's what it means to become globally visible: leave the store buffer. Beware of two very important things: <ol> <li>The cache type still influences the visibility. Globally visible doesn't mean visible in memory, it means visible where loads from other cores will see it. If the memory region is WB cacheable, the load could end in the cache, so it is globally visible there - only for the agent aware of the existence of the cache. (But note that most DMA on modern x86 is cache-coherent).</li> <li>This also apply to the WC buffer that is non-coherent. The WC is not kept coherent - its purpose is to coalesce the stores to memory areas where the order doesn't matter, like a framebuffer. This is not really globally visible yet, only after the write-combining buffer is flushed can anything outside the core see it. </li> </ol> <code>sfence</code> does exactly that: wait for all the previous stores to complete locally and then drains the store buffer. Since each store in the store buffer can potentially miss, you see how heavy such instruction is. (But out-of-order execution including later loads can continue. Only <code>mfence</code> would block later loads from being globally visible (reading from L1d cache) until after the store buffer finishes committing to cache.) But does <code>sfence</code> wait for the store to propagates into other caches? Well, no. Because there is not propagation - lets see what a write into the cache implies from an high-level perspective. The cache is kept coherent among all the processors with the MESI protocol (MESIF for multi-socket Intel systems, MOESI for AMD ones). We will only see MESI. Suppose the writes indexes the cache line L, and suppose all the processors has this line L in their caches with the same value. The state of this line is Shared, in every CPU. When our stores lands in the cache, L is marked as Modified and a special transaction is made on the internal bus (or QPI for multi-socket Intel systems) to invalidate line L in other processors. If L was not initially in the S state, the protocol is changed accordingly (e.g. if L is in state Exclusive no transactions on the bus are done[1]). At this point the write is complete and <code>sfence</code> completes. This is enough to keep the cache coherent. When another CPU request line L, our CPU snoops that request and L is flushed to memory or into the internal bus so the other CPU will read the updated version. The state of L is set to S again. So basically L is read on-demand - this makes sense since propagating the write to other CPU is expensive and some architectures do it by writing L back into memory (this works because the other CPU has L in state Invalid so it must read it from memory). <hr> Finally it is not true that <code>sfence</code> et all are normally useless, on the contrary they are extremely useful. It is just that normally we don't care how other CPUs see us making our stores - but acquiring a lock without an acquiring semantic as defined, for example, in C++, and implemented with the fences, is totally nuts. You should think of the barriers as Intel says: they enforce the order of global visibility of memory accesses. You can help your self understanding this by thinking of the barriers as enforcing the order or writing into the cache. The cache coherence will then take rest of assuring that a write to a cache is globally visible. I can't help but stress out one more time that cache coherency, global visibility and memory ordering are three different concepts. The first guarantees the second, that is enforced by the third. <pre class="prettyprint"><code>Memory ordering -- enforces --> Global visibility -- needs -> Cache coherency '.______________________________'_____________.' ' Architectural ' ' '._______________________________________.' micro-architectural </code></pre> <hr> Footnotes: <ol> <li>In program order. </li> <li> That was a simplification. On Intel CPUs, <code>mov [eax+ebx*2+4], ecx</code> decodes into two separate uops: store-address and store-data. The store-address uop has to wait until <code>eax</code> and <code>ebx</code> are ready, then it is dispatched to an execution unit capable of computing its address. That execution unit writes the address into the store buffer, so later loads (in program order) can check for store-forwarding. When <code>ecx</code> is ready, the store-data uop can dispatch to the store-data port, and write the data into the same store buffer entry. This can happen before or after the address is known, because the store-buffer entry is reserved probably in program order, so the store buffer (aka memory order buffer) can keep track of load / store ordering once the address of everything is eventually known, and check for overlaps. (And for speculative loads that ended up violating x86's memory ordering rules if another core invalidated the cache line they loaded from before the earliest point they were architecturally allowed to laod. This leads to a memory-order mis-speculation pipeline clear.) </li> </ol>

Does a memory barrier ensure that the cache coherence has been completed?

Tags:

operating-system

x86

assembly

cpu-cache

memory-barriers

Say I have two threads that manipulate the global variable x. Each thread (or each core I suppose) will have a cached copy of x.

Now say that Thread A executes the following instructions:

set x to 5 some other instruction

Now when set x to 5 is executed, the cached value of x will be set to 5, this will cause the cache coherence protocol to act and update the caches of the other cores with the new value of x.

Now my question is: when x is actually set to 5 in Thread A's cache, do the caches of the other cores get updated before some other instruction is executed? Or should a memory barrier be used to ensure that?:

set x to 5 memory barrier some other instruction

Note: Assume that the instructions were executed in order, also assume that when set x to 5 is executed, 5 is immediately placed in Thread A`'s cache (so the instruction was not placed in a queue or something to be executed later).

978

asked Mar 12 '17 11:03

Christopher

2 Answers

The memory barriers present on the x86 architecture - but this is true in general - not only guarantee that all the previous¹ loads, or stores, are completed before any subsequent load or store is executed - they also guarantee that the stores have became globally visible.

By globally visible it is meant that other cache-aware agents - like other CPUs - can see the store.
Other agents non aware of the caches - like a DMA capable device - will not usually see the store if the target memory has been marked with a cache type that doesn't enforce an immediate write into memory.
This has nothing to do with the barrier it-self, it is a simple fact of the x86 architecture: caches are visible to the programmer and when dealing with hardware they are usually disabled.

Intel is purposely generic on the description of the barriers because it doesn't want to tie her-self to a specific implementation.
You need to think in abstract: globally visible implies that the hardware will take all the necessary steps to make the store globally visible. Period.

To understand the barriers however it is worth taking a look at the current implementations.
Note that Intel is free to turn the modern implementation up-side down at will, as long it keep the visible behaviour correct.

A store in an x86 CPU is executed in the core, then placed in the store buffer.
For example mov DWORD [eax+ebx*2+4], ecx, once decoded is stalled until eax, ebx and ecx are ready² then it is dispatched to an execution unit capable of computing its address.
When the execution is done the store has become a pair (address, value) that is moved into the store buffer.
The store is said to be completed locally (in the core).

The store buffer allows the OoO part of the CPU to forget about the store and consider it completed even if an attempt to write is has not even been made yet.

Upon specific events, like a serialization event, an exception, the execution of a barrier or the exhaustion of the buffer, the CPU flushes the store buffer.
The flush is always in order - First In, First written.

From the store buffer the store enters the realm of the cache.
It can be combined yet into another buffer called the Write Combining buffer (and later written into memory by-passing the caches) if the target address is marked with a WC cache type, it can be written into the L1D cache, the L2, the L3 or the LLC if it is not one of the previous if the cache type is WB or WT.
It can also be written directly in memory if the cache type is UC or WT.

As today that's what it means to become globally visible: leave the store buffer.
Beware of two very important things:

The cache type still influences the visibility.
Globally visible doesn't mean visible in memory, it means visible where loads from other cores will see it.
If the memory region is WB cacheable, the load could end in the cache, so it is globally visible there - only for the agent aware of the existence of the cache. (But note that most DMA on modern x86 is cache-coherent).
This also apply to the WC buffer that is non-coherent.
The WC is not kept coherent - its purpose is to coalesce the stores to memory areas where the order doesn't matter, like a framebuffer. This is not really globally visible yet, only after the write-combining buffer is flushed can anything outside the core see it.

sfence does exactly that: wait for all the previous stores to complete locally and then drains the store buffer.
Since each store in the store buffer can potentially miss, you see how heavy such instruction is. (But out-of-order execution including later loads can continue. Only mfence would block later loads from being globally visible (reading from L1d cache) until after the store buffer finishes committing to cache.)

But does sfence wait for the store to propagates into other caches?
Well, no.
Because there is not propagation - lets see what a write into the cache implies from an high-level perspective.

The cache is kept coherent among all the processors with the MESI protocol (MESIF for multi-socket Intel systems, MOESI for AMD ones).
We will only see MESI.

Suppose the writes indexes the cache line L, and suppose all the processors has this line L in their caches with the same value.
The state of this line is Shared, in every CPU.

When our stores lands in the cache, L is marked as Modified and a special transaction is made on the internal bus (or QPI for multi-socket Intel systems) to invalidate line L in other processors.

If L was not initially in the S state, the protocol is changed accordingly (e.g. if L is in state Exclusive no transactions on the bus are done^[1]).

At this point the write is complete and sfence completes.

This is enough to keep the cache coherent.
When another CPU request line L, our CPU snoops that request and L is flushed to memory or into the internal bus so the other CPU will read the updated version.
The state of L is set to S again.

So basically L is read on-demand - this makes sense since propagating the write to other CPU is expensive and some architectures do it by writing L back into memory (this works because the other CPU has L in state Invalid so it must read it from memory).

Finally it is not true that sfence et all are normally useless, on the contrary they are extremely useful.
It is just that normally we don't care how other CPUs see us making our stores - but acquiring a lock without an acquiring semantic as defined, for example, in C++, and implemented with the fences, is totally nuts.

You should think of the barriers as Intel says: they enforce the order of global visibility of memory accesses.
You can help your self understanding this by thinking of the barriers as enforcing the order or writing into the cache. The cache coherence will then take rest of assuring that a write to a cache is globally visible.

I can't help but stress out one more time that cache coherency, global visibility and memory ordering are three different concepts.
The first guarantees the second, that is enforced by the third.

Memory ordering -- enforces --> Global visibility -- needs -> Cache coherency '.______________________________'_____________.'                            '                  Architectural  '                                           '                                  '._______________________________________.'                                              micro-architectural

Footnotes:

In program order.
That was a simplification. On Intel CPUs, mov [eax+ebx*2+4], ecx decodes into two separate uops: store-address and store-data. The store-address uop has to wait until eax and ebx are ready, then it is dispatched to an execution unit capable of computing its address. That execution unit writes the address into the store buffer, so later loads (in program order) can check for store-forwarding.

When ecx is ready, the store-data uop can dispatch to the store-data port, and write the data into the same store buffer entry.

This can happen before or after the address is known, because the store-buffer entry is reserved probably in program order, so the store buffer (aka memory order buffer) can keep track of load / store ordering once the address of everything is eventually known, and check for overlaps. (And for speculative loads that ended up violating x86's memory ordering rules if another core invalidated the cache line they loaded from before the earliest point they were architecturally allowed to laod. This leads to a memory-order mis-speculation pipeline clear.)

167

answered Oct 06 '22 00:10

Margaret Bloom

Now when set x to 5 is executed, the cached value of x will be set to 5, this will cause the cache coherence protocol to act and update the caches of the other cores with the new value of x.

There are multiple different x86 CPUs with different cache coherency protocols (none, MESI, MOESI), plus different types of caching (uncached, write-combining, write-only, write-through, write-back).

In general when a write is being done (when setting x to 5) the CPU determines the type of caching being done (from MTRRs or TLBs), and if the cache line could be cached it checks its own cache to determine what state that cache line is in (from its own perspective).

Then the type of caching and the state of the cache line is used to determine if the data is written directly to the physical address space (bypassing caches), or if it has to fetch the cache line from elsewhere while simultaneously telling other CPUs to invalidate old copies, or if it has exclusive access in its own caches and can modify it in the cache without telling anything.

A CPU never "injects" data into another CPU's cache (and only tells other CPUs to invalidate/discard their copy of a cache line). Telling other CPUs to invalidate/discard their copy of a cache line causes them to fetch the current copy of it if/when they want it again.

Note that none of this has anything to do with memory barriers.

There are 3 types of memory barriers (sfence, lfence and mfence), which tell the CPU to complete stores, loads or both before allowing later stores, loads or both to occur. Because the CPU is normally cache coherent anyway these memory barriers/fences are normally pointless/unnecessary. However there are situations where the CPU is not cache coherent (incuding "store forwarding", when the write-combining caching type is being used, when non-temporal stores are being used, etc). Memory barriers/fences are needed to enforce ordering (if necessary) for these special/rare cases.

answered Oct 06 '22 00:10

Brendan

Related questions
                            
                                What is the 'shadow space' in x64 assembly?
                            
                                Converting very simple ARM instructions to binary/hex
                            
                                What are the first operations that the Linux Kernel executes on boot?
                            
                                What does the LEAL assembly instruction do?
                            
                                JNZ & CMP Assembly Instructions
                            
                                What is a code cave, and is there any legitimate use for one?
                            
                                Carry Flag, Auxiliary Flag and Overflow Flag in Assembly
                            
                                x64 memset core, is passed buffer address truncated?
                            
                                How many instructions are there on x86 today? [closed]
                            
                                How does this C program without libc work?
                            
                                Producing good add with carry code from clang
                            
                                GDB: Attempt to dereference generic pointer
                            
                                difference between MMX and XMM register?
                            
                                Good reference for x86 assembly instructions [closed]
                            
                                System Calls in windows & Native API?
                            
                                Assembly 'call' vs 'jmp'
                            
                                CPU Switches from User mode to Kernel Mode : What exactly does it do? How does it makes this transition?
                            
                                History of syscalls added to Linux?
                            
                                Why doesn't GCC use partial registers?
                            
                                Does Linux use x86 CPU's PCID feature for TLB? If not, why?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With