Ok, I have been reading the following Qs from SO regarding x86 CPU fences (LFENCE, SFENCE and MFENCE):
Does it make any sense instruction LFENCE in processors x86/x86_64?
What is the impact SFENCE and LFENCE to caches of neighboring cores?
Is the MESI protocol enough, or are memory barriers still required? (Intel CPUs)
and:
http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c
and I must be honest: I am still not totally sure when a fence is required. I am trying to understand this from the perspective of removing fully-blown locks and using finer-grained locking via fences, to minimise latency.
Firstly here are two specific questions I do not understand:
Sometimes when doing a store a CPU will write to its store buffer instead of the L1 cache. However, I do not understand the conditions under which a CPU will do this.
CPU2 may wish to load a value which has been written into CPU1's store buffer. As I understand it, the problem is that CPU2 cannot see the new value in CPU1's store buffer. Why can't the MESI protocol just include flushing of store buffers as part of the protocol?
More generally, could somebody please attempt to describe the overall scenario and help explain when LFENCE/MFENCE and SFENCE instructions are required?
NB One of the problems reading around this subject is the number of articles written "generally" for multiple CPU architectures, when I am only interested in the Intel x86-64 architecture specifically.
The LFENCE instruction provides a performance-efficient way of ensuring load ordering between routines that produce weakly-ordered results and routines that consume that data. Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types.
It serialises the load stream: all preceding loads complete before any subsequent load begins. Note that LFENCE does not drain the store buffer; ordering stores is the job of SFENCE/MFENCE.
The simplest answer: you must use one of the 3 fences (LFENCE, SFENCE, MFENCE) to provide one of the 6 C++11 memory-consistency orderings:
C++11:
Initially, you should consider this problem from the point of view of the degree of ordering of memory accesses, which is well documented and standardized in C++11. You should read this first: http://en.cppreference.com/w/cpp/atomic/memory_order
x86/x86_64:
1. Acquire-Release Consistency: Then, it is important to understand that on x86, access to conventional RAM (marked by default as WB - Write-Back; the effect is the same with WT (Write-Through) or UC (Uncacheable)) using a plain MOV, without any additional instructions, automatically provides Acquire-Release ordering - std::memory_order_acq_rel.
I.e. for this memory it makes sense to use std::memory_order_seq_cst only to provide Sequential Consistency. When you use std::memory_order_relaxed or std::memory_order_acq_rel, the compiled assembler code for std::atomic::store() (or std::atomic::load()) will be the same - only a MOV, without any L/S/MFENCE.
Note: But you must know that not only the CPU but also the C++ compiler can reorder operations on memory, and all 6 memory orders always constrain the C++ compiler, regardless of CPU architecture.
Then, you must know how this can be compiled from C++ to ASM (native machine code), or how you can write it in assembler. To provide any consistency except Sequential you can simply write a MOV, for example MOV reg, [addr] and MOV [addr], reg, etc.
2. Sequential Consistency: But to provide Sequential Consistency you must use implicit (LOCK) or explicit fences (L/S/MFENCE), as described here: Why GCC does not use LOAD(without fence) and STORE+SFENCE for Sequential Consistency?

1. LOAD (without fence) and STORE + MFENCE
2. LOAD (without fence) and LOCK XCHG
3. MFENCE + LOAD and STORE (without fence)
4. LOCK XADD (0) and STORE (without fence)

For example, GCC uses 1, but MSVC uses 2. (But you must know that MSVS2012 has a bug: Does the semantics of `std::memory_order_acquire` requires processor instructions on x86/x86_64?)
Then, you can read Herb Sutter, your link: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c
Exception to the rule:
This rule holds for access using MOV to conventional RAM marked by default as WB - Write-Back. The memory type is marked in the Page Table, in each PTE (Page Table Entry), for each page (4 KB of contiguous memory).
But there are some exceptions:
If we mark memory in the Page Table as Write-Combined (ioremap_wc() in Linux), then it automatically provides only Acquire Consistency, and we must act as in the following paragraph.
See answer to my question: https://stackoverflow.com/a/27302931/1558037
- Writes to memory are not reordered with other writes, with the following exceptions:
- writes executed with the CLFLUSH instruction;
- streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
- string operations (see Section 8.2.4.1).
In both cases 1 & 2 you must use an additional SFENCE between two writes to the same address, even if you want Acquire-Release Consistency, because here the hardware automatically provides only Acquire Consistency and you must do the Release (SFENCE) yourself.
Answer to your two questions:
Sometimes when doing a store a CPU will write to its store buffer instead of the L1 cache. I do not however understand the terms on which a CPU will do this?
From the user's point of view, the L1 cache and the Store Buffer act differently: L1 is fast, but the Store Buffer is faster.

The Store Buffer is a simple queue that holds only writes and does not reorder them. It exists to increase performance and hide the latency of cache access (L1 ~1 ns, L2 ~3 ns, L3 ~10 ns): the CPU core behaves as if the write has already reached the cache and executes the next instruction, while in reality the write is only sitting in the Store Buffer and will be written to the L1/2/3 cache later. I.e. the core does not have to wait for its writes to reach the cache.

The L1/2/3 caches look like a transparent associative array (address - value). They are fast but not the fastest, because x86 automatically provides Acquire-Release Consistency via the cache-coherence protocols MESIF/MOESI. This is done to simplify multithreaded programming, at some cost in performance. (Strictly speaking, one can use write-contention-free algorithms and data structures without cache coherence, i.e. without MESIF/MOESI, for example over PCI Express.) The MESIF/MOESI protocols work over QPI, which connects cores within a CPU and cores of different CPUs in multiprocessor systems (ccNUMA).
CPU2 may wish to load a value which has been written in to CPU1's store buffer. As I understand it, the problem is CPU2 cannot see the new value in CPU1's store buffer.
Yes.
Why can't the MESI protocol just include flushing store buffers as part of its protocol??
The MESI protocol can't just include flushing store buffers as part of its protocol, because the store buffer exists precisely so that a core does not have to wait for its writes to become globally visible; draining it on every store would defeat that latency hiding. The coherence protocol governs cache lines, while the store buffer sits between the core and the cache, below the protocol's view.
But manually flushing the Store Buffer on the current CPU core - yes, you can do it by executing the SFENCE instruction. You can use SFENCE in the two cases described above: for Write-Combined memory and for non-temporal (streaming) stores.
Note: Do we need LFENCE in any case on x86/x86_64? That question does not always have a clear answer: Does it make any sense instruction LFENCE in processors x86/x86_64?
Other platforms:
Then, you can read how this works in theory (for a spherical processor in a vacuum) with a Store Buffer and an Invalidate Queue, your link: http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
And here is how you can provide Sequential Consistency on other platforms, not only with L/S/MFENCE and LOCK but also with LL/SC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html