Commonly, a cache line is 64B, but the atomicity of non-volatile memory is 8B. For example:

x[1]=100;
x[2]=100;
clflush(x);

x is cache-line aligned and is initially set to 0.

The system crashes in clflush(). Is it possible that x[1]=0 and x[2]=100 after reboot?
Under the assumption that x is of the write-back (WB) memory type and lies within a single cache line:
The global observability order of stores may differ from the persist order on Intel x86 processors. This is referred to as relaxed persistency. The only case in which the order is guaranteed to be the same is a sequence of WB stores to the same cache line (but a store reaching GO doesn't necessarily mean it has become durable). This is because CLFLUSH is atomic and WB stores cannot be reordered in global observability. See: On x86-64, is the “movnti” or "movntdq" instruction atomic when system crash?
The x86-TSO memory model doesn't allow reordering stores, so it's impossible for another agent to observe x[2] == 100 and x[1] != 100 during normal operation (i.e., in the volatile state, without a crash). However, if the system crashed and rebooted, it's possible for the persistent state to be x[2] == 100 and x[1] != 100. This is possible even if the system crashed after retiring clflush, because the retirement of clflush doesn't necessarily mean that the flushed cache line has reached the persistence domain.
If you want to eliminate that possibility, you can either move the clflush as follows:

x[1]=100;
clflush(x);
x[2]=100;

clflush on Intel processors is ordered with respect to all writes, meaning that the line is guaranteed to reach the persistence domain before any later store becomes globally observable. See the Persistent Memory Programming Primer (PDF) and the Intel SDM Volume 2. The second store could be to the same line or to any other line.
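For reference, here's a minimal C sketch of that pattern using the _mm_clflush intrinsic (a sketch only, assuming x is a hypothetical 64-byte-aligned buffer backed by NVM; volatile is used so the compiler can't reorder the two stores, a point the second answer below elaborates on):

#include <stdint.h>
#include <emmintrin.h>          // _mm_clflush (SSE2)

// Hypothetical buffer: 64-byte aligned and mapped to non-volatile memory.
extern volatile uint8_t x[64];

void persist_x1_before_x2_is_visible(void)
{
    x[1] = 100;                          // first store
    _mm_clflush((const void *)&x[0]);    // flush the line holding x[1] and x[2]
    x[2] = 100;                          // on Intel, clflush is ordered w.r.t. this later store
}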
If you want x[1]=100 to become persistent before x[2]=100 becomes globally observable, add an sfence after the clflush on Intel CSX, or an mfence on AMD processors (clflush is only ordered by mfence on AMD processors). clflush by itself is sufficient to control persist order.
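A sketch of that variant with intrinsics (same hypothetical x as above; per the note above, on AMD you'd use _mm_mfence() instead of _mm_sfence()):

#include <stdint.h>
#include <emmintrin.h>          // _mm_clflush, _mm_sfence

extern volatile uint8_t x[64];  // hypothetical: 64-byte aligned, backed by NVM

void persist_x1_before_storing_x2(void)
{
    x[1] = 100;
    _mm_clflush((const void *)&x[0]);
    _mm_sfence();               // fence after the flush (use _mm_mfence() on AMD)
    x[2] = 100;
}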
Alternatively, use the sequence clflushopt+sfence (or clwb+sfence) as follows:

x[1]=100;
clflushopt(x);
sfence;
x[2]=100;
In this case, if a crash happened and x[2] == 100 in the persistent state, then it's guaranteed that x[1] == 100. clflushopt by itself doesn't impose any persist ordering.
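The same sequence expressed with intrinsics (again a sketch with the same hypothetical x; _mm_clflushopt requires compiling with -mclflushopt):

#include <stdint.h>
#include <immintrin.h>          // _mm_clflushopt, _mm_sfence

extern volatile uint8_t x[64];  // hypothetical: 64-byte aligned, backed by NVM

void persist_x1_then_store_x2(void)
{
    x[1] = 100;
    _mm_clflushopt((void *)&x[0]);  // weakly ordered: no persist ordering by itself
    _mm_sfence();                   // orders the flush before the next store
    x[2] = 100;                     // if x[2]==100 persisted, x[1]==100 did too
}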
(Also see @Hadi's answer: x86 TSO store ordering does not guarantee persistence ordering even within one line. This answer doesn't try to address that. My best guess based on Hadi's answer is that a single atomic store to one 32-byte half of a cache line will persist atomically, but that's based on how current HW works, transferring lines in 2 32-byte halves between cores, caches, and memory controllers. If this really matters, look for docs or ask Intel.)
Remember that store data can propagate out of cache (into DRAM or NVDIMM) on its own, before explicit flushing.
The following sequence of events is possible:
1. x[2]=100; stores the 3rd byte of the cache line first. (Compile-time reordering: this is a C question, not an asm question, and x is apparently plain uint8_t x[64], not _Atomic or volatile, so x[1]=100; and x[2]=100; aren't guaranteed to happen in that order in the asm.)
2. x[] is evicted all the way out of cache, into the persistence domain. (Perhaps after a context switch to another thread, so lots of other code runs between those two asm stores.)
3. x[1]=100; executes, but the system crashes before it finishes becoming durable.

If you want to depend on x86 memory ordering rules to control durability order within a cache line, you need to make sure the C respects that. volatile would work, or _Atomic with memory_order_release for at least the 2nd store. (Or better, get them done as a single store if it's within an aligned 8-byte chunk.) (The x86 asm memory model is program order with a store buffer; no StoreStore reordering.)
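A sketch of that single-store suggestion: x[1] and x[2] both live in the aligned 8-byte chunk at the start of the line, so one aligned 8-byte store can write them together and there's never a state with only one of them. (This assumes, per the hedging later in this answer, that an atomic store within an 8-byte-aligned block persists atomically; the other bytes of the chunk are written as 0, their initial value in the question. You'd still flush afterwards if you want it durable promptly; the point here is only the tearing.)

#include <stdint.h>

extern uint8_t x[64];   // hypothetical: 64-byte aligned, backed by NVM, initially all zero

void store_x1_and_x2_together(void)
{
    uint64_t chunk = 0;              // x[0] and x[3..7] keep their initial 0
    ((uint8_t *)&chunk)[1] = 100;    // x[1]
    ((uint8_t *)&chunk)[2] = 100;    // x[2]
    *(volatile uint64_t *)x = chunk; // one aligned 8-byte store: atomic on x86,
                                     // so x[1] and x[2] can't be torn apart
}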
Compile-time reordering doesn't usually happen for no reason (but it can); more often it's because surrounding code makes it appealing, and surrounding code could cause that here. (And of course x[1]=100; / x[2]=0; can happen by this mechanism without any compile-time reordering, if it's done as 2 separate stores.)
I think a necessary pre-condition for atomicity of durability is being done as a single atomic store, e.g. guaranteed atomic by the ISA, or with a single wider SIMD store (footnote 1), because Intel CPUs in practice don't split those up (but there's no on-paper guarantee of that). Being atomic wrt. interrupts (i.e. a single instruction) but not a single store uop makes it harder to split up, but still completely possible (footnote 2), and thus not guaranteed safe. e.g. a 10-byte x87 fstp tbyte that involves 2 separate store-data uops can be split up by an invalidation from another core, which is possible even without false sharing. (See footnote 2 again.)
Without any on-paper atomicity guarantee for 16-byte or wider SIMD stores, you'd be depending on implementation details for SIMD stores or misaligned stores to not be split up.
Even ISA-guaranteed atomicity isn't sufficient, though: a lock cmpxchg that spans a cache-line boundary is still guaranteed atomic wrt. other cores and DMA readers. (Supporting this is very, very slow; don't do it.) But there's no way to guarantee those two lines become durable at the same time. Outside of that special case for atomicity, IDK, I can't rule out whole-line atomicity. It's certainly plausible that a plain store to a single line which is atomic in asm will become durable atomically, with no chance of tearing.
I'd guess that an atomic store within an 8-byte-aligned block will make it to persistence atomically or not at all, but I haven't checked Intel's docs. (And in practice perhaps even a whole 64-byte line, which you could store with AVX512). The point of this answer is that you don't even have a single atomic store so there are lots of other mechanisms for breaking your test-case.
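And a sketch of the whole-line version mentioned above, using AVX-512 (atomic in practice on current hardware, but with no on-paper guarantee; compile with -mavx512f -mclwb; x is the same hypothetical NVM-backed, 64-byte-aligned buffer):

#include <stdint.h>
#include <immintrin.h>   // _mm512_loadu_si512, _mm512_store_si512, _mm_clwb, _mm_sfence

extern uint8_t x[64];    // hypothetical: 64-byte aligned, backed by NVM

void store_whole_line_then_flush(void)
{
    uint8_t img[64] = {0};                 // desired image of the whole line
    img[1] = 100;
    img[2] = 100;
    __m512i v = _mm512_loadu_si512(img);
    _mm512_store_si512((void *)x, v);      // one 64-byte store covering the full line
    _mm_clwb(x);                           // write the line back toward the persistence domain
    _mm_sfence();                          // order the write-back before later stores
}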
Footnote 1: Modern Intel CPUs commit SIMD stores to L1d cache as a single transaction as long as they don't span a cache line. Intel hasn't made a CPU that splits SIMD stores into 2 halves since Sandy/Ivy Bridge which had full-width 256-bit AVX execution units but only 128-bit wide paths to/from cache in the load units and AFAIK in the store-buffer-commit stuff. (The store-data execution unit also took 2 cycles to write 32-byte store data to the store buffer).
Footnote 2: For separate store uops that are part of the same instruction, like in fstp tbyte [rdi], this might be possible:

1. The first part commits from the store buffer to L1d cache.
2. An RFO or share-request arrives and is handled before the 2nd store from the same instruction commits: this core's copy is now Invalid or Shared, so the commit from the store buffer to L1d is blocked until it regains exclusive ownership. The 2nd part of the store from that one instruction is at the head of the store buffer, not in coherent cache.
3. The other core that was doing the RFO followed up their store with a clflush, evicting this line to persistent memory before the first core can get it back and finish committing the other data from that one instruction.
An NT store like movnti by another core would force eviction of the line as part of committing the NT store, like a normal store + clflushopt.
This scenario requires false sharing between two threads trying to persist 2 separate things in the same line, so it can be avoided if you avoid false sharing, e.g. with padding. (Or some insane true sharing, or firing off clflush without storing first, on memory that other threads might be in the middle of writing.)
(Or more plausible for software, much less plausible for hardware): The line gets evicted on its own before the first writer gets it back, even though a core has a pending RFO for it. (As soon as it loses ownership, the first core would send out an RFO).
(Or fully plausible without false-sharing): Forced eviction from L2/L1d at any time due to eviction from an inclusive cache-line tracking structure. This could be triggered by demand for lines that merely happen to alias the same set in L3, not false sharing.
Skylake-server (SKX) has non-inclusive L3, as do later Intel server CPUs. Cascade Lake (CSX) was the first to support persistent memory. Even though it has a non-inclusive L3, the snoop filter is inclusive and a fill conflict that causes an eviction does cause a back invalidation in the entire NUMA node.
So an invalidate request can arrive at any time, and it's likely the core / store buffer isn't going to hold onto the line for more cycles to commit an unknown number of more stores to the same line.
(By that point, the fact that both store-buffer entries were part of one instruction is probably lost. It's possible for an access pattern to create a stream of store buffer entries that store different parts of the same cache line indefinitely, so waiting until "all stores for this line are done" could let unprivileged code create a denial-of-service for a core that wanted to read it. So I think it's unlikely that HW would have a mechanism to avoid releasing ownership of a cache line between stores that came from the same instruction.)