I notice that Intel Tremont has 64-byte store instructions, `MOVDIRI` and `MOVDIR64B`.

These guarantee an atomic write to memory, but don't guarantee load atomicity. Moreover, the write is weakly ordered, so an immediately-following fence may be needed.
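As a minimal sketch (my illustration, not from the Intel doc), the usage pattern with the `_movdir64b` intrinsic plus a trailing fence might look like this; `store_line` is a made-up name, and it assumes GCC/Clang with `-mmovdir64b` on a Tremont-class CPU:

```c
#include <immintrin.h>

// dst must be 64-byte aligned; src can be any readable 64 bytes.
static void store_line(void *dst, const void *src)
{
    _movdir64b(dst, src);  // one weakly-ordered, atomic 64-byte direct store
    _mm_sfence();          // fence right after, if later stores must not pass it
}
```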
I find no `MOVDIRx` in Ice Lake. Why doesn't Ice Lake need instructions like `MOVDIRx`?
(See the bottom of page 15 of the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf#page=15)
Ice Lake has AVX512, which gives us 64-byte loads + stores, but no guarantee of 64-byte store atomicity.
We do get 64-byte NT stores with `movntps [mem], zmm` / `movntdq [mem], zmm`. Interestingly, NT stores don't support merge-masking to leave some bytes unwritten; that would basically defeat the purpose of NT stores by creating partial-line writes, though.
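For example, a 64-byte NT store might look like this (a sketch assuming AVX-512F with `-mavx512f` and a 64-byte-aligned destination; `nt_store_line` is a made-up name):

```c
#include <immintrin.h>

static void nt_store_line(void *dst, __m512i v)
{
    _mm512_stream_si512((__m512i *)dst, v);  // vmovntdq [dst], zmm: full-line NT store
    _mm_sfence();  // NT stores are weakly ordered; fence before "publishing" the data
}
```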
Probably Ice Lake Pentium / Celeron CPUs still won't have AVX1/2, let alone AVX512 (probably so Intel can sell chips with defects in the upper 128 bits of the FMA units and/or register file on at least one core), so only `rep movsb` will be able to use 64-byte loads/stores internally on those CPUs. (Ice Lake has the "fast short rep" feature, which may make `rep movsb` efficient even for small copies like 64 bytes; that's useful in kernel code that can't use vector regs.)
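In C, `rep movsb` is reachable with inline asm; a sketch (GNU syntax, x86-64 only; `repmovsb_copy` is a made-up name):

```c
#include <stddef.h>

// With ERMSB / "fast short rep", microcode can use wide (e.g. 64-byte)
// internal loads/stores without touching vector registers.
static void repmovsb_copy(void *dst, const void *src, size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)  // RDI, RSI, RCX
                 :: "memory");
}
```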
Possibly Intel can't (or doesn't want to) provide that atomicity guarantee on their mainstream CPUs, only on low-power chips that don't support multiple sockets, but I haven't heard any reports of tearing actually existing within a cache line on Intel CPUs. In practice, I think cached loads/stores that don't cross a cache-line boundary on current Intel CPUs are always atomic.
(Unlike on AMD K10, where HyperTransport did create tearing on 8-byte boundaries between sockets, while no tearing could be seen between cores on a single socket; see SSE instructions: which CPUs can do atomic 16B memory operations?)
In any case, there's no way to detect this with CPUID, and it's not documented, so it's basically impossible to take advantage of it safely. It would be nice if there were a CPUID leaf that told you the atomicity width for the system and within a single socket, so implementations that split 512-bit AVX512 ops into 256-bit halves would still be allowed...
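By contrast, the presence of the instructions themselves is enumerable: CPUID leaf 7 subleaf 0 reports MOVDIRI in ECX bit 27 and MOVDIR64B in ECX bit 28. A detection sketch using GCC/Clang's `<cpuid.h>` helper:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    // Leaf 7, subleaf 0: structured feature flags. Note this only reports
    // instruction presence; no leaf reports a store-atomicity width.
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("MOVDIRI:   %s\n", (ecx >> 27) & 1 ? "yes" : "no");
        printf("MOVDIR64B: %s\n", (ecx >> 28) & 1 ? "yes" : "no");
    }
    return 0;
}
```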
Anyway, rather than introducing a special instruction with guaranteed store atomicity, I think it would be more likely for CPU vendors to start documenting and providing CPUID detection of wider store atomicity for either all power-of-2-size stores, or for only NT stores, or something.
Making some part of AVX512 require 64-byte atomicity would make it much harder for AMD to support, if they follow their current strategy of half-width vector implementation. (Zen2 will have 256-bit vector ALUs, making AVX1/AVX2 instructions mostly single-uop, but reportedly it won't have AVX512 support, unfortunately. AVX512 is a very nice ISA even if you only use it at 256-bit width, filling more gaps in what can be done conveniently / efficiently, e.g. unsigned int<->FP and [u]int64<->double.)
So IDK if maybe Intel agreed not to do that, or chose not to for their own reasons.
I suspect the main use-case is reliably creating 64-byte PCIe transactions, not actually "atomicity" per se, and not for observation by another core.
If you cared about reading from other cores, normally you'd want L3 cache to backstop the data, not bypass it to DRAM. A seqlock is probably a faster way to emulate 64-byte atomicity between CPU cores, even if `movdir64b` is available.
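A minimal single-writer seqlock sketch (illustrative names, not a standard API). Caveat: the plain `memcpy` on the payload is formally a data race in C11, so real code would use relaxed atomic accesses for the payload, but this shows the idea:

```c
#include <stdatomic.h>
#include <string.h>

typedef struct {
    atomic_uint seq;   // even = stable, odd = writer in progress
    char data[64];
} seqlock64;

static void seq_write(seqlock64 *s, const void *src)
{
    unsigned q = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, q + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   // seq=odd visible before payload stores
    memcpy(s->data, src, 64);
    atomic_store_explicit(&s->seq, q + 2, memory_order_release);
}

static void seq_read(seqlock64 *s, void *dst)
{
    unsigned q0, q1;
    do {
        q0 = atomic_load_explicit(&s->seq, memory_order_acquire);
        memcpy(dst, s->data, 64);
        atomic_thread_fence(memory_order_acquire); // payload loads before re-check
        q1 = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((q0 & 1) || q0 != q1);  // retry if we overlapped a write
}
```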
Skylake already has 12 write-combining buffers (up from 10 in Haswell), so it's (maybe?) not too hard to use regular NT stores to create a full-size PCIe transaction, avoiding early flushes. But maybe low-power CPUs have fewer buffers and maybe it's a challenge to reliably create 64B transactions to a NIC buffer or something.
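To illustrate the idea (a sketch with SSE2; the names are mine): write the whole line back-to-back so a WC buffer can close as one full-line transaction, then fence before e.g. ringing a device doorbell. Unlike `movdir64b`, nothing guarantees the transaction won't still be split:

```c
#include <emmintrin.h>

// dst must be 64-byte aligned; v points to 64 bytes of payload.
static void nt_fill_line(void *dst, const __m128i v[4])
{
    __m128i *p = (__m128i *)dst;
    _mm_stream_si128(p + 0, v[0]);
    _mm_stream_si128(p + 1, v[1]);
    _mm_stream_si128(p + 2, v[2]);
    _mm_stream_si128(p + 3, v[3]);  // fourth store fills the 64-byte WC buffer
    _mm_sfence();  // flush/order WC buffers before notifying the device
}
```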