The MASKMOVDQU¹ instruction is special among x86 stores because, in principle, it lets you store individual bytes within a cache line without first loading the entire line all the way to the core so that the written bytes can be merged with the existing, not-overwritten bytes.
It would seem to work using the same mechanism as an NT store: pushing the cache line down without first doing an RFO. Per the Intel software developer manual (emphasis mine):
The MASKMOVQ instruction can be used to improve performance for algorithms that need to merge data on a byte-by-byte basis. It should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.
Unlike other NT stores, however, you can use a mask to specify which bytes are actually written.
If you want to make sparse, byte-granular writes across a large region that isn't likely to fit in any level of the cache, this instruction seems ideal.
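For concreteness, here's a sketch of the kind of usage I have in mind (the function name and mask pattern are just illustrative), using the SSE2 intrinsic for MASKMOVDQU:

```c
#include <emmintrin.h>   // SSE2: _mm_maskmoveu_si128 (MASKMOVDQU)
#include <stddef.h>

// Scatter a few bytes into every 16-byte chunk of a large buffer without
// reading the destination first. A byte of 'data' is stored only where the
// corresponding mask byte has its high bit set.
void sparse_byte_writes(char *dst, size_t len)
{
    __m128i data = _mm_set1_epi8(0x5A);
    // Write only bytes 0, 5, and 11 of each 16-byte chunk (arbitrary pattern).
    __m128i mask = _mm_set_epi8(0, 0, 0, 0, (char)0x80, 0, 0, 0,
                                0, 0, (char)0x80, 0, 0, 0, 0, (char)0x80);
    for (size_t i = 0; i + 16 <= len; i += 16)
        _mm_maskmoveu_si128(data, mask, dst + i);  // implied NT hint, no RFO?
    _mm_sfence();  // NT stores are weakly ordered
}
```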
Unlike almost every other useful instruction, Intel haven't extended it to 256 or 512 bits in AVX/AVX2 or AVX-512. Does this indicate that its use is no longer recommended, perhaps that it can't be implemented efficiently on current or future architectures?
¹ ... and its 64-bit MMX predecessor, MASKMOVQ.
MASKMOVDQU is indeed slow and probably never a good idea: roughly one per 6 cycles throughput on Skylake, or one per 18 cycles on Zen2 / Zen3.
I suspect that masked NT vector stores no longer work well for multi-core CPUs, so probably even the 128-bit version just sucks on modern x86 for masked writes, if there are any unmodified bytes in a full 64-byte line.
Regular (not NT) masked vector stores are back with a vengeance in AVX-512. Masked commit to L1d cache seems to be efficiently supported for that, and for dword / qword masking with AVX1 vmaskmovps/pd and the integer equivalents on Intel CPUs. (Although not on AMD: AMD only has efficient masked AVX1/2 loads, not stores. https://uops.info/table.html shows VPMASKMOVD M256, YMM, YMM on Zen3 as 42 uops with 12c throughput, about the same as Zen2, vs. 3 uops, 1c latency on Skylake. Masked loads are fine on AMD, 1 uop with 0.5c throughput, so actually better than Skylake for the AVX2 versions. Skylake probably does a compare-into-mask internally and uses the hardware designed for AVX-512.)
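For reference, here's what that dword-masked store looks like with intrinsics (a minimal sketch; the function name and mask pattern are just illustrative):

```c
#include <immintrin.h>   // AVX2: _mm256_maskstore_epi32 (vpmaskmovd)

// Store only the dword elements whose mask element has its high bit set.
// This is a regular cached store: cheap (a few uops) on Skylake, but the
// store form is microcoded (~42 uops) on Zen2/Zen3, unlike the load form.
void masked_dword_store(int *dst, __m256i data)
{
    // Keep elements 0, 2, 4, 6 (arbitrary illustrative pattern).
    __m256i mask = _mm256_set_epi32(0, -1, 0, -1, 0, -1, 0, -1);
    _mm256_maskstore_epi32(dst, mask, data);
}
```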
AVX512F made masking with dword/qword granularity a first-class citizen, with very efficient support for both loads and stores. AVX512BW adds 8- and 16-bit element sizes, including masked load/store like vmovdqu8, which is also efficiently supported on Intel hardware: a single uop even for stores.
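As a sketch of how you'd do a byte-masked store today (a normal cached store, not NT; the function name is mine), using the AVX-512BW intrinsic that compiles to a masked vmovdqu8:

```c
#include <immintrin.h>   // AVX-512BW: _mm512_mask_storeu_epi8 (vmovdqu8 with a mask)

// Byte-granular masked store: only the bytes whose mask bit is set are
// written. The line is committed to L1d like any other store (so it does
// get RFO'd), but the masking itself is a single uop on Intel CPUs.
void masked_byte_store(char *dst, __m512i data, unsigned long long byte_mask)
{
    _mm512_mask_storeu_epi8(dst, (__mmask64)byte_mask, data);
}
```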
The SDRAM bus protocol does support byte-masked writes (with 1 mask line per byte as part of a cache-line burst transfer). This Intel doc (about FPGAs or something) includes discussion of the DM (data mask) signals, confirming that DDR4 still has them, with the same function as the DQM lines described on Wikipedia for SDRAM: https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#SDR_SDRAM. (DDR1 changed it to write-mask only, not read-mask.)
So the hardware functionality is there, and presumably modern x86 CPUs use it for single-byte writes to uncacheable memory, for example.
(Update: byte-masking may be optional in DDR4, unlike in some earlier SDRAM / DDR versions. In that case, the store could still reach the memory controller in masked form, but the memory controller would have to read/modify/write the containing 8-byte chunk(s) using separate burst-read and burst-write commands to the actual DIMM. Chopping the bursts short is possible for stores that only affect part of a 64-byte DDR burst, saving some data bandwidth, but there's still the command overhead, and the store takes up buffer space in the memory controller for longer.)
No-RFO stores are great if we write a full line: we just invalidate other copies of the line and store to memory.
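For comparison, a minimal sketch of that full-line case, where four 16-byte NT stores fill a whole 64-byte line so no RFO is needed (assumes a 64-byte-aligned destination; the function name is mine):

```c
#include <emmintrin.h>   // SSE2 NT stores

// Fill one full 64-byte line with four 16-byte NT stores so the line-fill
// buffer can flush a complete line without a read-for-ownership.
void nt_store_full_line(char *dst, __m128i v)   // dst must be 64-byte aligned
{
    _mm_stream_si128((__m128i *)(dst +  0), v);
    _mm_stream_si128((__m128i *)(dst + 16), v);
    _mm_stream_si128((__m128i *)(dst + 32), v);
    _mm_stream_si128((__m128i *)(dst + 48), v);
    _mm_sfence();   // order the weakly-ordered NT stores before later stores
}
```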
John "Dr. Bandwidth" McCalpin says that normal NT stores that flush after filling a full 64-byte line will invalidate even lines that are dirty, without causing a writeback of the dirty data.
So masked NT stores need to use a different mechanism, because any masked-out bytes need to take their value from the dirty line in another core, not from whatever was in DRAM.
If the mechanism for partial-line NT stores isn't efficient, adding new instructions that generate them is unwise. I don't know whether it's more or less efficient than doing normal stores to part of a line, or whether that depends on the situation and uarch.
It doesn't have to be an RFO exactly, but it would mean that when such a store reaches the memory controller, it would have to get the snoop filter to make sure the line is in sync, or maybe merge with the old contents from cache before flushing to DRAM.
Or the CPU core could do an RFO and merge, before sending the full-line write down the memory hierarchy.
CPUs do already need some kind of mechanism for flushing partial-line NT stores when reclaiming an LFB that hasn't had all 64 bytes written yet, and we know that's not as efficient. (But I forget the details.) But maybe this is how maskmovdqu executes on modern CPUs, either always or if you leave any bytes unmodified.
An experiment could probably find out.
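If someone wants to run that experiment, here's a rough sketch of the kind of microbenchmark I mean (buffer size and mask patterns are arbitrary, and rdtsc counts reference cycles, not core clocks):

```c
#include <emmintrin.h>
#include <x86intrin.h>   // __rdtsc (GCC/Clang)
#include <stdio.h>
#include <stdlib.h>

// Time MASKMOVDQU over a buffer much larger than L3, once with an all-ones
// mask (every byte of the line written) and once with a partial mask
// (some bytes of every 64-byte line left unmodified), then compare.
static unsigned long long time_maskmov(char *buf, size_t len, __m128i mask)
{
    __m128i data = _mm_set1_epi8(1);
    unsigned long long t0 = __rdtsc();
    for (size_t i = 0; i + 16 <= len; i += 16)
        _mm_maskmoveu_si128(data, mask, buf + i);
    _mm_sfence();
    return __rdtsc() - t0;
}

int main(void)
{
    size_t len = 256u << 20;                       // 256 MiB, bigger than any cache
    char *buf = malloc(len);
    if (!buf) return 1;
    __m128i full    = _mm_set1_epi8((char)0x80);   // every byte written
    __m128i partial = _mm_set1_epi16(0x0080);      // every other byte masked out
    printf("full mask:    %llu ref cycles\n", time_maskmov(buf, len, full));
    printf("partial mask: %llu ref cycles\n", time_maskmov(buf, len, partial));
    free(buf);
    return 0;
}
```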
So TL:DR: maskmovdqu may have only been implemented efficiently in single-core CPUs. It originated in Katmai Pentium III with MMX maskmovq mm0, mm1; SMP systems existed, but maybe weren't the primary consideration for this instruction when it was being designed. SMP systems didn't have a shared last-level cache, but they did still have private write-back L1d caches on each socket.
The description is misleading. The non-temporal aspect of MASKMOVQ is that it doesn't generate an RFO if you write the entire line. If you use the masked aspect, you still need a read-modify-write, in which case you could just use an AVX-512 mask register.