 

What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.

If I write to memory, the MESI protocol requires that the cache line is first read into the cache, then modified there (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures this would immediately trigger the cache line being flushed; under write-back the flush can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.

My question is: if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, thereby possibly keeping the line in the Exclusive state rather than Modified, and saving the writeback memory overhead that would at some point follow?

As I vectorise more of my loops, my vectorised-operation compositional primitives don't explicitly check for values changing, and to do so in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (e.g. the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (e.g. repeated zeroing of the same memory buffer - we don't re-read the values from RAM if they're already in cache, but forcing a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.
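For reference, the software-side version of this idea is the classic check-before-write: read the value (cheap, since the line must be in cache anyway for the store) and skip the store if nothing would change, so clean lines stay clean. A minimal sketch, with a hypothetical helper name - this is the explicit-coding cost the question hopes the cache hardware could absorb:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero a buffer, but skip stores to words that are
 * already zero. Lines that were clean are never marked dirty, so they
 * never need to be written back. The branch and compare are the ALU
 * overhead the question asks whether hardware could avoid. */
static void zero_if_needed(uint64_t *buf, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++) {
        if (buf[i] != 0)   /* the load hits cache: the line is present anyway */
            buf[i] = 0;    /* only this store can transition the line to dirty */
    }
}
```

Whether this wins in practice depends on how often values actually differ: the compare adds work to every iteration, and it also defeats vectorised stores unless masked stores are used.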

Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...

Update: Some answers along the expected lines here https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization but still an awful lot of speculation "it must be hard because it isn't done" and saying how doing this in the main CPU core would be expensive (but I still wonder why it can't be a part of the actual cache logic itself).

Update (2020): Travis Downs has found evidence of Hardware Store Elimination but only, it seems, for zeros and only where the data misses L1 and L2, and even then, not in all cases. His article is highly recommended as it goes into much more detail.... https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html

Update (2021): Travis Downs has now found evidence that this zero store optimisation has recently been disabled in microcode... more detail as ever from the source himself https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt.html

asked Nov 21 '17 by Tim
1 Answer

I find evidence that some modern x86 CPUs from Intel, including Skylake and Ice Lake client chips, can optimize redundant (silent) stores in at least one specific case:

  • An all zero cache line is overwritten fully or partially with more zeros.

That is, a "zeros over zeros" scenario.

For example, this chart shows the performance (the circles, measured on the left axis) and relevant performance counters for a scenario where a region of varying size is filled with 32-bit values of either zero or one, on Ice Lake:

[Chart: Ice Lake fill performance, zeros vs ones]

Once the region no longer fits in the L2 cache, there is a clear advantage to writing zeros: the fill throughput is almost 1.5x higher. In the zeros case, we also see that the evictions from L2 are almost all "silent", indicating that no dirty data needed to be written out, while in the ones case all evictions are non-silent.

Some miscellaneous details about this optimization:

  • It optimizes away the write-back of the dirty cache line, not the RFO, which still needs to occur (indeed, the read is probably needed to decide that the optimization can be applied).
  • It seems to occur around the L2 or the L2 <-> L3 interface. That is, I don't find evidence of this optimization for regions that fit in L1 or L2.
  • Because the optimization takes effect at some point outside the innermost layer of the cache hierarchy, it is not necessary to write only zeros to take advantage: it is enough that the line contains all zeros once it is written back to the L3. So starting with an all-zero line, you can do any amount of non-zero writes, followed by a final zero-write of the entire line¹, as long as the line does not escape to the L3 in the meantime.
  • The optimization has varying performance effects: sometimes the relevant performance counters show it occurring, yet there is almost no throughput gain; other times the impact can be very large.
  • I don't find evidence of the effect on Skylake server or earlier Intel chips.
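The third bullet's access pattern can be written out explicitly. This is only a sketch of the pattern (function name is mine); actually observing the eliminated writeback would require uncore performance counters, and the line must not be evicted toward L3 between the scribbles and the final zeroing:

```c
#include <stdint.h>
#include <string.h>

/* Start from an all-zero 64-byte line, do arbitrary non-zero writes, then
 * zero the whole line again. If the line is only written back after the
 * final memset, it is all zeros at write-back time, so (on a core with the
 * optimization enabled) its write-back could still be eliminated even
 * though non-zero values were stored in between. */
static void scribble_then_rezero(uint8_t line[64])
{
    memset(line, 0, 64);          /* begin with an all-zero line */
    for (int i = 0; i < 64; i += 8)
        line[i] = 0xAB;           /* any amount of non-zero writes */
    memset(line, 0, 64);          /* final full-line zero write */
}
```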

I wrote this up in more detail here, and there is an addendum for Ice Lake, which exhibits this effect more strongly here.

Update, June 2021: This optimization has been disabled in the newest CPU microcode versions provided by Intel, for security reasons (details).


¹ Or, at least, overwrite the non-zero parts of the line with zeros.

answered Oct 03 '22 by BeeOnRope