 

Where does data go after eviction from a cache set on Intel Core i3/i7?

The L1/L2 caches are inclusive on Intel, and the L1/L2 caches are 8-way set associative, meaning that each set holds 8 different cache lines. Cache lines are operated on as a whole: if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right?

Now, my question is: whenever a cache line of a set is removed/evicted from the cache, either by some other process or by using clflush (manual eviction of a cache line/block), does the system store the evicted data of that cache line somewhere (in some buffer, register, etc.), so that next time it can load the data from that place to reduce the latency compared to loading it from main memory or a higher cache level? Or does it ALWAYS simply invalidate the data in the cache, so that next time the data is loaded from the next higher level?
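For concreteness, this is roughly what I mean by manual eviction and a timed reload (a minimal sketch with GCC/Clang intrinsics; the buffer, the fences and the rdtsc timing are only illustrative):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtsc */

    static uint64_t timed_load(volatile char *p)
    {
        _mm_mfence();
        uint64_t start = __rdtsc();
        (void)*p;            /* load one byte -> fetches the whole 64-byte line */
        _mm_mfence();
        return __rdtsc() - start;
    }

    int main(void)
    {
        static char buf[64] __attribute__((aligned(64)));

        buf[0] = 1;          /* bring the line into the cache hierarchy */
        printf("warm load     : %llu cycles\n",
               (unsigned long long)timed_load(buf));

        _mm_clflush(buf);    /* flush the whole line from every cache level */
        _mm_mfence();
        printf("after clflush : %llu cycles\n",
               (unsigned long long)timed_load(buf));
        return 0;
    }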

Any suggestion or link to a relevant article will be highly appreciated. Thanks in advance.

asked Oct 17 '13 by bholanath


2 Answers

L1/L2 are not necessarily inclusive; only the last-level cache is known to be so, which on an i7 would be the L3. You are right in saying that a cache line is the basic caching element: you would have to evict a whole cache line in order to fill in a new one (or when invalidating that single line). You can read some more about that here - http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-10.html

When a line is removed, the action taken depends on its MESI state (MESI and its derivatives are the protocols used to maintain cache coherency). If the line is modified ("M"), then the data must be written back to the next-level cache (in case of a miss it may allocate there, or write through to the level after that - it depends on the policy that cache maintains). Note that when you reach the last-level cache you would have to hit, since it's inclusive. When evicting a line from the last-level cache, it would have to be written to memory. Either way, failing to write back a modified line would result in loss of coherency, which would most likely cause incorrect execution.

If the line is not modified (Invalid, Exclusive or Shared), then the CPU may silently drop it without any need for a writeback, thereby saving bandwidth. By the way, there are also several other states in more complicated cache protocols (like MESIF or MOESI).
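In software-simulator terms, the eviction decision looks roughly like this (just a conceptual sketch with made-up types and a stub write-back hook, not how the hardware is actually implemented):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    typedef struct {
        uint64_t   tag;
        mesi_state state;
        uint8_t    data[64];   /* one full cache line */
    } cache_line;

    static void write_back_to_next_level(uint64_t tag, const uint8_t *data)
    {
        /* Stub: a real simulator would install the line in the next level,
           or write it to memory if this is the last level. */
        (void)data;
        printf("writing back dirty line with tag 0x%llx\n",
               (unsigned long long)tag);
    }

    static void evict(cache_line *victim)
    {
        if (victim->state == MODIFIED)
            /* Dirty line: it holds the only up-to-date copy, so it must be
               written back before the entry can be reused. */
            write_back_to_next_level(victim->tag, victim->data);
        /* Clean lines (E/S) are simply dropped - a valid copy exists below. */
        victim->state = INVALID;
    }

    int main(void)
    {
        cache_line line = { .tag = 0x1234, .state = MODIFIED };
        evict(&line);   /* prints the write-back message, then invalidates */
        return 0;
    }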

You can find lots of explanations by googling "cache coherence protocols". If you prefer a more solid source, you can refer to any CPU architecture or cache design textbook; I personally recommend Hennessy & Patterson's "Computer Architecture: A Quantitative Approach", which has a whole chapter on cache performance, but that's a bit off topic here.

Small update: as of Skylake, some of the CPUs (server segment) no longer have an inclusive L3, but rather a non-inclusive one (to support a larger L2). This means that clean lines are also likely to get written back when aging out of the L2, since the L3 does not normally hold copies of them.

More details: https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/4

answered Nov 03 '22 by Leeor


The L1/L2 caches are inclusive on Intel

With respect to cache inclusivity, Intel x86 processors fall into one of the following categories (a CPUID-based check for a specific machine is sketched after the list):

  • There are three levels of caches. The L3 is inclusive of the L2 and L1. The L2 is NINE of the L1 (Not Inclusive, Not Exclusive). This category includes all of the following processors: (1) All client processors that implement the Core microarchitecture up to and including Rocket Lake, except for the Core X and Xeon W processor series designed for the client market segment. This also includes the Xeon W-10000 series for the client segment. (2) All server processors that implement the Core microarchitecture up to and including BDX, and (3) All Xeon E3, Xeon E, and Xeon W-1200 processors.
  • There are two levels of caches. The L2 is NINE of the L1. All Atom processors (including Tremont) belong to this category. All old Intel processors (with two cache levels) also belong here.
  • There are two levels of caches. The L2 is inclusive of the L1D and NINE of the L1I. KNL and KNM processors belong here. The information available for KNC and KNF says that the L2 is inclusive of the L1, although this could be inaccurate and the L2 may be inclusive of only the L1D on these processors too. See below for MCDRAM.
  • There are three levels of caches. The L3 and the L2 are both NINE. This category includes all of the following processors: (1) All Pentium 4 processors with three levels of caches, (2) All generations of Xeon SP processors, (3) Xeon D-2100, Skylake Core X series processors, Skylake Xeon W series processors, which all use the SKX uncore rather than the SKL uncore, and (4) All Tiger Lake processors.
  • Lakefield processors have a three-level cache hierarchy. The 4 Tremont cores share a NINE L2 and the Sunny Cove core has its own NINE L2. All of the 5 cores share an LLC that can be configured as either inclusive or NINE.
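To check which category a particular machine falls into, CPUID leaf 4 enumerates the cache levels and reports, for each one, whether that cache is inclusive of the lower levels (EDX bit 1). Here is a minimal sketch using GCC/Clang's <cpuid.h>; the bit layout follows the Intel SDM, and the leaf is not meaningful on non-Intel CPUs:

    #include <stdio.h>
    #include <cpuid.h>   /* __get_cpuid_count - GCC/Clang only */

    int main(void)
    {
        for (unsigned subleaf = 0; ; subleaf++) {
            unsigned eax, ebx, ecx, edx;
            if (!__get_cpuid_count(4, subleaf, &eax, &ebx, &ecx, &edx))
                break;                        /* leaf 4 not supported */
            unsigned type  = eax & 0x1f;      /* 0 = no more caches   */
            unsigned level = (eax >> 5) & 0x7;
            if (type == 0)
                break;
            printf("L%u %s cache: %s of lower cache levels\n",
                   level,
                   type == 1 ? "data" : type == 2 ? "instruction" : "unified",
                   (edx & 0x2) ? "inclusive" : "not inclusive");
        }
        return 0;
    }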

Some processors have an L4 cache or a memory-side cache. These caches are NINE. In KNL and KNM, if MCDRAM is fully or partially configured to operate in cache mode, it's modified-inclusive of the L2 (and therefore the L1), meaning that inclusivity only applies to dirty lines in the L2 (in the M coherence state). On CSL processors that support Optane DIMMs, if the PMEM DIMMs are fully or partially configured to operate in cache mode, the DRAM DIMMs work as follows:

The Cascade Lake processor uses a novel cache management scheme using a combination of inclusive and noninclusive DRAM cache to reduce DRAM bandwidth overhead for writes while also eliminating the complexity of managing invalidates to processor caches on the eviction of an inclusive line from DRAM cache.

according to Cascade Lake: Next Generation Intel Xeon Scalable Processor.

The MCDRAM cache in KNL/KNM and DRAM cache in CSL do not fall in any of the three traditional inclusivity categories, namely inclusive, exclusive, and NINE. I think we can describe them as having "hybrid inclusivity."


AMD processors:

  • Zen family: The L2 is inclusive and the L3 is NINE.
  • Bulldozer family: The L2 is NINE and the L3 is NINE.
  • Jaguar and Puma: The L2 is inclusive. There is no L3.
  • K10 and Fusion: The L2 is exclusive. There is no L3.
  • Bobcat: I don't know about the L2. There is no L3.
  • K7 (models 3 and later) and K8: The L2 is exclusive. There is no L3.
  • K7 (models 1 and 2) and older: The L2 is inclusive. There is no L3.

No existing AMD processor has an L4 cache or a memory-side cache beyond the L3.

VIA processors:

  • Nano C and Eden C: I don't know about the L2. There is no L3.
  • All older processors: The L2 is exclusive. There is no L3.

This covers all current VIA processors.


and the L1/L2 caches are 8-way set associative, meaning that each set holds 8 different cache lines

This is true on most Intel processors. The only exception is the NetBurst microarchitecture where a single L2 way holds two adjacent cache lines, collectively called a sector.

An associativity of 8 is typical, but it's not uncommon to have different associativities. For example, the L1D in Sunny Cove is 12-way associative. See: How does the indexing of the Ice Lake's 48KiB L1 data cache work?.
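To make the set/way arithmetic concrete, here is a small sketch for a hypothetical 32 KiB, 8-way, 64-byte-line L1D (the numbers are typical, but they differ between microarchitectures):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical L1D geometry - adjust for the actual part. */
    #define CACHE_SIZE (32 * 1024)
    #define LINE_SIZE  64
    #define WAYS       8
    #define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * WAYS))   /* = 64 sets */

    int main(void)
    {
        uint64_t addr = 0x7ffd1234abcdULL;                 /* example address */

        unsigned offset = (unsigned)(addr % LINE_SIZE);    /* byte within the line */
        unsigned set    = (unsigned)((addr / LINE_SIZE) % NUM_SETS);

        printf("%d sets; address 0x%llx maps to set %u, byte offset %u\n",
               NUM_SETS, (unsigned long long)addr, set, offset);

        /* Up to WAYS (8) lines whose addresses map to the same set can be
           resident at once; bringing in a 9th forces an eviction from that set. */
        return 0;
    }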

Cache lines are operated on as a whole: if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right?

Right, this is due to a limitation in the coherence state associated with each cache entry of each cache level. There is only one state for all of the bytes of a cache line.

does the system store the evicted data of that cache line somewhere (in some buffer, register, etc.), so that next time it can load the data from that place to reduce the latency

There are several factors that impact this decision: (1) whether the line is dirty, (2) the inclusivity properties of the higher-numbered cache levels, if any, (3) whether the line is predicted to be accessed in the near future, and (4) if I remember correctly, if the memory type of a line changed from cacheable to uncacheable while it's resident in a cache, it'll be evicted and not cached in any other levels irrespective of the previous factors.

So a lazy answer that works for all processors is "maybe."

answered Nov 03 '22 by Hadi Brais