The L1/L2 caches are inclusive on Intel, and the L1/L2 caches are 8-way set associative, meaning that each set contains 8 different cache lines. Cache lines are operated on as a whole: if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right?
Now, my question: whenever a cache line is removed/evicted from a set, either by some other process or by using clflush (manual eviction of a cache line/block), does the system store the evicted data of that cache line somewhere (in any buffer, register, etc.) so that the next time it can load the data from that place with lower latency than loading it from main memory or a higher cache level, OR does it ALWAYS simply invalidate the data in the cache and reload it from the next higher level on the next access?
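For concreteness, this is roughly what I mean by manual eviction with clflush; a minimal sketch using the GCC/Clang x86 intrinsics (_mm_clflush, _mm_mfence, __rdtscp), where the exact cycle counts are noisy and only illustrative:

```c
/* Rough sketch (x86, GCC/Clang): time a load after clflush vs. a warm load.
 * The absolute cycle counts are noisy; the point is only the large latency
 * gap when the line must be refetched after eviction. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

static uint64_t time_load(volatile int *p)
{
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*p;                          /* the load being timed */
    uint64_t end = __rdtscp(&aux);
    _mm_mfence();
    return end - start;
}

int main(void)
{
    static volatile int data = 42;

    (void)data;                        /* warm the line */
    printf("warm load:     %llu cycles\n", (unsigned long long)time_load(&data));

    _mm_clflush((const void *)&data);  /* evict the whole cache line */
    _mm_mfence();
    printf("after clflush: %llu cycles\n", (unsigned long long)time_load(&data));
    return 0;
}
```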
Any suggestion or link to an article would be highly appreciated. Thanks in advance.
L1/L2 are not necessarily inclusive; only the last-level cache is known to be so, which on i7 would be the L3. You are right in saying that a cache line is the basic caching element: you would have to throw out a whole cache line in order to fill in a new one (or when invalidating that single line). You can read some more about that here: http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-10.html
When a line is removed, the action taken depends on its MESI state (MESI and its derivatives are the protocols for cache coherency maintenance). If the line is modified ("M"), then the data must be written back to the next-level cache (on a miss it may allocate there, or write through on to the next level, depending on the policy that cache maintains). Note that when you reach the last-level cache you would have to hit, since it's inclusive. When evicting a line from the last-level cache, it would have to get written to memory. Either way, failing to write back a modified line would result in loss of coherency, which would most likely result in incorrect execution.
If the line is not modified (Invalid, Exclusive or Shared), then the CPU may silently drop it without a writeback, thereby saving bandwidth. By the way, there are also several other states in more complicated cache protocols (like MESIF or MOESI).
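Just to make that policy concrete, here is a toy model of the eviction decision in C; it only illustrates the rule (dirty lines get written back, clean lines can be dropped), not how any real hardware implements it:

```c
/* Toy model of the eviction decision: a Modified line must be written back
 * to the next level (or to memory at the last level); clean lines can be
 * dropped silently without spending writeback bandwidth. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

static void evict_line(mesi_state s)
{
    switch (s) {
    case MODIFIED:
        printf("dirty line: write data back to next level / memory\n");
        break;
    case EXCLUSIVE:
    case SHARED:
        printf("clean line: drop silently, no writeback needed\n");
        break;
    case INVALID:
        printf("nothing to do, the entry holds no valid data\n");
        break;
    }
}

int main(void)
{
    evict_line(MODIFIED);
    evict_line(SHARED);
    return 0;
}
```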
You can find lots of explanations by googling for "cache coherence protocols". If you prefer a more solid source, you can refer to any CPU architecture or cache design textbook; I personally recommend Hennessy & Patterson's "Computer Architecture: A Quantitative Approach", which has a whole chapter on cache performance, but that's a bit off topic here.
Small update: as of Skylake, some of the CPUs (server parts) no longer have an inclusive L3, but rather a non-inclusive one (to support a larger L2). This means that clean lines are also likely to get written back when aging out of the L2, since the L3 does not normally hold copies of them.
More details: https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/4
The L1/L2 caches are inclusive on Intel
With respect to cache inclusivity, Intel x86 processors fall into one of the following categories:
Some processors have an L4 cache or a memory-side cache. These caches are NINE (non-inclusive, non-exclusive). In KNL and KNM, if MCDRAM is fully or partially configured to operate in cache mode, it's modified-inclusive of the L2 (and therefore the L1), meaning that inclusivity only applies to dirty lines in the L2 (in the M coherence state). On CSL processors that support Optane DIMMs, if the PMEM DIMMs are fully or partially configured to operate in cache mode, the DRAM DIMMs work as follows:
The Cascade Lake processor uses a novel cache management scheme using a combination of inclusive and noninclusive DRAM cache to reduce DRAM bandwidth overhead for writes while also eliminating the complexity of managing invalidates to processor caches on the eviction of an inclusive line from DRAM cache.
according to Cascade Lake: Next Generation Intel Xeon Scalable Processor.
The MCDRAM cache in KNL/KNM and DRAM cache in CSL do not fall in any of the three traditional inclusivity categories, namely inclusive, exclusive, and NINE. I think we can describe them as having "hybrid inclusivity."
AMD processors:
No existing AMD processor has an L4 cache or a memory-side cache beyond the L3.
VIA processors:
This covers all current VIA processors.
and the L1/L2 caches are 8-way set associative, meaning that each set contains 8 different cache lines.
This is true on most Intel processors. The only exception is the NetBurst microarchitecture, where a single L2 way holds two adjacent cache lines, collectively called a sector.
An associativity of 8 is typical, but it's not uncommon to have different associativities. For example, the L1D in Sunny Cove is 12-way associative. See: How does the indexing of the Ice Lake's 48KiB L1 data cache work?.
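As a worked example of how cache size, associativity, and line size relate, here is a small sketch; the 32 KiB / 8-way / 64-byte numbers are just a typical L1D configuration, and the real indexing details (e.g. on Ice Lake's 48 KiB L1D) vary by microarchitecture:

```c
/* Worked example: for a hypothetical 32 KiB, 8-way cache with 64-byte lines
 * there are 32768 / (8 * 64) = 64 sets, and the set is picked by
 * (address / line_size) % sets, i.e. address bits [11:6] in this case. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t cache_size = 32 * 1024;  /* bytes */
    const uint64_t ways       = 8;
    const uint64_t line_size  = 64;         /* bytes */
    const uint64_t sets       = cache_size / (ways * line_size);  /* 64 */

    uint64_t addr = 0x7ffd1234abc0ULL;      /* arbitrary example address */
    uint64_t set  = (addr / line_size) % sets;

    printf("sets = %llu, address 0x%llx maps to set %llu\n",
           (unsigned long long)sets,
           (unsigned long long)addr,
           (unsigned long long)set);
    return 0;
}
```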
Cache lines are operated on as a whole: if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right?
Right; this follows from how coherence state is tracked. Each cache entry at each cache level has a single coherence state covering all of the bytes of the cache line, so the line can only be installed, invalidated, or evicted as a unit.
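To see that granularity from software, here is a small sketch (assuming 64-byte lines and GCC/Clang on x86): flushing the address of one field evicts every byte of the line, including the neighboring field.

```c
/* Illustration of line granularity: a and b live in the same (assumed)
 * 64-byte line, so flushing the address of a also evicts b; there is no
 * way to flush or invalidate only part of a line. */
#include <x86intrin.h>

struct __attribute__((aligned(64))) line {
    int a;   /* offset 0 */
    int b;   /* offset 4, same 64-byte line as a */
};

int main(void)
{
    static volatile struct line l = { 1, 2 };

    _mm_clflush((const void *)&l.a);  /* evicts the entire line, including l.b */
    _mm_mfence();
    return l.a + l.b;                 /* both loads now miss and refetch the line */
}
```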
does the system store the evicted data of that cache line somewhere (in any buffer, register, etc.) so that the next time it can load the data from that place with lower latency
There are several factors that impact this decision: (1) whether the line is dirty, (2) the inclusivity properties of the higher-numbered cache levels, if any, (3) whether the line is predicted to be accessed in the near future, and (4) if I remember correctly, if the memory type of a line changed from cacheable to uncacheable while it's resident in a cache, it'll be evicted and not cached in any other levels irrespective of the previous factors.
So a lazy answer that works for all processors is "maybe."