 

Which cache mapping technique is used in the Intel Core i7 processor?

I have learned about different cache mapping techniques like direct mapping and fully-associative or set-associative mapping, and the trade-offs between them (Wikipedia).

But I am curious: which one is used in Intel Core i7 or AMD processors nowadays?

How have the techniques evolved? And what are things that need to be improved?

asked Mar 04 '18 by Subhadip


1 Answer

Direct-mapped caches are basically never used in modern high-performance CPUs. The power savings are outweighed by the large advantage in hit rate for a set-associative cache of the same size, with only a bit more complexity in the control logic. Transistor budgets are very large these days.

It's very common for software to have at least a couple arrays that are a multiple of 4k apart from each other, which would create conflict misses in a direct-mapped cache. (Tuning code with more than a couple arrays can involve skewing them to reduce conflict misses, if a loop needs to iterate through all of them at once)
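As a minimal sketch of that conflict-miss problem (a toy model with made-up sizes, not any real CPU's cache): three arrays laid out a multiple of the cache size apart always map their corresponding elements to the same set, so a direct-mapped cache thrashes on `c[i] = a[i] + b[i]`, while a set-associative cache of the same size has enough ways per set to hold all three lines.

```c
#include <stdio.h>
#include <stdint.h>

/* Toy model of a 32 KiB direct-mapped cache with 64-byte lines
 * (hypothetical geometry, just to illustrate conflict misses).
 * 32 KiB / 64 B = 512 sets, so bits [14:6] of the address pick the set. */
#define LINE_BYTES   64
#define CACHE_BYTES  (32 * 1024)
#define NUM_SETS     (CACHE_BYTES / LINE_BYTES)

static unsigned set_index(uintptr_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

int main(void)
{
    /* Three arrays spaced exactly CACHE_BYTES apart (e.g. big page-aligned
     * allocations placed back-to-back).  a[i], b[i] and c[i] always land in
     * the *same* set, so a direct-mapped cache evicts one line per access;
     * an 8-way cache of the same size keeps all three lines resident.
     * Skewing the arrays by a line or two of padding also de-aliases them. */
    uintptr_t a = 0x100000;
    uintptr_t b = a + CACHE_BYTES;
    uintptr_t c = b + CACHE_BYTES;

    for (int i = 0; i < 4; i++) {
        uintptr_t off = (uintptr_t)i * LINE_BYTES;
        printf("i=%d  sets: a=%u b=%u c=%u\n", i,
               set_index(a + off), set_index(b + off), set_index(c + off));
    }
    return 0;
}
```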

Modern CPUs are so fast that DRAM latency is over 200 core clock cycles, which is too big even for powerful out-of-order execution CPUs to hide very well on a cache miss.
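(As a rough sanity check on that figure, with illustrative numbers rather than measurements from any specific part: ~60-70 ns of load-to-use DRAM latency on a ~4 GHz core is 60-70 ns × 4 cycles/ns ≈ 240-280 core clock cycles.)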


Multi-level caches are essential (and used in all high-performance CPUs) to give the low latency (~4 cycles) / high throughput for the hottest data (e.g. up to 2 loads and 1 store per clock, with a 128, 256 or even 512-bit path between L1D cache and vector load/store execution units), while still being large enough to cache a reasonably sized working set. It's physically impossible to build one very large / very fast / highly-associative cache that performs as well as current multi-level caches for typical workloads; speed-of-light delays when data has to physically travel far are a problem. The power cost would be prohibitive as well. (In fact, power / power density is a major limiting factor for modern CPUs, see Modern Microprocessors: A 90-Minute Guide!.)

All levels of cache (except the uop cache) are physically indexed / physically tagged in all the x86 CPUs I'm aware of. L1D caches in most designs take their index bits from below the page offset, and thus are also VIPT allowing TLB lookup to happen in parallel with tag fetch, but without any aliasing problems. Thus, caches don't need to be flushed on context switches or anything. (See this answer for more about multi-level caches in general and the VIPT speed trick, and some cache parameters of some actual x86 CPUs.)
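As a quick back-of-the-envelope check on why that VIPT trick works (using the typical 32 KiB / 8-way / 64-byte-line L1D geometry of recent Intel client chips; other parts may differ):

```c
#include <stdio.h>

int main(void)
{
    /* Typical Intel client L1D geometry: 32 KiB, 8-way, 64-byte lines. */
    unsigned size = 32 * 1024, ways = 8, line = 64;
    unsigned sets = size / (ways * line);                     /* 64 sets      */

    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned v = line; v > 1; v >>= 1) offset_bits++;    /* log2(64) = 6 */
    for (unsigned v = sets; v > 1; v >>= 1) index_bits++;     /* log2(64) = 6 */

    /* offset + index = 12 bits, exactly the 4 KiB page offset, which is
     * identical in the virtual and physical address.  So the set can be
     * selected from the virtual address while the TLB translates the upper
     * bits in parallel: VIPT that behaves like PIPT, with no aliasing. */
    printf("sets=%u, offset=%u + index=%u = %u bits (page offset = 12)\n",
           sets, offset_bits, index_bits, offset_bits + index_bits);
    return 0;
}
```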


The private (per-core) L1D / L1I and L2 caches are traditional set-associative caches, often 8-way or 4-way for the small/fast caches. Cache line size is 64 bytes on all modern x86 CPUs. The data caches are write-back. (Except on AMD Bulldozer-family, where L1D is write-through with a small 4kiB write-combining buffer.)

http://www.7-cpu.com/ has good cache organization / latency numbers, and bandwidth, and TLB organization / performance numbers, for various microarchitectures, including many x86, like Haswell.

The "L0" decoded-uop cache in Intel Sandybridge-family is set-associative and virtually addressed. Up to 3 blocks of up to 6 uops can cache decode results from instructions in a 32-byte block of machine code. Related: Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs. (A uop cache is a big advance for x86: x86 instructions are variable-length and hard to decode fast / in parallel, so caching the internal decode results as well as the machine code (L1I$) has significant power and throughput advantages. Powerful decoders are still needed, because the uop cache isn't large; it's most effective in loops (including medium to large loops). This avoids the Pentium4 mistake (or limitation based on transitor size at the time) of having weak decoders and relying on the trace cache.)


Modern Intel (and AMD, I assume) L3 aka LLC aka last-level caches use an indexing function that isn't just a range of address bits. It's a hash function that better distributes things to reduce collisions from fixed strides. See According to Intel my cache should be 24-way associative though its 12-way, how is that?
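Intel's actual hash is undocumented; the sketch below only illustrates the general idea (XOR-folding higher physical-address bits into the index), so that addresses differing by a large power-of-two stride no longer all collapse onto the same set the way a plain bit-slice index would:

```c
#include <stdint.h>

/* NOT the real Intel LLC hash -- just an illustration of hashed indexing.
 * Fold higher address bits into the index with XOR so fixed power-of-two
 * strides are spread over many sets instead of hammering one. */
static unsigned llc_set_index(uint64_t paddr, unsigned num_sets)
{
    uint64_t line = paddr >> 6;        /* discard the 64-byte line offset     */
    uint64_t h = line;
    h ^= line >> 13;                   /* mix in bits above the plain index   */
    h ^= line >> 27;
    return (unsigned)(h % num_sets);   /* num_sets assumed a power of two     */
}
```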


From Nehalem onwards, Intel has used a large inclusive shared L3 cache, which filters coherency traffic between cores. i.e. when one core reads data which is in Modified state in L1d of another core, L3 tags say which core, so an RFO (Read For Ownership) can be sent only to that core, instead of broadcast. How are the modern Intel CPU L3 caches organized?. The inclusivity property is important, because it means no private L2 or L1 cache can have a copy of a cache line without L3 knowing about it. If it's in Exclusive or Modified state in a private cache, L3 will have Invalid data for that line, but the tags will still say which core might have a copy. Cores that definitely don't have a copy don't need to be sent a message about it, saving power and bandwidth over the internal links between cores and L3. See Why On-Chip Cache Coherence Is Here to Stay for more details about on-chip cache coherency in Intel "i7" (i.e. Nehalem and Sandybridge-family, which are different architectures but do use the same cache hierarchy).

Core2Duo had a shared last-level cache (L2), but was slow at generating RFO (Read-For-Ownership) requests on L2 misses. So bandwidth between cores with a small buffer that fits in L1d is as slow as with a large buffer that doesn't fit in L2 (i.e. DRAM speed). There's a fast range of sizes when the buffer fits in L2 but not L1d, because the writing core evicts its own data to L2 where the other core's loads can hit without generating an RFO request. (See Figure 3.27: Core 2 Bandwidth with 2 Threads in Ulrich Drepper's "What Every Programmer Should Know About Memory"; full version here.)


Skylake-AVX512 has larger per-core L2 (1MiB instead of 256k), and smaller L3 (LLC) slices per core. It's no longer inclusive. It uses a mesh network instead of a ring bus to connect cores to each other. See this AnandTech article (but it has some inaccuracies in the microarchitectural details on other pages, see the comment I left).

From Intel® Xeon® Processor Scalable Family Technical Overview

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On the previous-generation CPUs, the shared LLC itself took care of this task.

This "snoop-filter" is only useful if it can't have false negatives. It's ok to send an invalidate or RFO (MESI) to a core that doesn't have a copy of a line. It's not ok to let a core keep a copy of a line when another core is requesting exclusive access to it. So it may be a tag-inclusive tracker that knows which cores might have copies of which line, but which doesn't cache any data.

Or maybe the snoop filter can still be useful without being strictly inclusive of all L2 / L1 tags. I'm not an expert on multi-core / multi-socket snoop protocols. I think the same snoop filter may also help filter snoop requests between sockets. (In Broadwell and earlier, only quad-socket and higher Xeons have a snoop filter for inter-core traffic; dual-socket-only Broadwell Xeon and earlier don't filter snoop requests between the two sockets.)
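A conceptual model of such a tag-inclusive, data-less tracker (not Intel's implementation; the field names and sizes are made up for illustration) is sketched below: each entry records which cores might hold a line, and an RFO is snooped only to those cores instead of being broadcast.

```c
#include <stdint.h>

/* Conceptual snoop-filter / directory entry: tracks tags only, no data.
 * False positives (stale bits) just cost an unnecessary snoop; false
 * negatives would break coherence, so a bit is cleared only when a core's
 * copy is known to be gone (eviction notification / snoop response). */
struct snoop_filter_entry {
    uint64_t tag;          /* which physical line this entry tracks       */
    uint16_t core_mask;    /* bit i set => core i may hold a copy         */
    uint8_t  valid;
};

/* On an RFO from `requester`, snoop only the cores whose bit is set,
 * instead of broadcasting to every core on the chip. */
static uint16_t cores_to_snoop(const struct snoop_filter_entry *e,
                               unsigned requester)
{
    if (!e->valid)
        return 0;                       /* no core can have it cached      */
    return e->core_mask & (uint16_t)~(1u << requester);
}
```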


AMD Ryzen uses separate L3 caches for clusters of cores, so data shared across many cores has to be duplicated in the L3 for each cluster. Also importantly, writes from a core in one cluster take longer to be visible to a core in another cluster, with the coherency requests having to go over an interconnect between clusters. (Similar to between sockets in a multi-socket Intel system, where each CPU package has its own L3.)

So this gives us NUCA (Non-Uniform Cache Access), analogous to the usual NUMA (Non-Uniform Memory Access) that you get in a multi-socket system where each processor has a memory controller built-in, and accessing local memory is faster than accessing memory attached to another socket.


Recent Intel multi-socket systems have configurable snoop modes so in theory you can tune the NUMA mechanism to work best for the workload you're running. See Intel's page about Broadwell-Xeon for a table + description of the available snoop modes.


Another advance / evolution is an adaptive replacement policy in the L3 on IvyBridge and later. This can reduce pollution when some data has temporal locality but other parts of the working set are much larger. (i.e. looping over a giant array with standard LRU replacement will evict everything, leaving L3 cache only caching data from the array that won't be touched again soon. Adaptive replacement tries to mitigate that problem.)
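The access pattern that motivates this looks roughly like the sketch below (sizes made up for illustration): a small hot table reused on every iteration, plus a streaming pass over a huge array that is touched once. With plain LRU the streaming lines look "recently used" and push the hot table out; a scan-resistant / adaptive policy tries to keep the table resident.

```c
#include <stddef.h>

/* Illustration only: `hot` fits in L3 and is reused constantly, `huge` is
 * many times larger than L3 and each element is read exactly once.  Pure
 * LRU replacement would let the one-shot streaming data evict the hot
 * table; an adaptive policy detects the scan and protects the reused data. */
double mix(const double *hot, size_t hot_len,     /* small, reused           */
           const double *huge, size_t huge_len)   /* huge, streamed once     */
{
    double sum = 0.0;
    for (size_t i = 0; i < huge_len; i++)
        sum += huge[i] * hot[i % hot_len];
    return sum;
}
```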


Further reading:

  • What Every Programmer Should Know About Memory?
  • Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? (Single-threaded memory bandwidth on many-core Xeon CPUs is limited by max_concurrency / latency, not DRAM bandwidth).
  • http://users.atw.hu/instlatx64/ for memory-performance timing results
  • http://www.7-cpu.com/ for cache / TLB organization and latency numbers.
  • http://agner.org/optimize/ for microarchitectural details (mostly about the execution pipeline, not memory), and asm / C++ optimization guides.
  • Stack Overflow's x86 tag wiki has a performance section, with links to those and more.
answered Sep 21 '22 by Peter Cordes