How are cache memories shared in multicore Intel CPUs?

I have a few questions regarding cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, this has many repercussions when writing software for multicore processors or multiprocessor systems, hence asking here!)

1. In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo, etc.), does each CPU core/processor have its own cache memory (data and program cache)?
2. Can one processor/core access another's cache memory? If they are allowed to access each other's caches, then I believe there might be fewer cache misses: if one processor's cache does not have some data but another processor's cache does, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?
3. Will there be any problems in allowing any processor to access another processor's cache memory?

asked Jun 03 '09 14:06

goldenmean

2 Answers

In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo, etc.), does each CPU core/processor have its own cache memory (data and program cache)?

Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.

On old and/or low-power CPUs, the next level of cache is typically a unified L2 cache shared between all cores. On the 65nm Core 2 Quad (which was two Core 2 Duo dies in one package), each pair of cores had its own last-level cache, and the two pairs couldn't communicate as efficiently.

Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.

32KiB split L1i/L1d: private per-core (same as earlier Intel)
256KiB unified L2: private per-core (1MiB on Skylake-avx512)
large unified L3: shared among all cores
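On a Linux machine, the hierarchy listed above can usually be inspected through sysfs. The following is a minimal sketch, assuming the kernel exposes the usual `/sys/devices/system/cpu/cpu0/cache/index*` directories with `level`, `type`, `size` and `shared_cpu_list` files; the function name `read_cache_levels` is made up for this illustration, and the function returns an empty list where that directory is missing.

```python
from pathlib import Path


def _read_field(path):
    """Return the stripped contents of a sysfs file, or None if it is absent."""
    return path.read_text().strip() if path.is_file() else None


def read_cache_levels(cpu_cache_dir="/sys/devices/system/cpu/cpu0/cache"):
    """Describe each cache visible to one logical CPU (hypothetical helper)."""
    base = Path(cpu_cache_dir)
    if not base.is_dir():
        return []  # not Linux, or sysfs cache info is not exposed
    caches = []
    for index_dir in sorted(base.glob("index*")):
        caches.append({
            "level": _read_field(index_dir / "level"),
            "type": _read_field(index_dir / "type"),
            "size": _read_field(index_dir / "size"),
            # Logical CPUs that share this cache with cpu0.
            "shared_cpu_list": _read_field(index_dir / "shared_cpu_list"),
        })
    return caches


if __name__ == "__main__":
    for cache in read_cache_levels():
        print(cache)
```

If the description above holds for the machine at hand, the L1 and L2 entries should list only the logical CPUs of a single physical core in `shared_cpu_list`, while the L3 entry should list every core on the chip.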

Last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. There is typically 1.5 to 2.25MB of L3 cache per core, so a many-core Xeon might have a 36MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.

On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches, so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. That is, anything cached in a private L1d, L1i, or L2 must also be allocated in L3. See the related question "Which cache mapping technique is used in intel core i7 processor?"
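The snoop-filter idea can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how the hardware is actually implemented: the class `InclusiveL3` and its method names are hypothetical, and the model tracks just which cores may hold a private copy of each line.

```python
class InclusiveL3:
    """Toy inclusive L3: its tags record which cores may hold a line privately."""

    def __init__(self):
        # line address -> set of core ids that may have a private copy
        self.sharers = {}

    def private_fill(self, core, line):
        # Inclusive policy: filling a private L1/L2 also allocates the line in L3.
        self.sharers.setdefault(line, set()).add(core)

    def evict_from_l3(self, line):
        # Evicting a line from an inclusive L3 means private copies must go too.
        # Return the cores that would need a back-invalidate.
        return self.sharers.pop(line, set())

    def cores_to_snoop(self, requester, line):
        # If the line is absent from L3, no private cache can hold it,
        # so nothing needs to be snooped or broadcast.
        return self.sharers.get(line, set()) - {requester}


l3 = InclusiveL3()
l3.private_fill(1, 0x1000)
l3.private_fill(2, 0x1000)
print(l3.cores_to_snoop(0, 0x1000))  # only the cores recorded as possible holders
print(l3.cores_to_snoop(0, 0x2000))  # line not in L3: empty set, no snoops
```

The point of the model is the last method: a miss in the inclusive L3 tags is enough to rule out every private cache at once, which is what saves the broadcast.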

David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to shared L3, and DDR3 / DMI (chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs.)

Can one processor/core access another's cache memory? If they are allowed to access each other's caches, then I believe there might be fewer cache misses: if one processor's cache does not have some data but another processor's cache does, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?

No. Each CPU core's L1 caches tightly integrate into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.

The whole point of multiple levels of cache is that a single cache can't be both fast enough for very hot data and big enough for less-frequently used data that's still accessed regularly. See also the related question "Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?"
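The trade-off can be made concrete with the standard average-access-time recurrence: each level costs its hit latency plus its miss rate times the cost of the next level down. The sketch below uses made-up latencies and miss rates purely for illustration; they are assumptions, not measurements of any real CPU, and `average_access_time` is a hypothetical helper name.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), nearest cache first.
    memory_latency: cost in cycles of going all the way to DRAM.
    """
    total = memory_latency
    # Work from the level closest to memory back towards the core.
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Hypothetical numbers only: one big, slow cache versus a three-level hierarchy.
single_big_cache = [(40, 0.02)]
three_levels = [(4, 0.10), (12, 0.30), (40, 0.20)]

print(average_access_time(single_big_cache, memory_latency=200))
print(average_access_time(three_levels, memory_latency=200))
```

With these assumed numbers the hierarchy comes out well ahead, because most accesses pay only the small L1 latency, while the single large cache makes every access pay its full latency.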

Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. And the mesh network between cores that would be required to make this happen would be prohibitively expensive compared to just building a larger / faster L3 cache.

The small/fast caches built into other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches.)

Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.

Will there be any problems in allowing any processor to access another processor's cache memory?

Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.

A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.

The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
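To make the protocol concrete, here is a toy simulation of plain MESI for a single cache line shared by several cores. It is a simplified sketch, not Intel's implementation: the class `MesiLine` is hypothetical, bus transactions are collapsed into method calls, and the extra Forward state of MESIF is not modeled.

```python
class MesiLine:
    """One cache line tracked across cores with M/E/S/I states (toy model)."""

    def __init__(self, n_cores, initial=0):
        self.memory = initial            # value currently in main memory
        self.state = ["I"] * n_cores     # per-core MESI state
        self.data = [None] * n_cores     # per-core cached value

    def _fetch(self, requester):
        # A Modified owner must write its dirty data back before anyone else reads.
        for core, st in enumerate(self.state):
            if core != requester and st == "M":
                self.memory = self.data[core]
        return self.memory

    def read(self, core):
        if self.state[core] == "I":      # read miss
            value = self._fetch(core)
            others = [c for c, st in enumerate(self.state)
                      if c != core and st != "I"]
            for c in others:
                self.state[c] = "S"      # M/E/S holders all end up Shared
            self.state[core] = "S" if others else "E"
            self.data[core] = value
        return self.data[core]           # hit in M, E or S needs no traffic

    def write(self, core, value):
        if self.state[core] == "I":
            self._fetch(core)            # read-for-ownership
        for c in range(len(self.state)):
            if c != core:                # invalidate every other copy
                self.state[c] = "I"
                self.data[c] = None
        self.state[core] = "M"
        self.data[core] = value


line = MesiLine(n_cores=2, initial=7)
line.read(0)          # core 0 gets the line Exclusive
line.write(0, 42)     # core 0 now holds it Modified; memory is stale
print(line.read(1))   # core 1's read forces a write-back and sees the new value
print(line.state)
```

This walks through exactly the scenario described above: the second core's read cannot be satisfied from stale memory, so the protocol forces the first core to supply the updated data and both copies end up Shared.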

answered Sep 20 '22 09:09

Adam Rosenfield

Quick answers: 1) Yes. 2) No, but it may depend on which memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.

For a full-length explanation, you should read the 9-part article "What every programmer should know about memory" by Ulrich Drepper ( http://lwn.net/Articles/250967/ ). It gives the full picture of the issues you are asking about, in good and accessible detail.

answered Sep 19 '22 09:09

Panic

Tags: performance, x86, multiprocessing, cpu-cache, intel
