Are caches of different level operating in the same frequency domain?

Larger caches usually have longer bitlines or wordlines, and thus most likely higher access latency and cycle time.

So, do L2 caches work in the same clock domain as L1 caches? How about L3 cache slices, now that they are non-inclusive and shared among all the cores?

Related questions: Are all functional units in a core in the same clock domain? Are the uncore parts all in the same clock domain? Are the cores in a multi-core system synchronous with each other?

I believe clock domain crossings introduce extra latency. Do most parts of a CPU chip work in the same clock domain?

asked Dec 23 '22 by Zine.Chant

2 Answers

In modern CPUs¹, the private L1i/d caches are always part of each core, not on a separate clock. L1d is very tightly coupled with the load execution units and the L1dTLB. This is pretty universally true across architectures. (VIPT Cache: Connection between TLB & Cache?).

On CPUs with per-core private L2 cache, it's also part of the core, in the same frequency domain. This keeps L2 latency very low by keeping timing (in core clock cycles) fixed, and not requiring any async logic to transfer data across clock domains. This is true on Intel and AMD x86 CPUs, and I assume most other designs.
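
If you want to see this yourself, here's a minimal dependent pointer-chase sketch (mine, not part of the original answer): each load depends on the previous load's result, so every iteration pays the full load-to-use latency. Assumptions: x86 with GCC/Clang; note that __rdtsc() counts reference cycles, not core cycles, so pin the core frequency (or convert) before trusting the numbers.

    // Minimal dependent pointer-chase latency sketch (assumptions: x86,
    // GCC/Clang; __rdtsc() counts reference cycles, not core cycles).
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define ITERS 100000000UL

    int main(void) {
        size_t n = 32 * 1024 / sizeof(void *);   // ~32 KiB: fits in L1d
        void **ring = malloc(n * sizeof(void *));
        for (size_t i = 0; i < n; i++)           // sequential ring; real tests
            ring[i] = &ring[(i + 1) % n];        // shuffle to defeat prefetching
        void **p = ring;
        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++)
            p = (void **)*p;                     // each load depends on the last
        uint64_t t1 = __rdtsc();
        printf("%.2f ref cycles per load (%p)\n",
               (double)(t1 - t0) / ITERS, (void *)p);
        return 0;
    }

Run it once with the working set under 32 KiB and once at a few hundred KiB: on a CPU with private per-core L2 you should see two flat plateaus (L1d then L2 latency), each a fixed number of core clock cycles, which is exactly what keeping both caches in the core's frequency domain buys you.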

Footnote 1: Decades ago, when even having the L1 caches on-chip was a stretch for transistor budgets, sometimes just the comparators and maybe tags were on-chip, so that part could go fast while starting to set up the access to the data on external SRAM. (Or if not external, sometimes a separate die (piece of silicon) in the same plastic / ceramic package, so the wires could be very short and not exposed as external pins that might need ESD protection, etc).

Or for example early Pentium II ran its off-die / on-package L2 cache at half core clock speed (down from full speed in PPro). (But all the same "frequency domain"; this was before DVFS dynamic frequency/voltage for power management.) L1i/d was tightly integrated into the core like they still are today; you have to go farther back to find CPUs with off-die L1, like maybe early classic RISC CPUs.


The rest of this answer is mostly about Intel x86 CPUs, because from your mention of L3 slices I think that's what you're imagining.

How about L3 cache (slices) since they are now non-inclusive and shared among all the cores?

Of mainstream Intel CPUs (P6 / SnB-family), only Skylake-X has non-inclusive L3 cache. Intel since Nehalem has used inclusive last-level cache so its tags can be a snoop filter. See Which cache mapping technique is used in intel core i7 processor?. But SKX changed from a ring to a mesh, and made L3 non-inclusive / non-exclusive.


On Intel desktop/laptop CPUs (dual/quad), all cores (including their L1+L2 caches) are in the same frequency domain. The uncore (the L3 cache + ring bus) is in a separate frequency domain, but I think normally runs at the speed of the cores. It might clock higher than the cores if the GPU is busy but the cores are all idle.

The memory clock stays high even when the CPU clocks down. (Still, single-core bandwidth can suffer if the CPU decides to clock down from 4.0 to 2.7GHz because it's running memory-bound code on the only active core. Single-core bandwidth is limited by max_concurrency / latency, not by DRAM bandwidth itself if you have dual-channel DDR4 or DDR3. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? I think this is because of increased uncore latency.)
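
To make the max_concurrency / latency point concrete, here's a back-of-the-envelope model; the specific numbers are illustrative assumptions, not measurements. With ~10 L1d line-fill buffers and 64-byte lines, one core can only have ~640 bytes of demand misses in flight, so its bandwidth ceiling is that divided by the memory latency:

    // Back-of-envelope single-core bandwidth ceiling:
    //   bandwidth ~= outstanding_misses * line_size / memory_latency
    // All numbers below are illustrative assumptions, not measurements.
    #include <stdio.h>

    int main(void) {
        double lfbs     = 10;       // per-core L1d line-fill buffers
        double line     = 64;       // bytes per cache line
        double lat_low  = 60e-9;    // ~60 ns load latency (client uncore)
        double lat_high = 90e-9;    // ~90 ns (bigger, slower server uncore)
        printf("low latency:  %.1f GB/s\n", lfbs * line / lat_low  / 1e9);
        printf("high latency: %.1f GB/s\n", lfbs * line / lat_high / 1e9);
        return 0;   // ~10.7 vs ~7.1 GB/s: same DRAM, lower single-core BW
    }

Same DRAM, same buffers; just raising the latency from 60 to 90 ns cuts the single-core ceiling by a third. That's why uncore latency (and uncore clocks) matter so much for single-threaded bandwidth.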

The wikipedia Uncore article mentions overclocking it separately from the cores to reduce L3 / memory latency.


On Haswell and later Xeons (E5 v3), the uncore (ring bus and L3 slices) and each individual core have separate frequency domains. (Source: Frank Denneman's NUMA Deep Dive Part 2: System Architecture. It has a typo, calling Haswell "v4" when Haswell is actually Xeon E[357]-xxxx v3. But other sources, like the paper Comparisons of core and uncore frequency scaling modes in quantum chemistry application GAMESS, confirm that Haswell does have those features.) Uncore Frequency Scaling (UFS) and Per Core Power States (PCPS) were both new in Haswell.


On Xeons before Haswell, the uncore runs at the speed of the current fastest core on that package. On a dual-socket NUMA setup, this can badly bottleneck the other socket by making it slow to keep up with snoop requests. See John "Dr. Bandwidth" McCalpin's post on this Intel forum thread:

On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective local memory latency seen by the processors and DMA engines on the other chip!

... On my Xeon E5-2680 chips, the "package C1E" state increases local latency on the other chip by almost 20%

The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.

Dr. Bandwidth ran a simple infinite-loop pinned to a core on the other socket to keep it clocked up, and was able to measure the difference.
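
That experiment is easy to reproduce. A sketch of the busy-loop half of it, using the standard Linux affinity API (core number 8 is an arbitrary assumption; substitute any core on the remote socket):

    // Keep one core on the *other* socket busy so that package can't drop
    // into package C1E, then re-run the latency/bandwidth test. Core 8 is
    // an arbitrary assumption; pick any core on the remote socket.
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(8, &set);                    // a core on the other socket
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        volatile unsigned long spin = 0;
        for (;;)                             // infinite loop keeps the core,
            spin++;                          // and hence the uncore, clocked up
    }

(taskset -c 8 sh -c 'while :; do :; done' does the same thing from a shell.)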

Quad-socket-capable Xeons (E7-xxxx) have a small snoop filter cache in each socket. Dual-socket systems simply spam the other socket with every snoop request, using a good fraction of the QPI bandwidth even when they're accessing their own local DRAM after an L3 miss.


I think Broadwell and Haswell Xeon can keep their uncore clock high even when all cores are idle, exactly to avoid this bottleneck.

Dr. Bandwidth says he disables package C1E state on his Haswell Xeons, but that probably wasn't necessary. He also posted some stuff about using Uncore perf counters to measure uncore frequency to find out what your CPU is really doing, and about BIOS settings that can affect the uncore frequency decision-making.
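
If you don't want to set up the uncore PMU, one shortcut (my assumption, not something from his posts) is reading the current uncore ratio from MSR 0x621 (UNCORE_PERF_STATUS), whose low 7 bits encode the ratio in multiples of 100 MHz on many recent Intel parts; verify against your model's documentation before trusting it:

    // Sketch: read the current uncore ratio via the Linux msr driver
    // (modprobe msr; run as root). MSR 0x621 (UNCORE_PERF_STATUS) with
    // bits[6:0] = ratio in 100 MHz units is an assumption that holds on
    // many recent Intel parts; check your model's docs.
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
        uint64_t val;
        if (pread(fd, &val, sizeof(val), 0x621) != (ssize_t)sizeof(val)) {
            perror("pread MSR 0x621");
            return 1;
        }
        printf("uncore ratio %llu => ~%llu MHz\n",
               (unsigned long long)(val & 0x7f),
               (unsigned long long)(val & 0x7f) * 100);
        close(fd);
        return 0;
    }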


More background: I found https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4, which covers some changes like new snoop mode options (which hop on the ring bus sends snoops to the other cores), but it doesn't mention clocks.

answered Dec 31 '22 by Peter Cordes


A larger cache may have a higher access time, but it can still sustain one access per cycle per port if it is fully pipelined. It may, however, constrain the maximum supported frequency.
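
A sketch of that latency-vs-throughput distinction (my example, with the usual quick-rdtsc caveats): a dependent chain pays the full load-to-use latency on every access, while independent loads overlap in the pipeline and approach the cache's per-cycle port limit.

    // Contrast latency-bound vs throughput-bound L1d accesses. The array
    // is small enough to stay in L1d; __rdtsc() counts reference cycles.
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    #define N (4096 / sizeof(uint64_t))
    #define ITERS 10000000UL

    static uint64_t a[N];

    int main(void) {
        for (size_t i = 0; i < N; i++)
            a[i] = (i + 1) % N;                      // index chain through a[]

        uint64_t t0 = __rdtsc();
        uint64_t idx = 0;
        for (uint64_t i = 0; i < ITERS; i++)
            idx = a[idx];                            // serialized: latency-bound
        uint64_t t1 = __rdtsc();

        uint64_t sum = 0;
        for (uint64_t i = 0; i < ITERS; i++)
            sum += a[i % N];                         // independent: pipelines
        uint64_t t2 = __rdtsc();

        printf("dependent:   %.2f cycles/load (%llu)\n",
               (double)(t1 - t0) / ITERS, (unsigned long long)idx);
        printf("independent: %.2f cycles/load (%llu)\n",
               (double)(t2 - t1) / ITERS, (unsigned long long)sum);
        return 0;
    }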

In modern Intel processors, the L1i/L1d and L2 caches and all functional units of a core are in the same frequency domain. On client processors, all cores of the same socket are also in the same frequency domain because they share the same frequency regulator. On server processors (starting with Haswell, I think), each core is in its own frequency domain.
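
You can watch the difference from userspace through the standard Linux cpufreq sysfs files (a sketch; readings are approximate, and they only diverge per core on parts with per-core domains):

    // Print each core's currently reported frequency. On client parts the
    // values track each other (one shared domain); on servers with per-core
    // P-states they can legitimately differ under mixed load.
    #include <stdio.h>

    int main(void) {
        for (int cpu = 0; ; cpu++) {
            char path[128];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
                     cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                       // no more CPUs
            long khz;
            if (fscanf(f, "%ld", &khz) == 1)
                printf("cpu%d: %.2f GHz\n", cpu, khz / 1e6);
            fclose(f);
        }
        return 0;
    }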

In modern Intel processors (since Nehalem, I think), the uncore (which includes the L3) is in a separate frequency domain. One interesting case is when a single socket is configured as two NUMA nodes. In that case, I think the uncore partitions of both NUMA nodes still exist in the same frequency domain.

There is special circuitry for crossing frequency domains, and all cross-domain communication has to pass through it. So yes, I think it incurs a small performance overhead.

There are other frequency domains. In particular, each DRAM channel operates in its own frequency domain. I don't know whether current processors support running different channels at different frequencies.

answered Dec 31 '22 by Hadi Brais