In Intel's optimization guide, section 2.1.3, they list a number of enhancements to the caches and memory subsystem in Skylake (emphasis mine):
The cache hierarchy of the Skylake microarchitecture has the following enhancements:
- Higher Cache bandwidth compared to previous generations.
- Simultaneous handling of more loads and stores enabled by enlarged buffers.
- Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.
- Page split load penalty down from 100 cycles in previous generation to 5 cycles.
- L3 write bandwidth increased from 4 cycles per line in previous generation to 2 per line.
- Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE.
- Reduced performance penalty for a software prefetch that specifies a NULL pointer.
- L2 associativity changed from 8 ways to 4 ways.
The final one caught my eye. In what way is a reduction in the number of ways an enhancement? By itself, it seems that fewer ways is strictly worse than more ways. Of course, I get that there might be valid engineering reasons why a reduction in the number of ways could be a tradeoff that enables other enhancements, but here it is positioned, by itself, as an enhancement.
What am I missing?
Building a large cache with these properties is impossible. Thus, designers keep it small, e.g. 32KB in most processors today. L2 is accessed only on L1 misses, so it is accessed far less often (usually about 1/20th as often as L1). L2 can therefore tolerate higher latency (e.g. 10 to 20 cycles) and have fewer ports.
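To see why a slower L2 is tolerable, here is a rough sketch of the usual average-access-time arithmetic. The 1/20 miss rate and the 10 to 20 cycle L2 latency come from the paragraph above; the 4-cycle L1 hit latency and the simplification that every L1 miss hits in L2 are assumptions made only for illustration.

```python
def average_access_cycles(l1_hit_cycles, l1_miss_rate, l2_hit_cycles):
    """Average cycles per access, assuming every L1 miss hits in L2.

    This ignores L3 and DRAM entirely, so it is a lower bound on the
    real average, not a model of any specific CPU.
    """
    return l1_hit_cycles + l1_miss_rate * l2_hit_cycles


if __name__ == "__main__":
    L1_HIT = 4          # assumed L1 hit latency in cycles (not from the text)
    MISS_RATE = 1 / 20  # "usually 1/20th" from the text
    for l2_latency in (10, 20):
        avg = average_access_cycles(L1_HIT, MISS_RATE, l2_latency)
        print(f"L2 latency {l2_latency} cycles -> average {avg:.2f} cycles")
```

Under these assumptions, doubling the L2 latency from 10 to 20 cycles moves the average by only about half a cycle, which is why L2 design can trade latency for capacity or power.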
(Level 2 cache) A memory bank built into the CPU chip, packaged within the same module or built on the motherboard. The L2 cache feeds the L1 cache, which feeds the processor. L2 memory is slower than L1 memory.
L1 is "level-1" cache memory, usually built onto the microprocessor chip itself. For example, the Intel MMX microprocessor comes with 32 thousand bytes of L1. L2 (that is, level-2) cache memory is on a separate chip (possibly on an expansion card) that can be accessed more quickly than the larger "main" memory.
The L2 cache size varies depending on the CPU, but it is typically between 256KB and 8MB. Most modern CPUs pack more than 256KB of L2 cache, and that size is now considered small. Some of the most powerful modern CPUs have an L2 cache larger than 8MB.
It's strictly worse for performance of the L2 cache.
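A toy simulation makes the "strictly worse" part concrete. This is only a sketch: it assumes true LRU replacement (real L2 replacement policies may differ), treats addresses as physical addresses, and uses a deliberately hostile access pattern in which six lines all map to the same set. The class and function names are made up for this example.

```python
from collections import OrderedDict

LINE_BYTES = 64  # cache line size on Intel x86 CPUs


class SetAssociativeCache:
    """Toy set-associative cache with true LRU replacement (an assumption)."""

    def __init__(self, size_bytes, ways, line_bytes=LINE_BYTES):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One OrderedDict per set: keys are tags, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        """Access one address; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # hit: mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False


def count_misses(ways, num_lines=6, rounds=100, size_bytes=256 * 1024):
    """Loop over num_lines addresses that all fall into set 0."""
    cache = SetAssociativeCache(size_bytes, ways)
    stride = 64 * 1024  # a multiple of num_sets * 64 for both 4 and 8 ways
    for _ in range(rounds):
        for i in range(num_lines):
            cache.access(i * stride)
    return cache.misses


if __name__ == "__main__":
    for ways in (4, 8):
        print(f"{ways}-way: {count_misses(ways)} misses in 600 accesses")
```

With six conflicting lines, the 8-way cache takes only the six cold misses, while the 4-way LRU cache evicts each line just before it is reused and misses on every access. Real workloads are rarely this pathological, which is presumably why the change was acceptable in practice.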
According to this AnandTech writeup of SKL-SP (aka skylake-avx512 or SKL-X), Intel has stated that "the main reason [for reducing associativity] was to make the design more modular". Skylake-AVX512 has 1MiB of L2 cache with 16-way associativity.
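As a quick sanity check on the geometry of these configurations, the number of sets follows directly from size, associativity, and the 64-byte line size. The helper below is a hypothetical name used only for this sketch.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return line count, set count, and index bits for a cache.

    Assumes size, ways, and line size are powers of two.
    """
    num_lines = size_bytes // line_bytes
    num_sets = num_lines // ways
    return {
        "lines": num_lines,
        "sets": num_sets,
        "index_bits": num_sets.bit_length() - 1,  # log2 of a power of two
    }


if __name__ == "__main__":
    KIB = 1024
    configs = [
        ("256 KiB, 8-way (pre-Skylake client)", 256 * KIB, 8),
        ("256 KiB, 4-way (Skylake client)", 256 * KIB, 4),
        ("1 MiB, 16-way (Skylake-AVX512)", 1024 * KIB, 16),
    ]
    for name, size, ways in configs:
        print(name, cache_geometry(size, ways))
```

The 4-way 256 KiB and 16-way 1 MiB configurations both come out at 1024 sets, so they differ only in the number of ways per set. Whether that is what Intel meant by "more modular" is a guess on my part, but the arithmetic is at least consistent with it.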
Presumably the drop to 4-way associativity doesn't hurt too badly in the dual and quad-core laptop and desktop parts (SKL-S), since there's lots of bandwidth to L3 cache. I think if Intel's simulations and testing had found that it hurt a lot, they would have put in the extra design time to keep the 8-way 256k cache on non-AVX512 Skylake.
The upside of lower associativity is power budget. It could indirectly help performance by allowing more turbo headroom, but mostly they did it to improve efficiency, NOT to improve speed. Freeing up some room in the power budget allows them to spend it elsewhere. Or not to spend all of it, and just use less power.
Mobile and many-core-server CPUs care a lot about power budget, much more than high-end quad-core desktop CPUs.
The heading on the list should more accurately read "changes", not "enhancements", but I'm sure the marketing department wouldn't let them write anything that didn't sound positive. :P At least Intel documents things accurately and in detail, including the ways new CPUs are worse than older designs.
AnandTech's SKL writeup suggests that dropping the associativity freed up the power budget to increase L2 bandwidth, which (in the big picture) compensates for the increased miss rate.
IIRC, Intel has a policy that any proposed design change must have a 2:1 ratio of perf gain to power cost, or something like that. So presumably if they lose 1% performance but save 3% power with this L2 change, they do it. The 2:1 number might be correct, if I'm remembering this correctly, but the 1% and 3% in that example are totally made up.
There was some discussion of this change in one of the podcast interviews David Kanter did right after details were released at IDF. IDK if this is the right link.