Skylake L2 cache enhanced by reducing associativity?

In Intel's optimization guide, section 2.1.3, they list a number of enhancements to the caches and memory subsystem in Skylake (emphasis mine):

The cache hierarchy of the Skylake microarchitecture has the following enhancements:

Higher Cache bandwidth compared to previous generations.

Simultaneous handling of more loads and stores enabled by enlarged buffers.

Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.

Page split load penalty down from 100 cycles in previous generation to 5 cycles.

L3 write bandwidth increased from 4 cycles per line in previous generation to 2 per line.

Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE.

Reduced performance penalty for a software prefetch that specifies a NULL pointer.

L2 associativity changed from 8 ways to 4 ways.

The final one caught my eye. In what way is a reduction in the number of ways an enhancement? By itself, it seems that fewer ways is strictly worse than more ways. Of course, I get that there might be valid engineering reasons why a reduction in the number of ways could be a tradeoff that enables other enhancements, but here it is positioned, by itself, as an enhancement.

What am I missing?

asked Jun 22 '16 01:06

BeeOnRope

1 Answer

It's strictly worse for performance of the L2 cache.

According to this AnandTech writeup of SKL-SP (aka skylake-avx512 or SKL-X), Intel has stated that "the main reason [for reducing associativity] was to make the design more modular". For comparison, Skylake-AVX512 has 1MiB of L2 cache per core with 16-way associativity, up from 256k / 4-way in regular Skylake.
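A quick sanity check on the geometry behind that "modular" comment, as a sketch. It assumes 64-byte lines and the usual relation sets = capacity / (line size × ways); `num_sets` is just a helper name for this example. The 4-way 256k and 16-way 1MiB configurations come out with the same number of sets, which is at least consistent with one L2 design being a superset of the other, though that reading is an inference, not something Intel has spelled out.

```python
# Sketch: number of sets for a few L2 geometries, assuming 64-byte lines.
LINE_BYTES = 64
KIB = 1024


def num_sets(capacity_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache of the given geometry."""
    lines, rem = divmod(capacity_bytes, line_bytes)
    if rem or lines % ways:
        raise ValueError("capacity must be a whole number of sets")
    return lines // ways


configs = {
    "256k 8-way (pre-Skylake L2)": (256 * KIB, 8),
    "256k 4-way (Skylake client L2)": (256 * KIB, 4),
    "1MiB 16-way (Skylake-AVX512 L2)": (1024 * KIB, 16),
}

for name, (capacity, ways) in configs.items():
    print(f"{name}: {num_sets(capacity, ways)} sets")
```

The 8-way 256k cache has 512 sets, while both Skylake variants have 1024.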

Presumably the drop to 4-way associativity doesn't hurt too badly in the dual and quad-core laptop and desktop parts (SKL-S), since there's lots of bandwidth to L3 cache. I think if Intel's simulations and testing had found that it hurt a lot, they would have put in the extra design time to keep the 8-way 256k cache on non-AVX512 Skylake.

The upside of lower associativity is power budget. It could indirectly help performance by allowing more turbo headroom, but mostly they did it to improve efficiency, NOT to improve speed. Freeing up some room in the power budget lets them spend it elsewhere, or not spend all of it and just use less power.

Mobile and many-core-server CPUs care a lot about power budget, much more than high-end quad-core desktop CPUs.

The heading on the list should more accurately read "changes", not "enhancements", but I'm sure the marketing department wouldn't let them write anything that didn't sound positive. :P At least Intel documents things accurately and in detail, including the ways new CPUs are worse than older designs.

AnandTech's SKL writeup suggests that dropping the associativity freed up the power budget to increase L2 bandwidth, which (in the big picture) compensates for the increased miss rate.
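To make "increased miss rate" concrete, here is a toy model of the worst case. It is only a sketch under loud assumptions: true-LRU replacement, 64-byte lines, indexing straight off the address, and a hypothetical access pattern (six lines spaced 64 KiB apart, walked repeatedly) chosen so that all six land in the same set in both configurations. Real hardware indexes on physical addresses and doesn't necessarily use true LRU, so treat this as an illustration of conflict misses, not a model of Skylake.

```python
from collections import OrderedDict

LINE_BYTES = 64


def count_misses(addresses, capacity_bytes, ways, line_bytes=LINE_BYTES):
    """Count misses in a toy set-associative cache with true-LRU replacement."""
    n_sets = capacity_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        lru = sets[line % n_sets]        # set selected by the low line-index bits
        tag = line // n_sets
        if tag in lru:
            lru.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)  # set full: evict least recently used
            lru[tag] = True
    return misses


# Hypothetical pattern: 6 lines, 64 KiB apart, walked 100 times.
STRIDE = 64 * 1024
trace = [i * STRIDE for i in range(6)] * 100

misses_8way = count_misses(trace, 256 * 1024, ways=8)
misses_4way = count_misses(trace, 256 * 1024, ways=4)
print("8-way misses:", misses_8way)
print("4-way misses:", misses_4way)
```

With 8 ways, the six lines all fit in one set and only the first pass misses. With 4 ways, cycling through six lines under LRU evicts each line just before it is needed again, so every access misses. Most real workloads sit nowhere near this pathological case, which is why the change doesn't have to hurt much overall.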

IIRC, Intel has a policy that any proposed design change must have a 2:1 ratio of perf gain to power cost, or something like that. So presumably if this L2 change loses 1% performance but saves 3% power, they do it. The 2:1 number might be correct, if I'm remembering this right, but the 1% and 3% figures are totally made up.
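Spelled out as arithmetic, the reasoning runs in reverse for a feature being removed: treat the extra associativity as a feature that buys some performance for some power, and ask whether it clears the bar. This is only a sketch. The `worth_keeping` helper is hypothetical, the 2:1 threshold is the half-remembered figure above, and the 1% / 3% inputs are the same made-up numbers.

```python
def worth_keeping(perf_gain_pct: float, power_cost_pct: float,
                  required_ratio: float = 2.0) -> bool:
    """True if perf gain is at least required_ratio times the power cost."""
    return perf_gain_pct >= required_ratio * power_cost_pct


# Made-up example: keeping 8-way buys 1% performance and costs 3% power.
keep_8way = worth_keeping(perf_gain_pct=1.0, power_cost_pct=3.0)
print("keep 8-way L2:", keep_8way)
```

A feature that delivers a 1:3 ratio falls well short of a 2:1 bar, so under this rule it would get cut.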

There was some discussion of this change in one of the podcast interviews David Kanter did right after details were released at IDF. IDK if this is the right link.

answered Sep 28 '22 08:09

Peter Cordes

Tags:

x86

cpu-cache

cpu

intel
