
How does the indexing of the Ice Lake's 48KiB L1 data cache work?

The Intel optimization manual (September 2019 revision) shows a 48 KiB, 8-way associative L1 data cache for the Ice Lake microarchitecture.

[Figure: Ice Lake's 48 KiB L1 data cache and its 8-way associativity. Footnote 1: Software-visible latency/bandwidth will vary depending on access patterns and other factors.]

This baffled me because:

  • There are 96 sets (48 KiB / 64 / 8), which is not a power of two.
  • The set-index bits and the byte-offset bits add up to more than 12 bits (7 + 6 = 13), which makes the cheap PIPT-as-VIPT trick unavailable for 4 KiB pages.
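
The arithmetic behind both points can be checked with a small sketch. The function name `cache_geometry` and its parameters are hypothetical, introduced here only for illustration, and the 8-way figure from the manual is taken at face value.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # bits selecting a byte within a line
    index_bits = (sets - 1).bit_length()        # bits needed to name every set
    return sets, offset_bits, index_bits

PAGE_OFFSET_BITS = 12  # 4 KiB pages

sets, offset_bits, index_bits = cache_geometry(48 * 1024, 64, 8)
is_pow2 = (sets & (sets - 1)) == 0

print("sets:", sets)                                   # 96
print("power of two:", is_pow2)                        # False
print("offset + index bits:", offset_bits + index_bits)  # 13
print("fits in page offset:", offset_bits + index_bits <= PAGE_OFFSET_BITS)  # False
```

With 96 sets, seven index bits are needed, so one of them would have to come from above the 4 KiB page offset.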

All in all, it seems that the cache is more expensive to handle, yet the latency increased only slightly (if it did at all, depending on what exactly Intel means by that number).

With a bit of creativity, I can still imagine a fast way to index 96 sets but point two seems an important breaking change to me.

What am I missing?

Margaret Bloom asked Jan 19 '20 12:01


2 Answers

The optimization manual is wrong.

According to the CPUID instruction, the associativity is 12 (on a Core i5-1035G1). See also uops.info/cache.html and en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client).

This means that there are 64 sets, which is the same as in previous microarchitectures.
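
As a rough sketch of why 12 ways restores the familiar layout: the helper `set_index` below is a hypothetical name, and it assumes plain modulo indexing on the line address rather than any hashed scheme.

```python
LINE_BYTES = 64
WAYS = 12
SIZE_BYTES = 48 * 1024

SETS = SIZE_BYTES // (LINE_BYTES * WAYS)

def set_index(addr):
    """Set selected by an address under plain modulo indexing (address bits 11..6)."""
    return (addr // LINE_BYTES) % SETS

# Two made-up addresses with the same 4 KiB page offset (0x234) but different
# page numbers. They select the same set, so the index does not depend on
# address translation.
a = 0x7F001234
b = 0x12345234

print(SETS, set_index(a), set_index(b))  # 64 8 8
```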

Andreas Abel answered Oct 22 '22 17:10


Both the optimization manual and the datasheet of the processor family (Section 2.4.2) state that the L1 data cache is 8-way associative. Another source is InstLatx64, which provides cpuid dumps for many processors, including Ice Lake processors. Take, for example, the dump for the i7-1065G7:

CPUID 00000004: 1C004121-02C0003F-0000003F-00000000 [SL 00]

Cache information can be found in cpuid leaf 0x4. The Intel SDM Volume 2 discusses how to decode these register values (EAX-EBX-ECX-EDX, from left to right). Bits 31-22 of EBX (the second value from the left) represent the number of ways minus one. These bits in binary are 1011, which is 11 in decimal. So cpuid says that there are 12 ways. Other information we can obtain from here is that the L1 data cache is 48 KiB in size, has a 64-byte cache line size, and uses the simple addressing scheme. So based on the cpuid information, bits 11-6 of the address represent the cache set index.
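
The decoding can be reproduced with a short sketch. The function name `decode_cpuid_leaf4` and the dictionary keys are made up for illustration; the bit positions follow the leaf 0x4 layout described in the SDM, and the input is the dump line quoted above.

```python
def decode_cpuid_leaf4(eax, ebx, ecx, edx):
    """Decode one subleaf of CPUID leaf 0x4 (deterministic cache parameters)."""
    ways = ((ebx >> 22) & 0x3FF) + 1        # EBX[31:22] = ways - 1
    partitions = ((ebx >> 12) & 0x3FF) + 1  # EBX[21:12] = line partitions - 1
    line_size = (ebx & 0xFFF) + 1           # EBX[11:0]  = line size - 1
    sets = ecx + 1                          # ECX        = sets - 1
    return {
        "type": eax & 0x1F,                 # EAX[4:0], 1 = data cache
        "level": (eax >> 5) & 0x7,          # EAX[7:5]
        "ways": ways,
        "partitions": partitions,
        "line_size": line_size,
        "sets": sets,
        "size_bytes": ways * partitions * line_size * sets,
        "complex_indexing": bool(edx & 0x4),  # EDX[2], 0 = simple (direct) indexing
    }

# CPUID 00000004: 1C004121-02C0003F-0000003F-00000000 [SL 00]
info = decode_cpuid_leaf4(0x1C004121, 0x02C0003F, 0x0000003F, 0x00000000)
for key, value in info.items():
    print(f"{key}: {value}")
```

For this dump the sketch reports a level-1 data cache with 12 ways, 64 sets, 64-byte lines, and 49152 bytes (48 KiB) in total, with the complex-indexing bit clear.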

So which one is right? The optimization manual could be wrong (and that wouldn't be the first time), but also the cpuid dump could be buggy (and that also wouldn't be the first time). Well, both could be wrong, but this is historically much less likely. Other examples of discrepancies between the manual and cpuid information are discussed here, so we know that errors exist in both sources. Moreover, I'm not aware of any other Intel source that mentions the number of ways in the L1D. Of course, non-Intel sources could be wrong as well.

Having 8 ways with 96 sets would be an unusual design, and it is unlikely that Intel would introduce one with nothing more than a single number in the optimization manual to document it (although that doesn't necessarily mean that the cache has to have 12 ways). This by itself makes the manual more likely to be wrong here.

Fortunately, Intel does document implementation bugs in their processors in the spec update documents. We can check the spec update document for the Ice Lake processors, which you can find here. Two cpuid bugs are documented there:

CPUID TLB Information is Inaccurate

I've already discussed this issue in my answer on Understanding TLB from CPUID results on Intel. The second bug is:

CPUID L2 Cache Information May Be Inaccurate

This is not relevant to your question.

The fact that the spec update document mentions some cpuid bugs, but none affecting the L1 data cache, strongly suggests that the information from cpuid leaf 0x4 was validated by Intel and is accurate. So the optimization manual and the datasheet are probably wrong in this case.

Hadi Brais answered Oct 22 '22 17:10