Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding TLB from CPUID results on Intel

I'm exploring leaf 0x02 of the cpuid instruction and came up with a few questions. There is a table in the documentation which describes what cpuid results mean for the TLB configuration. Here they are:

case 1

56H TLB Data TLB0: 4 MByte pages, 4-way set associative, 16 entries
[...]
B4H TLB Data TLB1: 4 KByte pages, 4-way associative, 256 entries

Does it mean that there are only 2 levels of TLB? How to query the number of levels of TLB cache in case some x86 vendor decides to provide 3 levels of TLB?

case 2

57H TLB Data TLB0: 4 KByte pages, 4-way associative, 16 entries
[...] 
B4H TLB Data TLB1: 4 KByte pages, 4-way associative, 256 entries

Is "4-way associative" here just a typo meaning that "4-way set associative"?

case 3

55H TLB Instruction TLB: 2-MByte or 4-MByte pages, fully associative, 7 entries
[...]
6AH Cache uTLB: 4 KByte pages, 8-way set associative, 64 entries
6BH Cache DTLB: 4 KByte pages, 8-way set associative, 256 entries

Does DTLB stand for Data TLB? What does uTLB mean? uops-TLB? Which TLB cache level is considered here?

case 4

C1H STLB Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries

Does this mean that in that case the 2nd level TLB is shared among all cores? So when not specified explicitly is the TLB cache core private?

like image 633
St.Antario Avatar asked Sep 27 '19 06:09

St.Antario


1 Answers

How to query the number of levels of TLB cache in case some x86 vendor decides to provide 3 levels of TLB?

Leaf 0x2 may return TLB information only on Intel processors. It's reserved on all current AMD processors. On all current Intel processors, there is no single number that tells you the number of TLB levels. The only way to determine the number of levels is by enumerating all the TLB-related cpuid leafs or subleafs. The following algorithm works on all current Intel processors that support the cpuid instruction (up to and including Ice Lake, Goldmont Plus, and Knights Mill):

  1. Check whether the value 0xFE exists in any of the four registers EAX, EBX, ECX, and EDX returned when cpuid is executed with EAX set to leaf 0x2.
  2. If 0xFE doesn't exist, enumerate all the bytes in the four registers. Based on Table 3-12 of the Intel manual Volume 2 (number 325383-070US), there will be either one or two descriptors of data TLBs that can cache 4KB translations. The Intel manual uses the following different names for TLBs that may cache data access translations: Data TLB, Data TLB0, Data TLB1, DTLB, uTLB, and Shared 2nd-Level TLB. If there are two such descriptors, then the number of levels is two. The descriptor with the larger number of TLB numbers is the one for the second-level TLB. If there is only one such descriptor, the number of levels is one.
  3. If 0xFE exists, the TLB information needs to be obtained from cpuid leaf 0x18. Enumerate all the valid subleafs up to the maximum valid subleaf number. If there is at least one subleaf with the least two significant bits of EDX equal to 11, then the number of TLB levels is two. Otherwise, the number of TLB levels is one.

The TLB information for Ice Lake and Goldmont Plus processors is present in leaf 0x18. This leaf provides more flexibility in encoding TLB information. The TLB information for all other current Intel processors is present in leaf 0x2. I don't know about Knights Mill (if someone has access to a Knights Mill, please consider sharing the cpuid dump).

Determining the number of TLB levels is not sufficient to fully describe how the levels are related to each other. Current Intel processors implement two different 2-level TLB hierarchies:

  • The second-level TLB can cache translations for data loads (including prefetches), data stores, and instruction fetches. The second-level TLB is called in this case "Shared 2nd-Level TLB."
  • The second-level TLB can cache translations for data loads and stores, but not instruction fetches. The second-level TLB is called in this case any of the following: Data TLB, Data TLB1, or DTLB.

I'll discuss a couple of examples based on the cpuid dumps from InstLatx64. On one of the Haswell processors with hyperthreading enabled, leaf 0x2 provides the following information in the four registers:

76036301-00F0B5FF-00000000-00C10000

There is no 0xFE, so the TLB information is present in this leaf itself. According to Table 3-12:

76: Instruction TLB: 2M/4M pages, fully associative, 8 entries
03: Data TLB: 4 KByte pages, 4-way set associative, 64 entries
63: Data TLB: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries and a separate array with 1 GByte pages, 4-way set associative, 4 entries
B5: Instruction TLB: 4KByte pages, 8-way set associative, 64 entries
C1: Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries

The other bytes are not relevant to TLBs.

There is one discrepancy compared to Table 2-17 of the Intel optimization manual (number 248966-042b). Table 2-17 mentions that the instruction TLB for 4KB entries has 128 entries, 4-way associative, and is dynamically partitioned between the two hyperthreads. But the TLB dump says that it's 8-way associative and there are only 64 entries. There is actually no encoding for a 4-way ITLB with 128-entries, so I think the manual is wrong. Anyway, C1 shows that there are two TLB levels and the second level caches data and instruction translations.

On one of the Goldmont processors, leaf 0x2 provides the following information in the four registers:

6164A001-0000FFC4-00000000-00000000

Here is the interpretation of the TLB-relevant bytes:

61: Instruction TLB: 4 KByte pages, fully associative, 48 entries
64: Data TLB: 4 KByte pages, 4-way set associative, 512 entries
A0: DTLB: 4k pages, fully associative, 32 entries
C4: DTLB: 2M/4M Byte pages, 4-way associative, 32 entries

There are two data TLBs for 4KB pages, one has 512 entries and the other has 32 entries. This means that the processor has two levels of TLBs. The second level is called "Data TLB" and so it can only cache data translations.

Table 19-4 of the optimization manual mentions that the ITLB in Goldmont supports large pages, but this information is not present in the TLB information. The data TLB information is consistent with Table 19-7 of the manual, except that the "Data TLB" and "DTLB" are called "DTLB" and "uTLB", respectively, in the manual.

On one of the Knights Landing processors, leaf 0x2 provides the following information in the four registers:

6C6B6A01-00FF616D-00000000-00000000
6C: DTLB: 2M/4M pages, 8-way set associative, 128 entries
6B: DTLB: 4 KByte pages, 8-way set associative, 256 entries
6A: uTLB: 4 KByte pages, 8-way set associative, 64 entries
61: Instruction TLB: 4 KByte pages, fully associative, 48 entries
6D: DTLB: 1 GByte pages, fully associative, 16 entries

So there are two TLB levels. The first one consists of multiple structures for different page sizes. The TLB for 4KB pages is called uTLB and the TLBs for the other pages sizes are called DTLBs. The second level TLB is called DTLB. These numbers and names are consistent with Table 20-3 from the manual.

Silvermont processors provide the following TLB information:

61B3A001-0000FFC2-00000000-00000000
61: Instruction TLB: 4 KByte pages, fully associative, 48 entries
B3: Data TLB: 4 KByte pages, 4-way set associative, 128 entries
A0: DTLB: 4k pages, fully associative, 32 entries
C2: DTLB: 4 KByte/2 MByte pages, 4-way associative, 16 entries

This information is consistent with the manual, except for C2. I think it should say "4 MByte/2 MByte" instead of "4 KByte/2 MByte." It's probably a typo in the manual.

The Intel Penryn microarchitecture is an example where the TLB information uses the names TLB0 and TLB1 to refer to the first and second level TLBs:

05: Data TLB1: 4 MByte pages, 4-way set associative, 32 entries
B0: Instruction TLB: 4 KByte pages, 4-way set associative, 128 entries
B1: Instruction TLB: 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries
56: Data TLB0: 4 MByte pages, 4-way set associative, 16 entries
57: Data TLB0: 4 KByte pages, 4-way associative, 16 entries
B4: Data TLB1: 4 KByte pages, 4-way associative, 256 entries

Older Intel processors have single-level TLB hierarchies. For example, here is the TLB information for Prescott:

5B: Data TLB: 4 KByte and 4 MByte pages, 64 entries
50: Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 64 entries

All Intel 80386 processors and some Intel 80486 processors include a single-level TLB hierarchy, but don't support the cpuid instruction. On processors earlier than 80386, there is no paging. If you want the algorithm above to work on all Intel x86 processors, you'll have to consider these cases as well. The Intel document number 241618-025 titled "Processor Identification and the CPUID Instruction," which can be found here, discusses how to handle these cases in Chapter 7.

I'll discuss an example where the TLB information is present in leaf 0x18 rather than leaf 0x2. Like I said earlier, the only existing Intel processors that have the TLB information present in 0x18 are Ice Lake and Goldmont Plus processors (and maybe Knights Mill). The leaf 0x2 dump for an Ice Lake processor is:

00FEFF01-000000F0-00000000-00000000

There is an 0xFE byte, so the TLB information is present in the more powerful leaf 0x18. Subleaf 0x0 of leaf 0x18 specifies that the maximum valid subleaf is 0x7. Here are the dumps for subleafs 0x0 to 0x7:

00000007-00000000-00000000-00000000 [SL 00]
00000000-00080007-00000001-00004122 [SL 01]
00000000-0010000F-00000001-00004125 [SL 02]
00000000-00040001-00000010-00004024 [SL 03]
00000000-00040006-00000008-00004024 [SL 04]
00000000-00080008-00000001-00004124 [SL 05]
00000000-00080007-00000080-00004043 [SL 06]
00000000-00080009-00000080-00004043 [SL 07]

The Intel manual describes how to decode these bits. Each valid subleaf describes a single TLB structure. A subleaf is valid (i.e., describes a TLB structure) if the least significant five bits of EDX are not all zeros. Hence, subleaf 0x0 is invalid. The next seven subleafs are all valid, which means that there are 7 TLB descriptors in an Ice Lake processor. The least significant five bits of EDX specify the type of the TLB and the next three bits specify the level of the TLB. The following information is obtained by decoding the subleaf bits:

  • [SL 01]: Describes a first-level instruction TLB that is an 8-way fully associative cache capable of caching translations for 4KB, 2MB, and 4MB pages.
  • [SL 02]: The least significant five bits represent the number 5, which is a reserved encoding according to the most recent version of the manual (Volume 2). The other bits specify a TLB that is 16-way fully associative and capable of caching translations for all page sizes. Intel has provided information on the TLBs in Ice Lake in Table 2-5 of the optimization manual. The closest match shows that the reserved encoding 5 most likely represents a first-level TLB for data store translations.
  • [SL 03]: The least significant five bits represent the number 4, which is also a reserved encoding according to the most recent version of the manual. The closest match with Table 2-5 suggests that it represents a first-level TLB for data loads that can cache 4KB translations. The number of ways and sets matches Table 2-5.
  • [SL 04]: Similar to subleaf 0x3. The closest match with Table 2-5 suggests that it represents a first-level TLB for data loads that can cache 2MB and 4MB translations. The number of ways and sets matches Table 2-5.
  • [SL 05]: Similar to subleaf 0x3. The closest match with Table 2-5 suggests that it represents a first-level TLB for data loads that can cache 1GB translations. The number of ways and sets matches Table 2-5.
  • [SL 06]: Describes a second-level unified TLB consisting of 8 ways and 128 sets and capable of caching translations for 4KB, 2MB, and 4MB pages.
  • [SL 07]: Describes a second-level unified TLB consisting of 8 ways and 128 sets and capable of caching translations for 4KB and 1GB pages.

Table 2-5 actually mentions that there is only one unified TLB structure, but half of the ways can only cache translations for 4KB, 2MB, and 4MB pages and the other half can only cache translations for 4KB and 1GB pages. So the TLB information for the second-level TLB is consistent with the manual. However, the TLB information for the instruction TLB is not consistent with Table 2-5. The manual is probably correct. The ITLB for 4KB pages seems to be mixed up with that for 2MB and 4MB pages in the TLB information dump.

On AMD processors, the TLB information for the first-level and second-level TLBs is provided in leafs 8000_0005 and 8000_0006, respectively. More information can be found in the AMD manual Volume 3. AMD processors earlier than the K5 don't support the cpuid and some of these processors include a single-level TLB. So if you care about these processors, you need an alternative mechanism to determine whether a TLB exists. Zen 2 adds 1GB support at both TLB levels. Information on these TLBs can be found in leaf 8000_0019.

AMD Zen has a three-level instruction TLB hierarchy according to AMD. This is the first core microarchitecture that I know of that uses a three-level TLB hierarchy. Most probably this is also the case on AMD Zen+ and AMD Zen 2 (but I couldn't find an AMD source that confirms this). There appears to be no documented cpuid information on the L0 ITLB. So you'll probably have to check whether the processor is AMD Zen or later and provide the L0 ITLB information (8 entries for all page sizes, probably fully associative) manually for these processors.

Is "4-way associative" here just a typo meaning that "4-way set associative"?

It's not a typo. These terms are synonyms and both are commonly used.

Does DTLB stand for Data TLB? What does uTLB mean? uosp-TLB? Which TLB cache level is considered here?

DTLB and uTLB are both names for data TLBs. The DTLB name is used for both the first-level and second-level TLBs. The uTLB name is only used for the first-level data TLB and is short for micro-TLB.

Does this mean that in that case the 2-nd level TLB is shared among all cores? So when not specified explicitly is the TLB cache core private?

The term "shared" here means "unified" as in both data and instruction translations can be cached. Intel should have called it UTLB (capital U) or Unified TLB, which is the name used in the modern leaf 0x18.

like image 110
Hadi Brais Avatar answered Sep 22 '22 00:09

Hadi Brais