Does a hyper-threaded core share MMU and TLB?

To my knowledge, neither the MMU nor the TLB is shared between the two logical processors of a hyper-threaded core on Intel x86_64.

But then, if two threads that do not share an address space are scheduled onto the same physical core, how do they run?

I think that in this case the threads never hit in the TLB, because each thread has its own address space.

If so, performance would be significantly degraded, in my opinion.

asked Jul 16 '18 by Jonggyu Park




1 Answer

The TLBs are organized in Intel processors as follows:

  • Intel NetBurst (the first to support HT): The ITLB is replicated. The DTLB is competitively shared.
  • Intel Nehalem (the second to support HT), Westmere, Sandy Bridge, and Ivy Bridge: The huge page ITLB is replicated. The small page ITLB is statically partitioned. All DTLBs are competitively shared.
  • Intel Haswell, Broadwell, and Skylake: The small page ITLB is dynamically partitioned. The huge page ITLB is replicated. Table 2-12 of the optimization manual (September 2019) says that the policy is "fixed" for the other TLBs. I thought this means static partitioning. But according to the experimental results of the paper titled Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks (Section 6), it appears that "fixed" means competitive sharing. That would be consistent with earlier and later microarchitectures.
  • Sunny Cove: The ITLBs are statically partitioned. All DTLBs and the STLB are competitively shared.
  • AMD Zen, Zen+, Zen 2 (Family 17h): All TLBs are competitively shared.

References:

  • For NetBurst: https://software.intel.com/en-us/articles/introduction-to-hyper-threading-technology.
  • For the other Intel microarchitectures: The information can be found in the Optimization Reference Manual.
  • For the AMD microarchitectures: The information can be found in the Software Optimization Guide.

It's not clear to me how the TLBs are organized in any of the Intel Atom microarchitectures. I think that the L1 DTLB and STLB (in Goldmont Plus) or L2 DTLB (in earlier microarchitectures) are shared. According to Section 8.7.13.2 of the Intel SDM V3 (October 2019):

In processors supporting Intel Hyper-Threading Technology, data cache TLBs are shared. The instruction cache TLB may be duplicated or shared in each logical processor, depending on implementation specifics of different processor families.

However, this is not fully accurate, since an ITLB can be partitioned as well.

I don't know about the ITLBs in Intel Atoms.

(By the way, in older AMD processors, all the TLBs are replicated per core. See: Physical core and Logical cores on different cpu AMD/Intel.)

When a TLB is shared, each TLB entry is tagged with the ID of the logical processor that allocated it. (This ID is a single bit and is distinct from the process-context identifier, which can be disabled or may not be supported.) If another thread gets scheduled on a logical core and accesses a different virtual address space than the previous thread, the OS has to load the base physical address of the first-level paging structure into CR3. Whenever CR3 is written, the core automatically flushes all entries in all shared TLBs that are tagged with the ID of that logical core. Other operations may also trigger this flushing.

Partitioned and replicated TLBs don't need to be tagged with logical core IDs.

If process-context identifiers (PCIDs) are supported and enabled, logical core IDs are not used because PCIDs are more powerful. Note that partitioned and replicated TLBs are tagged with PCIDs.

Related: Address translation with multiple pagesize-specific TLBs.

(Note that there are other paging structure caches and they are organized similarly.)

(Note that usually the TLB is considered to be part of the MMU. The Wikipedia article on MMU shows a figure from an old version of a book that indicates that they are separate. However, the most recent version of the book has removed the figure and says that the TLB is part of the MMU.)

answered Oct 11 '22 by Hadi Brais