 

How are MTRR registers implemented? [closed]

x86/x86-64 exposes MTRRs (Memory Type Range Registers) that can be used to designate different portions of the physical address space for different usages (e.g., Cacheable, Uncacheable, Write-Combining, etc.).

My question is: does anybody know how these constraints on the physical address space, as defined by the MTRRs, are enforced in hardware? On each memory access, does the hardware check whether the physical address falls in a given range before the processor decides whether it should look up the cache, look up the write-combining buffer, or send the request directly to the memory controller?

Thanks

asked by Arka
1 Answer

Wikipedia says in the article MTRR that:

Newer (primarily 64-bit) x86 CPUs support a more advanced technique called Page Attribute Tables that allow for per-table setting of these modes, instead of having a limited number of low-granularity registers

So, for newer x86/x86_64 CPUs, the MTRRs can be thought of as a mechanism that works in addition to PAT (Page Attribute Tables). The PAT information lives in memory in the page tables (as bits in each Page Table Entry, or PTE), and inside the CPU it is cached in the TLB (which is part of the MMU). The TLB (and MMU) is already visited on every memory access, so I think it is a natural place to also keep the memory type, possibly even the type that comes from the MTRRs(?)
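To make the PAT mechanism concrete, here is a minimal C sketch (an illustration based on the bit layout documented in the Intel SDM, not a page-walk implementation) of how the PWT/PCD/PAT bits of a 4 KiB PTE form a 3-bit index that selects one of the eight memory-type fields of the IA32_PAT MSR:

```c
#include <stdint.h>
#include <stdio.h>

/* Bit positions in a 4 KiB page-table entry (Intel SDM, vol. 3).
 * For 2 MiB / 4 MiB pages the PAT bit moves to bit 12. */
#define PTE_PWT (1ULL << 3)   /* Page-level Write-Through */
#define PTE_PCD (1ULL << 4)   /* Page-level Cache Disable */
#define PTE_PAT (1ULL << 7)   /* Page Attribute Table bit */

/* The 3-bit index selects one of the eight 8-bit fields (PA0..PA7)
 * of the IA32_PAT MSR (0x277); each field holds a memory-type
 * encoding such as 0=UC, 1=WC, 4=WT, 5=WP, 6=WB, 7=UC-. */
static unsigned pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4 : 0) |
           ((pte & PTE_PCD) ? 2 : 0) |
           ((pte & PTE_PWT) ? 1 : 0);
}

static unsigned pat_type(uint64_t ia32_pat, uint64_t pte)
{
    return (ia32_pat >> (8 * pat_index(pte))) & 0xff;
}

int main(void)
{
    uint64_t ia32_pat = 0x0007040600070406ULL; /* power-on default IA32_PAT value */
    uint64_t pte      = PTE_PWT;               /* hypothetical PTE with only PWT set */
    printf("PAT index %u -> memory type %u (4 = WT with the default PAT)\n",
           pat_index(pte), pat_type(ia32_pat, pte));
    return 0;
}
```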

But what if I stop guessing and open the book instead? There is one very good book about the x86 world: The Unabridged Pentium 4: IA32 Processor Genealogy (ISBN-13: 978-0321246561). See Part 7, chapter 24 "Pentium Pro software enhancement", section "MTRR added".

The rules for each MTRR memory type are spelled out at length on pages 582-584, but the rules for all 5 types (Uncacheable=UC, Write-Combining=WC, Write-Through=WT, Write-Protect=WP, Write-Back=WB) begin with: "Cache lookups are performed".

And in Part 9 "Pentium III", chapter 32 "Pentium III Xeon", the book clearly says:

When it has to perform a memory access, the processor consults both the MTRRs and the selected PTE or PDE to determine the memory type (and therefore the rules of conduct it is to follow).

But on the other hand... a WRMSR into the MTRR registers will invalidate the TLBs (according to the Intel instruction manual "instruct32.chm"):

When the WRMSR instruction is used to write to an MTRR, the TLBs are invalidated, including the global entries (see "Translation Lookaside Buffers (TLBs)" in Chapter 3 of the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 3).
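As an aside: on Linux you can inspect these MSRs from user space through the msr driver (/dev/cpu/N/msr). Below is a small sketch that dumps the variable-range MTRRs; it assumes `modprobe msr` has been done and that it runs as root, and the MSR numbers are the ones documented in the Intel SDM.

```c
/* Dump the variable-range MTRRs via /dev/cpu/0/msr (Linux msr driver).
 * Build: gcc -O2 mtrr_dump.c -o mtrr_dump ; run as root after `modprobe msr`. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IA32_MTRRCAP        0x0FE  /* bits 7:0 = number of variable ranges */
#define IA32_MTRR_DEF_TYPE  0x2FF  /* bits 7:0 = default type, bit 11 = enable */
#define IA32_MTRR_PHYSBASE0 0x200  /* PHYSBASEn = 0x200 + 2n */
#define IA32_MTRR_PHYSMASK0 0x201  /* PHYSMASKn = 0x201 + 2n */

static uint64_t rdmsr(int fd, uint32_t msr)
{
    uint64_t v;
    if (pread(fd, &v, sizeof v, msr) != sizeof v) {
        perror("pread");
        exit(1);
    }
    return v;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t cap = rdmsr(fd, IA32_MTRRCAP);
    uint64_t def = rdmsr(fd, IA32_MTRR_DEF_TYPE);
    unsigned vcnt = cap & 0xff;

    printf("variable MTRRs: %u, default type: %" PRIu64 ", MTRRs enabled: %s\n",
           vcnt, def & 0xff, (def & (1 << 11)) ? "yes" : "no");

    for (unsigned i = 0; i < vcnt; i++) {
        uint64_t base = rdmsr(fd, IA32_MTRR_PHYSBASE0 + 2 * i);
        uint64_t mask = rdmsr(fd, IA32_MTRR_PHYSMASK0 + 2 * i);
        if (!(mask & (1 << 11)))              /* bit 11 of PHYSMASK = valid */
            continue;
        printf("MTRR%u: base=%#" PRIx64 " mask=%#" PRIx64 " type=%" PRIu64 "\n",
               i, base & ~0xfffULL, mask & ~0xfffULL, base & 0xff);
    }
    close(fd);
    return 0;
}
```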

And there is one more direct hint in the "Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A", section "10.11.9 Large page considerations":

The MTRRs provide memory typing for a limited number of regions that have a 4 KByte granularity (the same granularity as 4-KByte pages). The memory type for a given page is cached in the processor’s TLBs.

You asked:

On each memory access does the hardware check whether the physical address falls in a given range

No. Each memory access is not compared with all the MTRRs. The MTRR ranges are combined with the PTE bits once, when a PTE is loaded into the TLB, so the only place where the memory type has to be checked afterwards is the TLB entry. And the TLB IS checked on every memory access.
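To make it concrete what "consulting the MTRRs" involves at TLB-fill time, here is a rough C model of the variable-range lookup. It is a sketch only: the fixed-range MTRRs covering the first 1 MiB are ignored, and only the documented overlap rules are modeled (UC wins; WT wins over WB).

```c
#include <stdint.h>

enum mtrr_type { UC = 0, WC = 1, WT = 4, WP = 5, WB = 6 };

struct var_mtrr {
    uint64_t base;   /* IA32_MTRR_PHYSBASEn: bits 12+ = physical base, bits 7:0 = type */
    uint64_t mask;   /* IA32_MTRR_PHYSMASKn: bits 12+ = mask, bit 11 = valid */
};

/* Rough model of the variable-range MTRR lookup the CPU performs when it
 * needs the memory type of a physical address (e.g. while filling a TLB
 * entry).  A range matches when (addr & mask) == (base & mask). */
static enum mtrr_type mtrr_lookup(const struct var_mtrr *m, unsigned n,
                                  uint64_t phys, enum mtrr_type def_type)
{
    int found = 0;
    enum mtrr_type result = def_type;

    for (unsigned i = 0; i < n; i++) {
        if (!(m[i].mask & (1ULL << 11)))          /* range not valid */
            continue;
        uint64_t mask = m[i].mask & ~0xfffULL;
        uint64_t base = m[i].base & ~0xfffULL;
        if ((phys & mask) != (base & mask))       /* address not in this range */
            continue;

        enum mtrr_type t = (enum mtrr_type)(m[i].base & 0xff);
        if (!found || t == UC || (t == WT && result == WB))
            result = t;                           /* on overlap: UC wins, WT beats WB */
        found = 1;
    }
    return result;  /* falls back to IA32_MTRR_DEF_TYPE if nothing matched */
}
```

The point is that the hardware does not have to run this comparison on every load and store: it runs it once when a translation is installed in the TLB and keeps the resulting type next to the translation.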

whether it should look up the cache, look up the write-combining buffer, or send the request directly to the memory controller

No, and this is the subtle part: the caches are looked up for every access, even a UC one (e.g., if a region has just been changed to UC, there may still be a cached copy that has to be evicted).

From chapter 24 (it is about Pentium 4):

Loads from Cacheable Memory

The types of memory that the processor is permitted to cache from are WP, WT and WB memory (as defined by the MTRRs and the PTE or PDE).

When the core dispatches a load μop, the μop is placed in the Load Buffer that was reserved for it in the Allocator stage. The memory data read request is then issued to the L1 Data Cache for fulfillment:

  1. If the cache has a copy of the line that contains the requested read data, the read data is placed in the Load Buffer.
  2. If the cache lookup results in a miss, the request is forwarded upstream to the L2 Cache.
  3. If the L2 Cache has a copy of the sector that contains the requested read data, the read data is immediately placed in the Load Buffer and the sector is copied into the L1 Data Cache.
  4. If the cache lookup results in a miss, the request is forwarded upstream to either the L3 Cache (if there is one) or to the FSB Interface Unit.
  5. If the L3 Cache has a copy of the sector that contains the requested read data, the read data is immediately placed in the Load Buffer and the sector is copied into the L2 Cache and the L1 Data Cache.
  6. If the lookup in the top-level cache results in a miss, the request is forwarded to the FSB Interface Unit.
  7. When the sector is returned from memory, the read data is immediately placed in the Load Buffer and the sector is copied into the L3 Cache (if there is one), the L2 Cache, and the L1 Data Cache.

The processor core is permitted to speculatively execute loads that read data from WC, WP, WT or WB memory space
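The steps quoted above boil down to a lookup chain that fills the lower cache levels on the way back. A toy C model follows (nothing here corresponds to real cache organization; it only illustrates the order of lookups and fills):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of the cacheable-load path: each "cache" is just a tiny
 * direct-mapped tag array, and memory itself is not modeled. */
#define SETS 64
#define LINE_SHIFT 6                       /* 64-byte lines */

struct cache { uint64_t tag[SETS]; bool valid[SETS]; };

static bool lookup(struct cache *c, uint64_t pa)
{
    unsigned set = (pa >> LINE_SHIFT) % SETS;
    return c->valid[set] && c->tag[set] == (pa >> LINE_SHIFT);
}

static void fill(struct cache *c, uint64_t pa)
{
    unsigned set = (pa >> LINE_SHIFT) % SETS;
    c->valid[set] = true;
    c->tag[set]   = pa >> LINE_SHIFT;
}

/* Returns the level that satisfied the load (4 = memory), filling the
 * caches below the hit level on the way back, as in steps 1-7 above. */
static int cacheable_load(struct cache *l1, struct cache *l2,
                          struct cache *l3 /* may be NULL */, uint64_t pa)
{
    if (lookup(l1, pa)) return 1;                     /* step 1: L1 hit   */
    if (lookup(l2, pa)) { fill(l1, pa); return 2; }   /* steps 2-3: L2    */
    if (l3 && lookup(l3, pa)) {                       /* steps 4-5: L3    */
        fill(l2, pa); fill(l1, pa); return 3;
    }
    if (l3) fill(l3, pa);                             /* steps 6-7: memory */
    fill(l2, pa); fill(l1, pa);
    return 4;
}

int main(void)
{
    struct cache l1 = {0}, l2 = {0}, l3 = {0};
    uint64_t pa = 0x1234567;                          /* arbitrary address */
    printf("first load satisfied by level %d\n", cacheable_load(&l1, &l2, &l3, pa));
    printf("second load satisfied by level %d\n", cacheable_load(&l1, &l2, &l3, pa));
    return 0;
}
```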

Loads from Uncacheable Memory

The uncacheable memory types are UC and WC (as defined by the MTRRs and the PTE or PDE).

When the core dispatches a load μop, the read request is placed in the Load Buffer that was reserved for it in the Allocator stage. The memory data read request is submitted to the processor's caches as well. In the event of a cache hit, the cache line is evicted from the cache. The request is issued to the FSB Interface Unit. A Memory Data Read transaction is performed on the FSB to fetch just the requested bytes from memory. When the data is returned from memory, the read data is immediately placed in the Load Buffer.

The processor core is not permitted to speculatively execute loads that read data from UC memory space

Stores to UC Memory

UC is one of the two uncacheable memory types (the other is the WC memory type). When a store to UC memory is executed, it is posted in the Store Buffer reserved for it in the Allocator stage. Stores to UC memory are also submitted to the L1 Data Cache, the L2 Cache, or the L3 Cache (if there is one). In the event of a cache hit, the line is evicted from the cache.

When a Store Buffer containing a store to UC memory is forwarded to the FSB Interface Unit, a Memory Data Write transaction ... is performed on the FSB

Stores to WC Memory

The WC memory type is well-suited to an area of memory (e.g., the video frame buffer) that has the following characteristics:

  • The processor does not cache from WC memory.
  • Speculative execution of loads from WC memory is permitted.
  • Stores to WC memory are deposited in the processor's Write Combining Buffers (WCBs).
  • Each WCB can hold one line (64 bytes of data).
  • As stores are performed to a line of WC memory space, the bytes are accumulated in the WCB assigned to record writes to that line of memory space.
  • A subsequent store to a location in a WCB can overwrite a byte that was deposited in that location by an earlier store to that location. In other words, multiple writes to the same location are collapsed so that the location reflects the last data byte written to that location.
  • When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed. The device being written to must tolerate this type of behavior (i.e., it must function correctly). See "WCB FSB Transactions" on page 1080 for more information.
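The write-combining behavior in the list above can be modeled roughly like this (a toy sketch of a single WCB covering one 64-byte line; the FSB transactions that dump it, and their ordering, are not modeled):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of one Write Combining Buffer (WCB): it covers one 64-byte line,
 * accumulates byte stores (a later store to the same offset overwrites an
 * earlier one), and is eventually dumped as a whole. */
#define WCB_LINE_BYTES 64

struct wcb {
    bool     in_use;
    uint64_t line_addr;                 /* 64-byte-aligned line address */
    uint8_t  data[WCB_LINE_BYTES];
    bool     written[WCB_LINE_BYTES];   /* which bytes have been stored */
};

static void wcb_store(struct wcb *b, uint64_t addr, uint8_t value)
{
    uint64_t line = addr & ~(uint64_t)(WCB_LINE_BYTES - 1);
    if (!b->in_use) {                   /* allocate the buffer for this line */
        memset(b, 0, sizeof *b);
        b->in_use = true;
        b->line_addr = line;
    }
    /* Real hardware would allocate another WCB or flush this one; the model
     * simply ignores stores that miss the buffered line. */
    if (line != b->line_addr) return;

    unsigned off = addr & (WCB_LINE_BYTES - 1);
    b->data[off] = value;               /* a later store collapses an earlier one */
    b->written[off] = true;
}

static void wcb_dump(const struct wcb *b)
{
    printf("flush line %#llx:", (unsigned long long)b->line_addr);
    for (unsigned i = 0; i < WCB_LINE_BYTES; i++)
        if (b->written[i]) printf(" [%u]=%02x", i, b->data[i]);
    printf("\n");
}

int main(void)
{
    struct wcb b = { .in_use = false };
    wcb_store(&b, 0xA0000 + 3, 0x11);   /* hypothetical frame-buffer writes */
    wcb_store(&b, 0xA0000 + 3, 0x22);   /* collapses the previous store     */
    wcb_store(&b, 0xA0000 + 7, 0x33);
    wcb_dump(&b);                       /* only the last value at offset 3 survives */
    return 0;
}
```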

Stores to WT Memory

When a store to cacheable, Write-Through memory is executed, the store is posted in the Store Buffer that was reserved for its use in the Allocator stage. In addition, the store is submitted to the L1 Data Cache for a lookup. There are several possibilities:

  • If the store hits on the Data Cache, the line in the cache is updated, but it remains in the S state (which means the line is valid).
  • If the store misses the Data Cache, it is forwarded to the L2 Cache and a lookup is performed:
    - If it hits on a line in the L2 Cache, the line is updated, but it remains in the S state (which means the line is valid).
    - If it misses on the L2 Cache and there is no L3 Cache, no further action is taken.

answered by osgx