How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

Tags: x86, cpu-cache, intel, memory-access, msr

Every modern high-performance CPU of the x86/x86_64 architecture has some hierarchy of data caches: L1, L2, and sometimes L3 (and L4 in very rare cases), and data moved between the core and main RAM is cached in some of them.

Sometimes a programmer may want some data not to be cached in some or all cache levels (for example, to memset 16 GB of RAM while keeping other data in the cache). There are non-temporal (NT) instructions for this, like MOVNTDQA (https://stackoverflow.com/a/37092, http://lwn.net/Articles/255364/).
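As a rough illustration of the non-temporal approach mentioned above, here is a minimal C sketch of a streaming fill. It uses the SSE2 intrinsic `_mm_stream_si128`, which maps to MOVNTDQ, the store counterpart of the MOVNTDQA load. The helper name `nt_memset` is hypothetical, and the sketch assumes a 16-byte-aligned buffer whose length is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `len` bytes at `dst` with `value` using
 * non-temporal (streaming) stores, which hint the CPU not to keep the
 * written lines in the cache hierarchy.
 * Assumption: dst is 16-byte aligned and len is a multiple of 16. */
static void nt_memset(void *dst, unsigned char value, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(len % 16 == 0);

    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, v);  /* non-temporal 16-byte store (MOVNTDQ) */

    /* NT stores are weakly ordered: fence so they are ordered
     * before any stores that follow. */
    _mm_sfence();
}
```

Note that this only steers individual stores around the cache; it does not change how any other memory access in the system behaves, which is what the question below is about.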

But is there a programmatic way (for some AMD or Intel CPU families like P3, P4, Core, Core i*, ...) to completely, but temporarily, turn off some or all levels of the cache, changing how every memory access instruction (globally, or for some applications or regions of RAM) uses the memory hierarchy? For example: turn off L1 only, or turn off L1 and L2? Or change every memory access type to "uncached" (UC)? The candidates I have found are the CD and NW bits of CR0 (SDM vol. 3A, pages 423-425) and the "Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (Available only in processors based on Intel NetBurst microarchitecture) — Allows the L3 cache to be disabled and enabled, independently of the L1 and L2 caches."

I think this would help protect data from cache side-channel attacks and leaks, such as AES key theft, covert cache channels, and Meltdown/Spectre, although disabling the caches will have an enormous performance cost.

PS: I remember such a program posted many years ago on some technical news website, but can't find it now. It was just a Windows exe to write some magical values into an MSR and make every Windows program running after it very slow. The caches were turned off until reboot or until starting the program with the "undo" option.

asked Jan 20 '18 19:01

osgx

1 Answer

Intel's manual, volume 3A, Section 11.5.3, provides an algorithm to globally disable the caches:

11.5.3 Preventing Caching

To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:

1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.)

2. Flush all caches using the WBINVD instruction.

3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory type (see the discussion of the TYPE field and the E flag in Section 11.11.2.1, “IA32_MTRR_DEF_TYPE MSR”).

The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are not flushed, cache hits on reads will still occur and data will be read from valid cache lines.

The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching control hardware may allow some variation of software implementation of these three requirements. See note below.

NOTES Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to force the effective memory type for all physical memory to be UC nor does it force strict memory ordering, due to hardware implementation variations across different processor families. To force the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.

For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been executed, the cache lines containing the code between the end of the WBINVD instruction and before the MTRRS have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the MTRRs have been disabled.

That's a long quote, but it boils down to this code:

;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30        ; Set bit CD
and eax, ~(1<<29)    ; Clear bit NW
mov cr0, eax

;Step 2 - Write back and invalidate all the caches
wbinvd

;All memory accesses now go to main memory, but UC memory ordering may not be enforced yet.

;For Atom processors, we are done: UC semantics are automatically enforced.

;Step 3 - Disable the MTRRs (E = 0) and set the default memory type to UC (TYPE = 0)
xor eax, eax
xor edx, edx
mov ecx, 0x2FF       ; IA32_MTRR_DEF_TYPE
wrmsr

;P4 only - remove this code from the L1I
wbinvd

Most of these instructions (MOV to/from CR0, WBINVD, WRMSR) are privileged and cannot be executed from user mode.
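To make the bit manipulation in the steps above easier to check, here is a small user-mode C sketch that only computes and inspects register values; it does not touch CR0 or any MSR. The helper names `cr0_no_fill` and `mtrrs_disabled` are hypothetical. The bit positions are the ones used in the quote and the assembly: CD is bit 30 and NW is bit 29 of CR0, and the E flag is bit 11 of IA32_MTRR_DEF_TYPE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_NW (UINT64_C(1) << 29)  /* Not Write-through */
#define CR0_CD (UINT64_C(1) << 30)  /* Cache Disable */

#define MTRR_DEF_TYPE_MASK UINT64_C(0xFF)       /* bits 7:0, default memory type */
#define MTRR_DEF_FE        (UINT64_C(1) << 10)  /* fixed-range MTRRs enable */
#define MTRR_DEF_E         (UINT64_C(1) << 11)  /* MTRRs enable */

/* Step 1: the value to write back to CR0 for no-fill mode (CD = 1, NW = 0).
 * All other CR0 bits are preserved. */
static uint64_t cr0_no_fill(uint64_t cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}

/* Step 3: true if an IA32_MTRR_DEF_TYPE value has the MTRRs disabled
 * (E = 0), in which case all physical memory is treated as UC. */
static bool mtrrs_disabled(uint64_t def_type)
{
    return (def_type & MTRR_DEF_E) == 0;
}
```

For example, `cr0_no_fill(0x80050033)` yields `0xC0050033`, and `mtrrs_disabled(0)` is true, which is why the assembly simply writes zero to the MSR.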

AMD's manual, volume 2, provides a similar algorithm in Section 7.6.2:

7.6.2 Cache Control Mechanisms
The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.

Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is disabled, reads and writes access main memory.

Software can disable the cache while the cache still holds valid data (or instructions). If a read or write hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:

1. Writes the cache line back if it is in the modified or owned state.

2. Invalidates the cache line.

3. Performs a non-cacheable main-memory access to read or write the data.

If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2 and L3 caches is model-dependent, and may vary for different types of memory accesses.

The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is performed on behalf of a memory write or an exclusive read.

Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of CR0.NW and CR0.CD determines the cache operating mode.

[...]

In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating mode established by CR0.CD.

This translates to the following code (very similar to Intel's):

;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30        ; Set bit CD
mov cr0, eax

;For some models we need to invalidate the L1I
wbinvd

;Step 2 - Disable speculative accesses
xor eax, eax
xor edx, edx
mov ecx, 0x2FF       ; MTRRdefType
wrmsr

Caches can also be selectively disabled at:

Page level, with the attribute bits PCD (Page Cache Disable) and PWT (Page Write-Through) [Only for Pentium Pro and Pentium II].
When both are clear, the relevant MTRR is used; if PCD is set, caching is disabled for the page.
Page level, with the PAT (Page Attribute Table) mechanism.
By filling the IA32_PAT MSR with caching types and using the bits PAT, PCD, PWT as a 3-bit index, it's possible to select one of the six caching types (UC-, UC, WC, WT, WP, WB).
Using the MTRRs (fixed or variable).
By setting the caching type to UC or UC- for specific physical areas.

Of these options only the page attributes can be exposed to user mode programs (see for example this).
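The PAT lookup described in the list above can be sketched in plain C. This is an illustrative model only, not kernel code: the helper names `pat_index` and `pat_lookup` are hypothetical, and `IA32_PAT_DEFAULT` is assumed to be the power-on value of the MSR as documented in the Intel SDM (entries PA0..PA7 = WB, WT, UC-, UC, WB, WT, UC-, UC). Each entry occupies one byte of the 64-bit MSR, with the memory type in its low 3 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Memory-type encodings used in IA32_PAT entries. */
enum pat_type {
    PAT_UC       = 0,  /* Uncacheable */
    PAT_WC       = 1,  /* Write Combining */
    PAT_WT       = 4,  /* Write Through */
    PAT_WP       = 5,  /* Write Protected */
    PAT_WB       = 6,  /* Write Back */
    PAT_UC_MINUS = 7   /* UC-, can be overridden by a WC MTRR */
};

/* Assumed power-on default of IA32_PAT:
 * PA0..PA7 = WB, WT, UC-, UC, WB, WT, UC-, UC */
#define IA32_PAT_DEFAULT UINT64_C(0x0007040600070406)

/* Build the 3-bit PAT index from the page-table attribute bits. */
static unsigned pat_index(bool pat, bool pcd, bool pwt)
{
    return ((unsigned)pat << 2) | ((unsigned)pcd << 1) | (unsigned)pwt;
}

/* Return the memory type a page would select from a given IA32_PAT value. */
static unsigned pat_lookup(uint64_t ia32_pat, bool pat, bool pcd, bool pwt)
{
    unsigned idx = pat_index(pat, pcd, pwt);
    assert(idx < 8);
    return (unsigned)((ia32_pat >> (8 * idx)) & 0x7);
}
```

With the assumed default value, a page with PCD = 1 and PWT = 1 selects entry 3 (UC), while a page with all three bits clear selects entry 0 (WB).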

answered Sep 21 '22 06:09

Margaret Bloom
