Software prefetching across page boundary on x86

My understanding is that hardware prefetching will never cross page boundaries. I'm wondering whether a software prefetch has the same restriction, i.e. can I use a software prefetch to avoid a future TLB miss? From searching around, it appears to be possible, but I couldn't find anything definitive in the documentation, so a reference would be appreciated.

I'm specifically interested in Nehalem, Sandy Bridge and Westmere.

asked Feb 08 '13 22:02

jmetcalfe

2 Answers

According to Intel's Optimization Reference Manual, it depends on the processor. From section 7.4.3:

There are cases where a PREFETCH will not perform the data prefetch. These include:

PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.

An access to the specified address that causes a fault/exception.

Software prefetching may or may not avoid TLB misses, depending on the processor. It will not fetch the data if it would cause a page fault.

If you want to ensure you avoid TLB misses, you could do a dummy read of the data instead of using a prefetch instruction. Unlike a prefetch, a real load can cause a page fault that swaps in the page, which could be either good or bad depending on your use case.
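To make the difference concrete, here is a minimal sketch, not a definitive implementation. It assumes GCC or Clang (for `__builtin_prefetch`) and 4 KiB pages; `PAGE_BYTES`, `prefetch_hint`, `dummy_read` and `touch_pages` are names made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u /* assumption: 4 KiB pages */

/* Hint only: never faults, and on some processors it is dropped on a DTLB miss. */
static inline void prefetch_hint(const void *p)
{
    __builtin_prefetch(p, 0 /* read */, 3 /* high temporal locality */);
}

/* Real load: the translation must be resolved, and it can page-fault. */
static inline uint8_t dummy_read(const void *p)
{
    return *(const volatile uint8_t *)p;
}

/* Load one byte every PAGE_BYTES bytes, plus the last byte, so that every
   page overlapping [buf, buf + len) is read at least once.
   Returns the number of loads issued. */
static size_t touch_pages(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    size_t loads = 0;

    for (size_t off = 0; off < len; off += PAGE_BYTES) {
        (void)dummy_read(p + off);
        loads++;
    }
    if (len != 0) { /* an unaligned buffer may end in one more page */
        (void)dummy_read(p + len - 1);
        loads++;
    }
    return loads;
}
```

The `volatile` cast is there so the compiler cannot discard the otherwise unused load. Note that `touch_pages` must only be given memory that is valid to read, whereas `prefetch_hint` is safe on any address.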

answered Sep 20 '22 01:09

ughoavgfhw

On the processors you mention (Nehalem, Westmere and Sandy Bridge), a software prefetch that misses the DTLB is not dropped: it triggers a page walk, so it can fetch data across a page boundary.

From the Intel optimization guide (section 7.3.3):

In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
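As an illustration of how you might use this, here is a rough sketch of a streaming loop that prefetches one page ahead, so the prefetch address regularly lands in a page that has not been touched yet. It assumes GCC or Clang (`__builtin_prefetch`), 64-byte cache lines and 4 KiB pages; `PREFETCH_AHEAD` and `sum_with_prefetch` are hypothetical names, and the distance is a guess that would need tuning by measurement.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4096u /* assumption: one 4 KiB page ahead */

/* Sum a byte array while hinting at data one page ahead of the read position. */
static uint64_t sum_with_prefetch(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        /* Once every 64 bytes (one cache line), prefetch the address
           PREFETCH_AHEAD bytes further on, staying inside the array. */
        if ((i & 63) == 0 && i + PREFETCH_AHEAD < len)
            __builtin_prefetch(data + i + PREFETCH_AHEAD, 0, 3);
        sum += data[i];
    }
    return sum;
}
```

Whether this actually hides the TLB miss depends on the processor, as the quote says, so it is worth checking with performance counters before relying on it.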

answered Sep 22 '22 01:09

jleahy

Tags: x86, prefetch, tlb, nehalem