
Do prefetch instructions need to return their result before they retire?

On recent Intel and AMD CPUs, can a prefetch instruction that has executed, but whose requested line hasn't yet arrived at the designated cache level, still retire?

That is, is retirement of prefetch "blocking" as it appears to be for loads, or is it non-blocking?

asked Sep 18 '18 by BeeOnRope


1 Answer

Regarding Intel processors, no. This is mentioned in the Intel optimization manual Section 7.3.3:

PREFETCH can provide greater performance than preloading because:

  • Has no destination register, it only updates cache lines.
  • Does not complete its own execution if that would cause a fault.
  • Does not stall the normal instruction retirement.
  • Does not affect the functional behavior of the program.
  • Has no cache line split accesses.
  • Does not cause exceptions except when the LOCK prefix is used. The LOCK prefix is not a valid prefix for use with PREFETCH.

The advantages of PREFETCH over preloading instructions are processor specific. This may change in the future.

In addition Section 3.7.1 says:

Software PREFETCH operations work the same way as do load from memory operations, with the following exceptions:

  • Software PREFETCH instructions retire after virtual to physical address translation is completed.
  • If an exception, such as page fault, is required to prefetch the data, then the software prefetch instruction retires without prefetching data.

I've verified both of these points experimentally on Haswell and Broadwell.

[Graph: time per prefetch instruction on Haswell for the four cases described below]

[Graph: time per prefetch instruction on Broadwell (PREFETCH0 and PREFETCHW) for the four cases described below]

All miss TLB: All prefetch instructions miss all MMU and data caches, but the page is in main memory (no minor or major page faults).

All hit TLB: All prefetch instructions hit the L1 TLB and data cache.

Fault different pages: All prefetch instructions miss all MMU and data caches and the page descriptor results in a page fault. Each prefetch instruction accesses a different virtual page.

Fault same page: All prefetch instructions miss all MMU and data caches and the page descriptor results in a page fault. Each prefetch instruction accesses the same virtual page.

For the Broadwell graph, results for both PREFETCH0 and PREFETCHW are shown. PREFETCHW is not supported on Haswell. The frequency on Haswell and Broadwell was fixed to 3.4GHz and 1.7GHz, respectively, and I used the intel_pstate power scaling driver on both. All hardware prefetchers were turned on. Note that the latency of PREFETCHW on a page fault is independent of whether the target page is writeable. A read-only page results in a fault that has the same impact as a fault due to any other reason. Also, my experiments only consider the case where no core has a copy of the cache line.

The 1 cycle throughput is expected because of the 1c dependency chain:

loop:
prefetcht0 (%rax)
add    $0x1000,%rax 
cmp    %rbx,%rax
jne    loop

On Broadwell, the "fault same page" case seems to be slightly slower than the "fault different pages" case, in contrast to Haswell. I don't know why. It may depend on which level of the paging structure contains the invalid entry, i.e., at which level the page walker detects the fault, and that is OS-dependent.

I think the reason prefetch instructions cannot retire immediately on a TLB miss is that the load unit lacks the post-retirement logic that the store unit has. The idea is that, since a demand access to the page will most likely follow the prefetch (which is presumably why the prefetch is there), a stall due to the TLB miss will occur anyway, either on the demand access or on the prefetch. Perhaps stalling on the prefetch is better, especially when the instructions immediately following the prefetch don't access the same page.

In addition, I have verified experimentally that prefetch instructions can retire before the prefetching operation completes by placing an LFENCE after the prefetch instruction and observing that the time per prefetch instruction increases only slightly (the cost of the fence) compared to using a load instead of a prefetch.

Software prefetch instructions on Xeon Phi processors are executed the same way as on Haswell/Broadwell (1), but see also the section on Itanium below.

Section 7.3.3 also says:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
  • An access to the specified address that causes a fault/exception.
  • PREFETCH targets an uncacheable memory region (for example, USWC and UC).
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
  • The LOCK prefix is used. This causes an invalid opcode exception.

The second point has been verified experimentally on Haswell, Broadwell, and Skylake. My code is not capable of detecting the fourth point, which states that prefetch requests can be dropped when running out of LFBs. The AMD results seem to indicate that AMD also drops prefetch requests, but the time per access on AMD is still much smaller than on Intel. I think AMD drops prefetch requests when the TLB fill buffers are full, while Intel drops them when the L1D fill buffers are full. My code never fills the L1D fill buffers, which would explain the AMD vs. Intel results.

The first point says that on the Core2 and Atom microarchitectures and later, software prefetches are not dropped on a TLB miss. According to an older version of the optimization manual, Pentium 4 processors with model number 3 or larger also do not drop a software prefetch on a TLB miss. This may also be the case on the Intel Core microarchitecture and (some) Pentium M processors (I wasn't able to find an Intel source regarding these processors). Pentium III processors and Pentium 4 processors with model number smaller than 3 definitely drop software prefetches on a TLB miss. Processors prior to the Pentium III do not support software prefetching instructions.


Prefetch uops get dispatched to port 2 or 3 and allocated in load buffers. Prefetch uops to the same cache line don't get combined. That is, each uop will get its own load buffer. I think (but I have not verified experimentally) that ROB entries are allocated for prefetch uops. It's just that the ROB never stalls on prefetch uops, as long as they've been dispatched to a load port.

The prefetch request itself (sent to L1d or outer levels of cache) isn't something the prefetch uop has to wait for before being marked as complete in the ROB and ready to retire, unlike a regular load.


There is an interesting 2011 patent that discusses an enhancement to software prefetching on Itanium2 processors. It mentions that previous Itanium processors had to stall when a software prefetch missed the TLB because they were designed to not drop any software prefetch requests and later instructions could not proceed past it because they were in-order processors. The patent proposed a design that allows software prefetching requests to execute out-of-order with respect to later instructions without dropping them. This is done by adding a data prefetch queue (DPQ) which is used to queue up software prefetch requests that miss the TLB. A prefetch in the DPQ is then re-issued after the hardware page table walk completes. In addition, multiple hardware page table walkers are added to potentially allow later demand accesses to execute even if they miss the TLB. However, if the DPQ fills up with prefetch instructions, the pipeline stalls on the next prefetch instruction. Also according to the patent, software prefetch requests are not dropped even on page faults. This is in contrast to big cores and Xeon Phi. The patent also discusses the hardware prefetchers implemented in Itanium.

In out-of-order big core microarchitectures, the load buffer naturally plays the role of the DPQ. I don't know whether Xeon Phi has such a structure.


The AMD optimization manual Section 5.6 says the following:

The prefetch instructions can be affected by false dependencies on stores. If there is a store to an address that matches a request, that request (the prefetch instruction) may be blocked until the store is written to the cache. Therefore, code should prefetch data that is located at least 64 bytes away from any surrounding store’s data address.

I was curious enough to test this on Intel processors (on Haswell) by putting two prefetch instructions and one store instruction (followed by a dummy add rax, rax), and I've observed the following:

  • UOPS_RETIRED.STALL_CYCLES is significantly larger than core cycle count, which makes no sense.
  • The total number of uops dispatched to ports 2 and 3 is about 16% higher than expected. This indicates that prefetch uops are being replayed.
  • RESOURCE_STALLS.ANY reports basically no stalls. This is in contrast to the case where there are two prefetch instructions followed by two dummy ALU instructions (the pipeline stalls on the load buffers).

However, I've observed these effects only when the store is to the same 4K page as the prefetch instructions. If the store is to a different page, the code works similar to the one with the two dummy ALUs. So it seems that stores interact with prefetch instructions on Intel processors.


(1) But they interact differently with hardware prefetchers. However, this is a post-retirement effect.

(2) Itanium is a family of IA-64 processors, so it's not exactly relevant to the question.

answered Sep 22 '22 by Hadi Brais