I am reading about the different prefetchers available in Intel Core i7 systems. I have performed experiments to understand when these prefetchers are invoked.
These are my findings:
The L1 IP prefetcher starts prefetching after 3 cache misses. It only prefetches on a cache hit.
The L2 adjacent-line prefetcher starts prefetching after the 1st cache miss and prefetches on a cache miss.
The L2 H/W (stride) prefetcher starts prefetching after the 1st cache miss and prefetches on a cache hit.
I am not able to understand the behavior of the DCU prefetcher. When is it invoked, i.e. when does it start prefetching? Does it prefetch the next cache line on a cache hit or on a miss?
I have looked at Intel's disclosure-of-hw-prefetcher document, which says that the DCU prefetcher fetches the next cache line into the L1-D cache, but it gives no clear information about when prefetching starts.
Can anyone explain when the DCU prefetcher starts prefetching?
The DCU prefetcher does not prefetch lines in a deterministic manner. It appears to have a confidence value associated with each potential prefetch request, and the prefetch is triggered only if the confidence is larger than some threshold. Moreover, it seems that if both L1 prefetchers are enabled, only one of them can issue a prefetch request in the same cycle. Perhaps the prefetch from the one with higher confidence is accepted. The answer below does not take these observations into consideration. (A lot more experimentation work needs to be done. I will rewrite it in the future.)
The Intel manual tells us a few things about the DCU prefetcher. Section 2.4.5.4 and Section 2.5.4.2 of the optimization manual both say the following:
Data cache unit (DCU) prefetcher -- This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
Note that Section 2.4.5.4 is part of the section on Sandy Bridge and Section 2.5.4.2 is part of the section on Intel Core. The DCU prefetcher was first supported on the Intel Core microarchitecture and it's also supported on all later microarchitectures. As far as I know, there is no indication that the DCU prefetcher has changed over time. So I think it works exactly the same on all microarchitectures up to Skylake at least.
That quote doesn't really say much. The "ascending access" part suggests that the prefetcher is triggered by multiple accesses with increasing offsets. The "recently loaded data" part is vague. It may refer to one or more lines that immediately precede the line to be prefetched in the address space. It's also not clear whether that refers to virtual or physical addresses. The "fetches the next line" part suggests that it fetches only a single line every time it's triggered and that line is the line that succeeds the line(s) that triggered the prefetch.
I've conducted some experiments on Haswell with all prefetchers disabled except for the DCU prefetcher. I've also disabled hyperthreading. This enables me to study the DCU prefetcher in isolation. The results show the following:
prefetchnta) or a combination of both. The accesses can be either hits or misses in the L1D or a combination of both. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages. For example, consider the following three demand load misses: 0xF1000, 0xF2008, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF1040, 0xF2040, 0xF3040, and 0xF4040. So the accesses that trigger the prefetcher don't have to be "ascending" or follow any order. The cache line offset itself seems to be ignored by the prefetcher. Only the physical page number matters.
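To make the example addresses easier to read, here is a small sketch that splits an address into page number, L1D set index, and byte offset within the line. The L1D geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) is my assumption for Haswell, and `decompose` is just an illustrative helper name.

```python
LINE_BYTES = 64    # cache line size
PAGE_BYTES = 4096  # 4 KiB page
L1D_SETS = 64      # assumption: 32 KiB, 8-way L1D with 64-byte lines

def decompose(addr):
    """Split an address into (page number, L1D set index, byte offset in line)."""
    page = addr // PAGE_BYTES
    set_index = (addr // LINE_BYTES) % L1D_SETS
    offset = addr % LINE_BYTES
    return page, set_index, offset

# The three demand loads from the example above.
for a in (0xF1000, 0xF2008, 0xF3004):
    page, set_index, offset = decompose(a)
    print(f"{a:#x}: page={page:#x} set={set_index} offset={offset}")
```

Under that assumed geometry, all three loads fall in different pages but the same cache set, and they differ only in the byte offset within the line.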
I think the DCU prefetcher has a fully associative buffer that contains 4 entries. Each entry is tagged with the (probably physical) page number and has a valid bit to indicate whether the entry contains a valid page number. In addition, each cache set of the L1D is associated with a 2-bit saturating counter that is incremented whenever a demand load or a software prefetch request accesses the corresponding cache set and the dirty flag of the accessed page is not set. When the counter reaches a value of 3, the prefetcher is triggered. The prefetcher already has the physical page numbers from which it needs to prefetch; it can obtain them from the buffer entry that corresponds to the counter. So it can immediately issue prefetch requests to the next cache lines for each of the pages being tracked by the buffer. However, if a fill buffer is not available for a triggered prefetch request, the prefetch will be dropped. Then the counter will be reset to zero. Page tables might be modified though. It's possible that the prefetcher flushes its buffer whenever the TLB is flushed.
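The structure guessed at above can be written down as a toy model. This is only a sketch of my hypothesis, not of the real hardware: the class name `DcuModel`, the LRU replacement of tracked pages, the per-page "last accessed line" field, and resetting the counter after every trigger are all assumptions, and it models demand loads only (no stores, dirty pages, or fill-buffer pressure).

```python
from collections import OrderedDict

LINE = 64
PAGE = 4096
SETS = 64        # assumed L1D geometry: 32 KiB, 8-way, 64-byte lines
MAX_PAGES = 4    # number of pages tracked, per the experiments
THRESHOLD = 3    # 2-bit counter value at which the prefetcher triggers

class DcuModel:
    def __init__(self):
        # page number -> index of the last accessed line in that page
        self.pages = OrderedDict()
        self.counters = [0] * SETS  # one saturating counter per cache set

    def load(self, addr):
        """Record a demand load; return the line addresses prefetched, if any."""
        page, line = addr // PAGE, (addr % PAGE) // LINE
        if page in self.pages:
            self.pages.move_to_end(page)
        elif len(self.pages) == MAX_PAGES:
            self.pages.popitem(last=False)  # assumption: evict the LRU page
        self.pages[page] = line
        s = (addr // LINE) % SETS
        self.counters[s] = min(self.counters[s] + 1, THRESHOLD)
        if self.counters[s] < THRESHOLD:
            return []
        self.counters[s] = 0  # assumption: reset after every trigger
        # Next line in every tracked page, never crossing the page boundary.
        return [p * PAGE + (l + 1) * LINE
                for p, l in self.pages.items() if l + 1 < PAGE // LINE]

m = DcuModel()
m.pages[0xF4] = 0  # seed the fourth tracked page, as the example assumes
out = []
for a in (0xF1000, 0xF2008, 0xF3004):
    out = m.load(a)
print([hex(x) for x in sorted(out)])
# prints ['0xf1040', '0xf2040', '0xf3040', '0xf4040']
```

The third load saturates the counter for set 0, so the model issues one next-line prefetch for each of the four tracked pages, matching the example addresses given earlier.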
It could be the case that there are two DCU prefetchers, one for each logical core. When hyperthreading is disabled, one of the prefetchers would be disabled too. It could also be the case that the 4 buffer entries that contain the page numbers are statically partitioned between the two logical cores and combined when hyperthreading is disabled. I don't know for sure, but such a design makes sense to me. Another possible design would be for each prefetcher to have a dedicated 4-entry buffer. It's not hard to determine how the DCU prefetcher works when hyperthreading is enabled. I just didn't spend the effort to study it.
All in all, the DCU prefetcher is by far the simplest among the 4 data prefetchers that are available in modern high-performance Intel processors. It seems that it's only effective when sequentially, but slowly, accessing small chunks of read-only data (such as read-only files and statically initialized global arrays) or accessing multiple read-only objects at the same time that may contain many small fields and span a few consecutive cache lines within the same page.
Section 2.4.5.4 also provides additional information on L1D prefetching in general, so it applies to the DCU prefetcher.
Data prefetching is triggered by load operations when the following conditions are met:
- Load is from writeback memory type.
This means that the DCU prefetcher will not track accesses to the WP and WT cacheable memory types.
- The prefetched data is within the same 4K byte page as the load instruction that triggered it.
This has been verified experimentally.
- No fence is in progress in the pipeline.
I don't know what this means. See: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/805373.
- Not many other load misses are in progress.
There are only 10 fill buffers that can hold requests that missed the L1D. This raises a question, though: if only a single fill buffer were available, would the hardware prefetcher use it or leave it for anticipated demand accesses? I don't know.
- There is not a continuous stream of stores.
This suggests that if there is a stream of a large number of stores intertwined with few loads, the L1 prefetcher will ignore the loads and basically temporarily switch off until the stores become a minority. However, my experimental results show that even a single store to a page will turn the prefetcher off for that page.
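The same-page condition in the list above is easy to state as code. This is a minimal sketch with a hypothetical helper name, `next_line_prefetch`, that returns the next-line address only when it stays inside the 4 KiB page of the triggering load.

```python
PAGE = 4096
LINE = 64

def next_line_prefetch(addr):
    """Return the address of the next cache line, or None if it is in another 4 KiB page."""
    nxt = (addr // LINE + 1) * LINE
    return nxt if nxt // PAGE == addr // PAGE else None

print(next_line_prefetch(0xF1000))  # next line is in the same page
print(next_line_prefetch(0xF1FC8))  # last line of the page: no prefetch
```

A load to the last line of a page therefore never produces a next-line prefetch under this rule.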
All Intel Atom microarchitectures have the DCU prefetcher, although the prefetcher might track fewer than 4 pages in these microarchitectures.
All Xeon Phi microarchitectures up to and including Knights Landing don't have the DCU prefetcher. I don't know about later Xeon Phi microarchitectures.
AFAIK, Intel CPUs don't have an L1 adjacent-line prefetcher.
They have one in L2, though, which tries to complete a 128-byte aligned pair of 64-byte cache lines. (So it's not necessarily the next line; it could be the previous line if the demand miss or other prefetch that caused one line to be cached was for the high half of a pair.)
See also https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/714832, and the many "related" links here on SO, e.g. prefetching data at L1 and L2. Not sure if either of those has any more details than the prefetch section of Intel's optimization manual, though: https://software.intel.com/en-us/articles/intel-sdm#optimization
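As a small illustration of the pairing, the other line of a 128-byte aligned pair can be found by flipping bit 6 of the line address. The helper name `buddy_line` is mine; this only shows the address arithmetic, not how the hardware decides whether to prefetch.

```python
LINE = 64

def buddy_line(addr):
    """Address of the other 64-byte line in the same 128-byte aligned pair."""
    return (addr & ~(LINE - 1)) ^ LINE

print(hex(buddy_line(0x1000)))  # low half of a pair -> the next line
print(hex(buddy_line(0x1048)))  # high half of a pair -> the previous line
```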
I'm not sure if it has any heuristic to avoid wasting bandwidth and cache footprint when only one of a pair of lines is needed, other than not prefetching when there are enough demand misses outstanding.