Has anyone had experience using prefetch instructions for the Core 2 Duo processor?
I've been using the (standard?) prefetch set (prefetchnta, prefetcht1, etc.) with success on a series of P4 machines, but when running the code on a Core 2 Duo it seems that the prefetcht(i) instructions do nothing, and that the prefetchnta instruction is less effective.
My criterion for assessing performance is the timing of a BLAS 1 vector-vector (axpy) operation, with vectors large enough that the data does not fit in cache.
Have Intel introduced new prefetch instructions?
The last-level (L2) caches contain hardware stream prefetchers that are trained on streams of misses and software prefetches. If a hardware prefetcher detects a pattern in the misses it sees, it will begin prefetching future addresses in that pattern.
What Does Prefetching Mean? Prefetching is the loading of a resource before it is required to decrease the time waiting for that resource. Examples include instruction prefetching where a CPU caches data and instruction blocks before they are executed, or a web browser requesting copies of commonly accessed web pages.
The prefetch unit is the portion of a CPU that reads instructions from memory and presents them to the rest of the CPU for execution.
From an Intel reference document on Intel 64 and IA-32 Architectures (see pages 163 and 77):
Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture introduced hardware prefetching in addition to software prefetching. The hardware prefetcher operates transparently to fetch data and instruction streams from memory without requiring programmer intervention. Subsequent microarchitectures continue to improve and add features to the hardware prefetching mechanisms. Earlier implementations of hardware prefetching mechanisms focus on prefetching data and instruction from memory to L2; more recent implementations provide additional features to prefetch data from L2 to L1. In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 independent streams.
The Pentium M processor also provides a hardware prefetcher for data. It can track 12 separate streams in the forward direction and 4 streams in the backward direction. The processor’s PREFETCHNTA instruction also fetches 64 bytes into the first-level data cache without polluting the second-level cache.
Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data than Pentium M processors. Key differences are summarized in Table 2-10.
I don't know whether it might be an issue with your code, but consider that the cache line size (which determines the stride to use with prefetch instructions) may vary between processors. If your code is optimised for one cache line size and runs on a CPU with a different one, it will either issue redundant prefetches for the same line or skip lines entirely, and performance is bound to deteriorate.
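To make the stride point concrete, here is a minimal sketch in C of an axpy loop that issues one prefetch per cache line rather than one per element. The function name axpy_prefetch and the line_bytes and lines_ahead parameters are my own illustration, not something from the question. __builtin_prefetch is a GCC/Clang builtin, and on x86 a locality hint of 0 is typically lowered to prefetchnta. The prefetch distance is an assumption you would need to tune per machine.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): y[i] += a * x[i],
 * issuing one software prefetch per cache line of each vector.
 * line_bytes   - assumed cache line size in bytes (e.g. 64)
 * lines_ahead  - how many lines ahead of the current one to prefetch */
void axpy_prefetch(size_t n, double a, const double *x, double *y,
                   size_t line_bytes, size_t lines_ahead)
{
    /* Elements per cache line; fall back to 1 if line_bytes is too small. */
    size_t per_line = line_bytes / sizeof(double);
    if (per_line == 0)
        per_line = 1;

    /* Prefetch distance, measured in elements. */
    size_t dist = per_line * lines_ahead;

    for (size_t i = 0; i < n; i += per_line) {
        /* Only prefetch addresses that lie inside the vectors. */
        if (i + dist < n) {
            __builtin_prefetch(&x[i + dist], 0, 0); /* read, non-temporal hint */
            __builtin_prefetch(&y[i + dist], 1, 0); /* write, non-temporal hint */
        }

        /* Process the block of elements that shares one cache line. */
        size_t end = (i + per_line < n) ? i + per_line : n;
        for (size_t j = i; j < end; ++j)
            y[j] += a * x[j];
    }
}
```

A call such as axpy_prefetch(n, a, x, y, 64, 8) assumes a 64-byte line. Whether it beats the plain loop on a Core 2 Duo depends on how well the hardware prefetcher already covers the stream, so time both versions.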
This question here asked how to determine prefetch cache line size.
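As one possible approach on Linux with glibc, the line size can be queried at run time through sysconf. This is only a sketch under that assumption: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension rather than part of POSIX, it may report 0 on some systems, and the helper name dcache_line_size and the 64-byte fallback are my own choices.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: return the L1 data cache line size in bytes,
 * or an assumed default of 64 when the system cannot report it. */
size_t dcache_line_size(void)
{
    long sz = -1;

#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; returns 0 or -1 when the value is unknown. */
    sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif

    /* Assumption: 64 bytes is a reasonable fallback on recent x86 CPUs. */
    return (sz > 0) ? (size_t)sz : 64;
}
```

The result can then be used as the prefetch stride instead of a hard-coded constant.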