
Prefetch for Intel Core 2 Duo

Has anyone had experience using prefetch instructions for the Core 2 Duo processor?

I've been using the (standard?) prefetch set (prefetchnta, prefetcht1, etc.) successfully on a series of P4 machines, but when I run the same code on a Core 2 Duo, the prefetcht(i) instructions seem to do nothing and the prefetchnta instruction is less effective.

My criterion for assessing performance is the timing of a BLAS level 1 vector-vector (axpy) operation, with vectors large enough that the data does not fit in cache.

Have Intel introduced new prefetch instructions?
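
For readers who want to reproduce this kind of measurement, here is a minimal sketch of an out-of-cache axpy timing baseline. It is not the asker's code: the names `axpy_plain` and `time_axpy` are hypothetical, and it assumes a single-threaded run timed with the standard C `clock()` function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Reference axpy with no software prefetch: y[i] += a * x[i]. */
void axpy_plain(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Hypothetical helper: average CPU seconds per axpy call on vectors of
   n doubles, or -1.0 if allocation fails. For out-of-cache behaviour,
   pick n so that 2 * n * sizeof(double) is well beyond the size of the
   largest cache on the machine under test. */
double time_axpy(size_t n, int reps)
{
    assert(n > 0 && reps > 0);

    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) {
        free(x);
        free(y);
        return -1.0;
    }
    for (size_t i = 0; i < n; ++i) {
        x[i] = 1.0;
        y[i] = 0.0;
    }

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        axpy_plain(n, 0.5, x, y);
    clock_t t1 = clock();

    /* Read one result so the compiler cannot discard the timed loop. */
    volatile double sink = y[n / 2];
    (void)sink;

    free(x);
    free(y);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Swapping `axpy_plain` for a prefetching variant and comparing the two timings on each machine gives the kind of comparison described above.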

Darren Engwirda asked Nov 16 '09 13:11




2 Answers

See pages 77 and 163 of an Intel reference document on the Intel 64 and IA-32 architectures:

Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture introduced hardware prefetching in addition to software prefetching. The hardware prefetcher operates transparently to fetch data and instruction streams from memory without requiring programmer intervention. Subsequent microarchitectures continue to improve and add features to the hardware prefetching mechanisms. Earlier implementations of hardware prefetching mechanisms focus on prefetching data and instruction from memory to L2; more recent implementations provide additional features to prefetch data from L2 to L1. In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 independent streams.

The Pentium M processor also provides a hardware prefetcher for data. It can track 12 separate streams in the forward direction and 4 streams in the backward direction. The processor’s PREFETCHNTA instruction also fetches 64 bytes into the first-level data cache without polluting the second-level cache.

Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data than Pentium M processors. Key differences are summarized in Table 2-10.

Yannick Motton answered Oct 17 '22 21:10


I don't know whether this is an issue with your code, but keep in mind that the cache line size (which determines the stride to use with prefetch instructions) can differ between processors. If code tuned for one cache line size runs on a CPU with a different line size, the prefetches land at the wrong stride and performance is likely to suffer.
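
To illustrate the stride point, here is a minimal sketch of an axpy loop that takes the cache line size as a parameter instead of hard-coding it. The function name `axpy_prefetch` and the `lines_ahead` tuning knob are assumptions for illustration. It uses the GCC/Clang `__builtin_prefetch` builtin; on x86, GCC typically lowers locality 0 to `prefetchnta`, but check the generated assembly for your compiler.

```c
#include <assert.h>
#include <stddef.h>

/* axpy with software prefetch: y[i] += a * x[i].
   line_bytes  : assumed cache line size in bytes (for example 64).
   lines_ahead : how many cache lines ahead of the current position to
                 prefetch; a tuning parameter, not a recommended value. */
void axpy_prefetch(size_t n, double a, const double *x, double *y,
                   size_t line_bytes, size_t lines_ahead)
{
    assert(line_bytes >= sizeof(double));
    assert(line_bytes % sizeof(double) == 0);

    const size_t per_line = line_bytes / sizeof(double); /* doubles per line */
    const size_t dist = lines_ahead * per_line;          /* prefetch distance */

    size_t i = 0;
    for (; i + per_line <= n; i += per_line) {
        /* Issue one prefetch per array per cache line, and only for
           addresses that are still inside the arrays. */
        if (i + dist < n) {
            __builtin_prefetch(&x[i + dist], 0, 0);
            __builtin_prefetch(&y[i + dist], 0, 0);
        }
        for (size_t j = 0; j < per_line; ++j)
            y[i + j] += a * x[i + j];
    }
    /* Remainder that does not fill a whole line. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

If `line_bytes` is smaller than the real line size, the loop issues redundant prefetches for the same line; if it is larger, some lines are never prefetched. Either mismatch could produce the kind of slowdown described here.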

A related question asks how to determine the prefetch cache line size.
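
One way to get the line size at run time, as a hedged sketch: on Linux with glibc, `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)` reports the L1 data cache line size. That constant is a glibc extension and may be missing or return 0 on other systems, so the sketch falls back to an assumed 64 bytes. The function names `cache_line_bytes` and `prefetch_stride_elems` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes if the system reports
   it, otherwise an assumed fallback of 64 bytes. */
size_t cache_line_bytes(void)
{
    const size_t fallback = 64; /* assumption, not a queried value */
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long v = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (v > 0)
        return (size_t)v;
#endif
    return fallback;
}

/* Number of elements of size elem_size that fit in one cache line,
   i.e. the loop stride between successive prefetches (at least 1). */
size_t prefetch_stride_elems(size_t elem_size)
{
    assert(elem_size > 0);
    size_t per_line = cache_line_bytes() / elem_size;
    return per_line > 0 ? per_line : 1;
}
```

For example, `prefetch_stride_elems(sizeof(double))` gives the number of doubles to step over between prefetches on the machine the code is running on.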

PhiS answered Oct 17 '22 20:10