Some CPUs and compilers provide prefetch instructions, e.g. __builtin_prefetch in GCC. The GCC documentation does describe it, but the description is too brief for me.
I want to know: in practice, when should we use prefetch? Are there some examples?
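For reference, the example in the GCC documentation prefetches a fixed distance ahead in two arrays; it is lightly adapted here into a self-contained function (the prefetch distance j is a parameter):

/* Adapted from the GCC manual's __builtin_prefetch example.
   Signature: __builtin_prefetch(addr, rw, locality) - rw: 0 = read,
   1 = write; locality: 0 (no reuse) .. 3 (high reuse); both are
   optional compile-time constants. Prefetching past the arrays' ends
   is safe: the builtin never faults. */
void add_arrays(int *a, const int *b, int n, int j) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        __builtin_prefetch(&a[i + j], 1, 1);  /* a[] will be written */
        __builtin_prefetch(&b[i + j], 0, 1);  /* b[] is only read */
    }
}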
The goal of prefetching is to make data available in the cache before the data consumer places its request, thereby masking the latency of the slower data source below the cache.
Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').
With prefetching, a process takes a performance hit at startup as the files it uses are preloaded into RAM; depending on the speed and performance of the disk, this delay can be noticeable.
Proactively prefetching data brings the data into the cache before the actual requests occur. Passively caching data, on the other hand, only fetches the missed data from the backend storage after the requests arrive. There is a trade-off between prefetching and caching.
This question isn't really about compilers, as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats, but you can ignore all of them and (carefully) add prefetches directly in assembly code.
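For instance, a minimal sketch (assuming x86-64 and GCC/Clang inline-asm syntax) that emits a prefetch directly, with no intrinsic at all:

/* Emit an x86 prefetcht0 for the cache line containing p. */
static inline void raw_prefetch(const void *p) {
    __asm__ __volatile__("prefetcht0 %0" : : "m"(*(const char *)p));
}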
Now the real question seems to be "when are prefetches useful?", and the answer is: in any scenario where you're bound on memory latency and the access pattern isn't regular and distinguishable enough for the HW prefetcher to capture (organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
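A classic case of such a pattern is a data-dependent (gather-style) access through an index array, which no stream/stride prefetcher can predict; here is a minimal sketch (the function and the prefetch distance are illustrative, not from any particular codebase):

/* The HW prefetcher cannot guess data[idx[i]] since the addresses are
   data-dependent, but idx[] itself streams, so looking ahead is cheap. */
long sum_indirect(const int *data, const int *idx, long n) {
    const long DIST = 16;                 /* prefetch distance - tune it */
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        if (i + DIST < n)
            __builtin_prefetch(&data[idx[i + DIST]]);
        sum += data[idx[i]];
    }
    return sum;
}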
Most compilers only very seldom insert prefetches on their own, so it's basically up to you to play with your code and benchmark whether prefetches are useful.
The link by @Mysticial shows a nice example, but here's a more straightforward one that I think can't be caught by the HW:
#include "stdio.h"
#include "sys/timeb.h"
#include "emmintrin.h"
#define N 4096
#define REP 200
#define ELEM int
int main() {
int i,j, k, b;
const int blksize = 64 / sizeof(ELEM);
ELEM __attribute ((aligned(4096))) a[N][N];
for (i = 0; i < N; ++i) {
for (j = 0; j < N; ++j) {
a[i][j] = 1;
}
}
unsigned long long int sum = 0;
struct timeb start, end;
unsigned long long delta;
ftime(&start);
for (k = 0; k < REP; ++k) {
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j ++) {
sum += a[i][j];
}
}
}
ftime(&end);
delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
printf ("Prefetching off: N=%d, sum=%lld, time=%lld\n", N, sum, delta);
ftime(&start);
sum = 0;
for (k = 0; k < REP; ++k) {
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j += blksize) {
for (b = 0; b < blksize; ++b) {
sum += a[i][j+b];
}
_mm_prefetch(&a[i+1][j], _MM_HINT_T2);
}
}
}
ftime(&end);
delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
printf ("Prefetching on: N=%d, sum=%lld, time=%lld\n", N, sum, delta);
}
What I do here is traverse each matrix line (enjoying the HW prefetcher's help with the consecutive accesses), but prefetch ahead the element with the same column index from the next line, which resides in a different page (and which the HW prefetcher will be hard pressed to catch). I sum the data just so that it's not optimized away; the important thing is that I basically just loop over a matrix. It should have been pretty straightforward and simple to detect, and yet I still get a speedup.
Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:
Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on: N=4096, sum=3355443200, time=1502
Note that the speedup is obtained even though I made the control flow more complicated (an extra loop nesting level); the branch predictor should easily catch the pattern of that short block-size loop, and it saves the execution of unneeded prefetches.
Note that Ivy Bridge and onward should have a "next-page prefetcher", so the HW may be able to mitigate that on those CPUs (if anyone has one available and cares to try, I'll be happy to know). In that case, I'd modify the benchmark to sum every second line (the prefetch would then look ahead two lines each time); that should confuse the hell out of the HW prefetchers.
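For anyone who wants to try that, here is a sketch of the modified timing loop (a drop-in replacement for the "Prefetching on" loop in the program above; untested on Ivy Bridge for lack of hardware):

/* Visit every second row so plain next-page prefetching doesn't help;
   prefetch two rows ahead, i.e. the row we will actually visit next. */
for (k = 0; k < REP; ++k) {
    for (i = 0; i < N; i += 2) {
        for (j = 0; j < N; j += blksize) {
            for (b = 0; b < blksize; ++b) {
                sum += a[i][j + b];
            }
            _mm_prefetch((const char *)&a[i + 2][j], _MM_HINT_T2);
        }
    }
}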
Skylake results
Here are some results from a Skylake i7-6700HQ, running at 2.6 GHz (no turbo), with gcc:
Compile flags: -O3 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on: N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on: N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on: N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on: N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on: N=4096, sum=28147495993344000, time=1253
Compile flags: -O2 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on: N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on: N=4096, sum=28147495993344000, time=1814
So using prefetch is either about 40% slower or 8% faster, depending on whether you use -O3 or -O2 respectively, for this particular example. The big slowdown at -O3 is actually due to a code generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant loop prevents vectorization (on my version of gcc, anyway).
So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still, it's worth noting that you have to be careful not to change code generation in a way that causes a big slowdown.
This test probably isn't ideal in that going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).
On recent Intel chips one reason you apparently might want to use prefetching is to avoid CPU power-saving features artificially limiting your achieved memory bandwidth. In this scenario, simple prefetching can as much as double your performance versus the same code without prefetching, but it depends entirely on the selected power management plan.
I ran a simplified version (code here) of the test in Leeor's answer, which stresses the memory subsystem a bit more (since that's where prefetch will help, hurt, or do nothing). The original test stressed the CPU in parallel with the memory subsystem since it added together every int on each cache line. Since typical memory read bandwidth is in the region of 15 GB/s, that's 3.75 billion integers per second, putting a pretty hard cap on the maximum speed (code that isn't vectorized will usually process 1 int or less per cycle, so a 3.75 GHz CPU will be about equally CPU- and memory-bound).
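The linked code isn't reproduced here, but the gist - touching each cache line only once so that memory rather than the core is the bottleneck - might look something like this hypothetical sketch:

/* Hypothetical sketch, not the exact linked code: read one int per
   64-byte cache line, prefetching a fixed number of lines ahead.
   Prefetching past the end of the buffer is harmless: it never faults. */
long touch_lines(const int *buf, long n_ints) {
    const long stride = 64 / sizeof(int); /* one element per cache line */
    long sum = 0;
    for (long i = 0; i < n_ints; i += stride) {
        __builtin_prefetch(&buf[i + 16 * stride]);  /* ~16 lines ahead */
        sum += buf[i];
    }
    return sum;
}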
First, I got results that seemed to show prefetching kicking butt on my i7-6700HQ (Skylake):
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=204, MiB/s=12549
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=200, MiB/s=12800
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=201, MiB/s=12736
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=197, MiB/s=12994
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Eyeballing the numbers, prefetching achieves something a bit above 16 GiB/s, while without it we get only about 12.5 GiB/s, so prefetch is increasing speed by about 30%. Right?
Not so fast. Remembering that the powersaving mode has all sorts of wonderful interactions on modern chips, I changed my Linux CPU governor to performance from the default of powersave1. Now I get:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=155, MiB/s=16516
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=152, MiB/s=16842
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=159, MiB/s=16100
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=163, MiB/s=15705
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=161, MiB/s=15900
It's a total toss-up. Both with and without prefetching seem to perform identically. So either hardware prefetching is less aggressive in the high powersaving modes, or there is some other interaction with power saving that behaves differently with the explicit software prefetches.
In fact, the difference between prefetching and not prefetching is even more extreme if you change the benchmark. The existing benchmark alternates between runs with prefetching on and off, and it turns out that this helped the "off" variant because the speed increase which occurs in the "on" test partly carries over to the subsequent "off" test2. If you run only the "off" test you get results around 9 GiB/s:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=280, MiB/s=9142
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=277, MiB/s=9241
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=285, MiB/s=8982
... versus about 17 GiB/s for the prefetching version:
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=149, MiB/s=17181
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
So the prefetching version is almost twice as fast.
Let's take a look at what's going on with perf stat, for both the off version:
Performance counter stats for './prefetch-test off':
2907.485684 task-clock (msec) # 1.000 CPUs utilized
3,197,503,204 cycles # 1.100 GHz
2,158,244,139 instructions # 0.67 insns per cycle
429,993,704 branches # 147.892 M/sec
10,956 branch-misses # 0.00% of all branches
... and the on version:
1502.321989 task-clock (msec) # 1.000 CPUs utilized
3,896,143,464 cycles # 2.593 GHz
2,576,880,294 instructions # 0.66 insns per cycle
429,853,720 branches # 286.126 M/sec
11,444 branch-misses # 0.00% of all branches
The difference is that the version with prefetching on consistently runs at the max non-turbo frequency of ~2.6 GHz (I have disabled turbo via an MSR). The version without prefetching, however, has decided to run at a much lower speed of 1.1 GHz. Such large CPU frequency differences often also reflect a large difference in uncore frequency, which can explain the worse bandwidth.
Now we've seen this before, and it is probably an outcome of the Energy Efficient Turbo feature on recent Intel chips, which tries to ramp down the CPU frequency when it determines a process is mostly memory bound, presumably because increased CPU core speed doesn't provide much benefit in those cases. As we can see here, this assumption isn't always true, but it isn't clear to me whether the tradeoff is a bad one in general, or whether the heuristic only occasionally gets it wrong.
1 I'm running the intel_pstate driver, which is the default for Intel chips on recent kernels and which implements "hardware p-states", also known as "HWP". Command used: sudo cpupower -c 0,1,2,3 frequency-set -g performance.
2 Conversely, the slowdown from the "off" test partly carries over into the "on" test, although the effect is less extreme, possibly because the powersaving "ramp up" behavior is faster than "ramp down".