Some CPUs and compilers provide prefetch instructions, e.g. __builtin_prefetch in GCC. The GCC documentation does describe it, but the description is too brief for me.
I want to know: in practice, when should we use prefetch? Are there some examples?
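For reference, the example in the GCC documentation prefetches a fixed distance ahead in two arrays; it is lightly adapted here into a self-contained function (the prefetch distance j is a parameter):

/* Adapted from the GCC manual's __builtin_prefetch example.
   Signature: __builtin_prefetch(addr, rw, locality) - rw: 0 = read,
   1 = write; locality: 0 (no reuse) .. 3 (high reuse); both are
   optional compile-time constants. Prefetching past the arrays' ends
   is safe: the builtin never faults. */
void add_arrays(int *a, const int *b, int n, int j) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        __builtin_prefetch(&a[i + j], 1, 1);  /* a[] will be written */
        __builtin_prefetch(&b[i + j], 0, 1);  /* b[] is only read */
    }
}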
The goal of prefetching is to make data available in the cache before the data consumer places its request, thereby masking the latency of the slower data source below the cache.
Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').
With prefetching, a process takes a performance hit at startup as the files it uses are preloaded into RAM; depending on the speed and performance of the disk, this delay can be noticeable.
Proactively prefetching data brings the data into the cache before the actual requests occur. Passively caching data, on the other hand, only fetches the missed data from the backend storage after the requests arrive. There is a trade-off between prefetching and caching.
This question isn't really about compilers, as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats, but you can ignore all of them and (carefully) add prefetches directly in assembly code.
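For instance, a minimal sketch (assuming x86-64 and GCC/Clang inline-asm syntax) that emits a prefetch directly, with no intrinsic at all:

/* Emit an x86 prefetcht0 for the cache line containing p. */
static inline void raw_prefetch(const void *p) {
    __asm__ __volatile__("prefetcht0 %0" : : "m"(*(const char *)p));
}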
Now the real question seems to be "when are prefetches useful?", and the answer is: in any scenario where you're bound on memory latency and the access pattern isn't regular and distinguishable enough for the HW prefetcher to capture (organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
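A classic case of such a pattern is a data-dependent (gather-style) access through an index array, which no stream/stride prefetcher can predict; here is a minimal sketch (the function and the prefetch distance are illustrative, not from any particular codebase):

/* The HW prefetcher cannot guess data[idx[i]] since the addresses are
   data-dependent, but idx[] itself streams, so looking ahead is cheap. */
long sum_indirect(const int *data, const int *idx, long n) {
    const long DIST = 16;                 /* prefetch distance - tune it */
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        if (i + DIST < n)
            __builtin_prefetch(&data[idx[i + DIST]]);
        sum += data[idx[i]];
    }
    return sum;
}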
Most compilers only very seldom insert prefetches on their own, so it's basically up to you to play with your code and benchmark whether prefetches are useful.
The link by @Mysticial shows a nice example, but here's a more straightforward one that I think can't be caught by the HW:
#include "stdio.h"
#include "sys/timeb.h"
#include "emmintrin.h"
#define N 4096
#define REP 200
#define ELEM int
int main() {
int i,j, k, b;
const int blksize = 64 / sizeof(ELEM);
ELEM __attribute ((aligned(4096))) a[N][N];
for (i = 0; i < N; ++i) {
for (j = 0; j < N; ++j) {
a[i][j] = 1;
}
}
unsigned long long int sum = 0;
struct timeb start, end;
unsigned long long delta;
ftime(&start);
for (k = 0; k < REP; ++k) {
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j ++) {
sum += a[i][j];
}
}
}
ftime(&end);
delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
printf ("Prefetching off: N=%d, sum=%lld, time=%lld\n", N, sum, delta);
ftime(&start);
sum = 0;
for (k = 0; k < REP; ++k) {
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j += blksize) {
for (b = 0; b < blksize; ++b) {
sum += a[i][j+b];
}
_mm_prefetch(&a[i+1][j], _MM_HINT_T2);
}
}
}
ftime(&end);
delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
printf ("Prefetching on: N=%d, sum=%lld, time=%lld\n", N, sum, delta);
}
What I do here is traverse each matrix line (enjoying the HW prefetcher's help with the consecutive accesses), but prefetch ahead the element with the same column index from the next line, which resides in a different page (and which the HW prefetcher will be hard pressed to catch). I sum the data just so that it's not optimized away; the important thing is that I basically just loop over a matrix. It should have been pretty straightforward and simple to detect, and yet I still get a speedup.
Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:
Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on: N=4096, sum=3355443200, time=1502
Note that the speedup is obtained even though I made the control flow more complicated (an extra loop nesting level); the branch predictor should easily catch the pattern of that short block-size loop, and it saves the execution of unneeded prefetches.
Note that Ivy Bridge and onward should have a "next-page prefetcher", so the HW may be able to mitigate that on those CPUs (if anyone has one available and cares to try, I'll be happy to know). In that case, I'd modify the benchmark to sum every second line (the prefetch would then look ahead two lines each time); that should confuse the hell out of the HW prefetchers.
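For anyone who wants to try that, here is a sketch of the modified timing loop (a drop-in replacement for the "Prefetching on" loop in the program above; untested on Ivy Bridge for lack of hardware):

/* Visit every second row so plain next-page prefetching doesn't help;
   prefetch two rows ahead, i.e. the row we will actually visit next. */
for (k = 0; k < REP; ++k) {
    for (i = 0; i < N; i += 2) {
        for (j = 0; j < N; j += blksize) {
            for (b = 0; b < blksize; ++b) {
                sum += a[i][j + b];
            }
            _mm_prefetch((const char *)&a[i + 2][j], _MM_HINT_T2);
        }
    }
}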
Skylake results
Here are some results from a Skylake i7-6700HQ, running at 2.6 GHz (no turbo), with gcc:
Compile flags: -O3 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on: N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on: N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on: N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on: N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on: N=4096, sum=28147495993344000, time=1253
Compile flags: -O2 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on: N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on: N=4096, sum=28147495993344000, time=1814
So using prefetch is either about 40% slower or 8% faster, depending on whether you use -O3 or -O2 respectively, for this particular example. The big slowdown at -O3 is actually due to a code generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant loop prevents vectorization (on my version of gcc, anyway).
So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still, it's worth noting that you have to be careful not to change code generation in a way that causes a big slowdown.
This test probably isn't ideal in that going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).
On recent Intel chips one reason you apparently might want to use prefetching is to avoid CPU power-saving features artificially limiting your achieved memory bandwidth. In this scenario, simple prefetching can as much as double your performance versus the same code without prefetching, but it depends entirely on the selected power management plan.
I ran a simplified version (code here) of the test in Leeor's answer, which stresses the memory subsystem a bit more (since that's where prefetch will help, hurt, or do nothing). The original test stressed the CPU in parallel with the memory subsystem since it added together every int on each cache line. Since typical memory read bandwidth is in the region of 15 GB/s, that's 3.75 billion integers per second, putting a pretty hard cap on the maximum speed (code that isn't vectorized will usually process 1 int or less per cycle, so a 3.75 GHz CPU will be about equally CPU- and memory-bound).
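The linked code isn't reproduced here, but the gist - touching each cache line only once so that memory rather than the core is the bottleneck - might look something like this hypothetical sketch:

/* Hypothetical sketch, not the exact linked code: read one int per
   64-byte cache line, prefetching a fixed number of lines ahead.
   Prefetching past the end of the buffer is harmless: it never faults. */
long touch_lines(const int *buf, long n_ints) {
    const long stride = 64 / sizeof(int); /* one element per cache line */
    long sum = 0;
    for (long i = 0; i < n_ints; i += stride) {
        __builtin_prefetch(&buf[i + 16 * stride]);  /* ~16 lines ahead */
        sum += buf[i];
    }
    return sum;
}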
First, I got results that seemed to show prefetching kicking butt on my i7-6700HQ (Skylake):
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=204, MiB/s=12549
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=200, MiB/s=12800
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=201, MiB/s=12736
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=197, MiB/s=12994
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Eyeballing the numbers, prefetching achieves something a bit above 16 GiB/s, while without it we get only about 12.5 GiB/s, so prefetch is increasing speed by about 30%. Right?
Not so fast. Remembering that the powersaving mode has all sorts of wonderful interactions on modern chips, I changed my Linux CPU governor to performance from the default of powersave1. Now I get:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=155, MiB/s=16516
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=152, MiB/s=16842
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=159, MiB/s=16100
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=163, MiB/s=15705
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=161, MiB/s=15900
It's a total toss-up. Both with and without prefetching seem to perform identically. So either hardware prefetching is less aggressive in the high powersaving modes, or there is some other interaction with power saving that behaves differently with the explicit software prefetches.
In fact, the difference between prefetching and not prefetching is even more extreme if you change the benchmark. The existing benchmark alternates between runs with prefetching on and off, and it turns out that this helped the "off" variant because the speed increase which occurs in the "on" test partly carries over to the subsequent "off" test2. If you run only the "off" test you get results around 9 GiB/s:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=280, MiB/s=9142
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=277, MiB/s=9241
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=285, MiB/s=8982
... versus about 17 GiB/s for the prefetching version:
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=149, MiB/s=17181
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
So the prefetching version is almost twice as fast.
Let's take a look at what's going on with perf stat, for both the off version:
Performance counter stats for './prefetch-test off':
2907.485684 task-clock (msec) # 1.000 CPUs utilized
3,197,503,204 cycles # 1.100 GHz
2,158,244,139 instructions # 0.67 insns per cycle
429,993,704 branches # 147.892 M/sec
10,956 branch-misses # 0.00% of all branches
... and the on version:
1502.321989 task-clock (msec) # 1.000 CPUs utilized
3,896,143,464 cycles # 2.593 GHz
2,576,880,294 instructions # 0.66 insns per cycle
429,853,720 branches # 286.126 M/sec
11,444 branch-misses # 0.00% of all branches
The difference is that the version with prefetching on consistently runs at the max non-turbo frequency of ~2.6 GHz (I have disabled turbo via an MSR). The version without prefetching, however, has decided to run at a much lower speed of 1.1 GHz. Such large CPU frequency differences often also reflect a large difference in uncore frequency, which can explain the worse bandwidth.
Now we've seen this before, and it is probably an outcome of the Energy Efficient Turbo feature on recent Intel chips, which tries to ramp down the CPU frequency when it determines a process is mostly memory bound, presumably because increased CPU core speed doesn't provide much benefit in those cases. As we can see here, this assumption isn't always true, but it isn't clear to me whether the tradeoff is a bad one in general, or whether the heuristic only occasionally gets it wrong.
1 I'm running the intel_pstate driver, which is the default for Intel chips on recent kernels and which implements "hardware p-states", also known as "HWP". Command used: sudo cpupower -c 0,1,2,3 frequency-set -g performance.
2 Conversely, the slowdown from the "off" test partly carries over into the "on" test, although the effect is less extreme, possibly because the powersaving "ramp up" behavior is faster than "ramp down".