
Difference between prefetch for read or write

Tags: c, gcc, prefetch

The gcc docs talk about a difference between prefetch for read and prefetch for write. What is the technical difference?

asked May 02 '15 14:05 by user1978011





2 Answers

At the CPU level, a software prefetch (as opposed to one triggered by the hardware itself) is a convenient way to hint to the CPU that a cache line is about to be accessed and that you want it fetched in advance to hide the latency.

If the access will be a simple read, you want a regular prefetch, which behaves much like a normal load from memory (apart from not blocking the CPU if it misses, not faulting if the address is bad, and various other benefits that depend on the microarchitecture).
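As a rough illustration of a read prefetch, here is a minimal sketch that sums an array while hinting that elements further ahead will be read soon. The function name sum_with_read_prefetch and the distance PREFETCH_AHEAD are made up for this example, and the distance is an untuned guess; whether this helps at all depends on the CPU and the access pattern.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; an untuned guess. */
#define PREFETCH_AHEAD 16

/* Hypothetical helper: sums n values, issuing a read prefetch ahead of use. */
long long sum_with_read_prefetch(const long long *a, size_t n)
{
    long long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* rw = 0 requests a prefetch for read; locality = 3 asks to keep
           the line in all cache levels. The prefetch itself does not fault,
           but the address expression should still be valid, hence the check. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        total += a[i];
    }
    return total;
}
```

Calling sum_with_read_prefetch(a, n) returns the same result as a plain loop; the prefetch only affects timing, never the computed value.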

However, if you intend to write to that line, and it also exists in another core, a simple read operation is not enough. This is due to MESI-based cache coherence protocols. A core must have ownership of a line before modifying it, so that coherency is preserved (if the same line were modified in multiple cores, you could not ensure correct ordering of those changes and might even lose some of them, which is not allowed on normal WB memory types). Instead, a write operation starts by acquiring ownership of the line and snooping it out of any other core or socket that may hold a copy. Only then can the write occur. A read operation (demand or prefetch) would have left the line in the other cores in a shared state, which is good if the line is read multiple times by many cores, but doesn't help if your core later writes to it.

To allow useful prefetching for lines that will later be written to, most CPU vendors support special prefetches for writing. On x86, both Intel and AMD support the prefetchW instruction, which should have the effect of a write (i.e. acquiring sole ownership of a line and invalidating any other copy of it). Note that not all CPUs support it (even within the same family, not all generations have it), and not all compiler versions enable it.
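For a loop that stores to memory, the same builtin can be asked for a prefetch for write by passing 1 as the second argument. The sketch below is only illustrative: fill_with_write_prefetch and WRITE_AHEAD are hypothetical names, the distance is an untuned guess, and whether the compiler emits a real write prefetch such as prefetchW or falls back to an ordinary read prefetch depends on the target and the flags in use.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; an untuned guess. */
#define WRITE_AHEAD 16

/* Hypothetical helper: stores value into dst[0..n-1], hinting ahead of
   each store that the upcoming line is about to be written. */
void fill_with_write_prefetch(long long *dst, size_t n, long long value)
{
    for (size_t i = 0; i < n; i++) {
        /* rw = 1 requests a prefetch for write, so the core can try to
           obtain ownership of the line before the store arrives. */
        if (i + WRITE_AHEAD < n)
            __builtin_prefetch(&dst[i + WRITE_AHEAD], 1, 3);
        dst[i] = value;
    }
}
```

The benefit, if any, is expected mainly when other cores hold copies of the destination lines; for core-private data a read prefetch may do just as well.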

Here's an example (with gcc 4.8.2). Note that prefetchW has to be enabled explicitly here:

#include <emmintrin.h>

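int main() {
    long long int a[100];
    /* __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write;
       locality 0 = no temporal locality ... 3 = keep in all cache levels */
    __builtin_prefetch (&a[0], 0, 0);   /* read, locality 0 */
    __builtin_prefetch (&a[16], 0, 1);  /* read, locality 1 */
    __builtin_prefetch (&a[32], 0, 2);  /* read, locality 2 */
    __builtin_prefetch (&a[48], 0, 3);  /* read, locality 3 */
    __builtin_prefetch (&a[64], 1, 0);  /* write, locality 0 */
    return 0;
}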

Compiled with gcc -O3 -mprfchw prefetchw.c -c, this produces:

0000000000000000 <main>:
   0:   48 81 ec b0 02 00 00    sub    $0x2b0,%rsp
   7:   48 8d 44 24 88          lea    -0x78(%rsp),%rax
   c:   0f 18 00                prefetchnta (%rax)
   f:   0f 18 98 80 00 00 00    prefetcht2 0x80(%rax)
  16:   0f 18 90 00 01 00 00    prefetcht1 0x100(%rax)
  1d:   0f 18 88 80 01 00 00    prefetcht0 0x180(%rax)
  24:   0f 0d 88 00 02 00 00    prefetchw 0x200(%rax)
  2b:   31 c0                   xor    %eax,%eax
  2d:   48 81 c4 b0 02 00 00    add    $0x2b0,%rsp
  34:   c3                      retq

If you play with the locality argument (the third argument to __builtin_prefetch), you'll notice that the hint levels are ignored for prefetchW, since it doesn't support temporal locality hints. By the way, if you remove the -mprfchw flag, gcc converts this into a normal read prefetch (I haven't tried different -march settings; some of them may enable it as well).
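One way to check from within the source whether the flag took effect is to test a predefined macro. The sketch below assumes that GCC defines __PRFCHW__ when -mprfchw (or a -march setting that implies it) is active; the function name prfchw_enabled_at_compile_time is made up for this example. It only reports how the file was compiled, not whether the CPU it runs on supports the instruction.

```c
/* Hypothetical helper: reports whether this translation unit was compiled
   with PRFCHW enabled. Assumes the compiler defines __PRFCHW__ in that
   case, as GCC is understood to do with -mprfchw. */
int prfchw_enabled_at_compile_time(void)
{
#ifdef __PRFCHW__
    return 1;   /* a write prefetch may be emitted as prefetchw */
#else
    return 0;   /* a write prefetch likely becomes a read prefetch */
#endif
}
```

Inspecting the disassembly, as above, remains the more direct way to see which instruction was actually emitted.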

answered Sep 20 '22 12:09 by Leeor



The difference relates to whether you expect that memory only to be read soon, or also to be written. In the latter case, the CPU may be able to optimize differently. Remember, a prefetch is only a hint, so GCC or the CPU may ignore it.

To quote the GCC prefetch project page:

Some data prefetch instructions make a distinction between memory which is expected to be read and memory which is expected to be written. When data is to be written, a prefetch instruction can move a block into the cache so that the expected store will be to the cache. Prefetch for write generally brings the data into the cache in an exclusive or modified state.

answered Sep 21 '22 12:09 by David Roundy
