
What is the effect of the second argument in __builtin_prefetch()?

The GCC doc here specifies the usage of __builtin_prefetch.

The third argument works as documented. If it is 0, the compiler generates a prefetchnta (%rax) instruction. If it is 1, it generates prefetcht2 (%rax). If it is 2, it generates prefetcht1 (%rax). If it is 3 (the default), it generates prefetcht0 (%rax).

If we vary the third argument, the opcode changes accordingly.

But the second argument does not seem to have any effect.

__builtin_prefetch(&x,1,2);
__builtin_prefetch(&x,0,2);
__builtin_prefetch(&x,0,1);
__builtin_prefetch(&x,0,0);

The sample code above generated the following assembly:

  27:   0f 18 10                prefetcht1 (%rax)
  2a:   48 8d 45 fc             lea    -0x4(%rbp),%rax
  2e:   0f 18 10                prefetcht1 (%rax)
  31:   48 8d 45 fc             lea    -0x4(%rbp),%rax
  35:   0f 18 18                prefetcht2 (%rax)
  38:   48 8d 45 fc             lea    -0x4(%rbp),%rax
  3c:   0f 18 00                prefetchnta (%rax)

One can observe the opcode changing with the 3rd argument. But even when I change the 2nd argument (which specifies read or write), the assembly stays the same: compare <27,2a> and <2e,31>. So it is not giving any information to the machine. Then what is the purpose of the second argument?

ANTHONY asked Nov 09 '16 18:11


2 Answers

As Margaret points out, one of the args is rw.

Baseline x86-64 (SSE2) does not include write-prefetch instructions, but they exist as ISA extensions. As usual, compilers won't use them unless you tell them you're compiling for a target that supports it. (But they will safely run as a NOP on any non-ancient CPU.)

The two instructions are: PREFETCHW (into L1d cache like PREFETCHT0) and PREFETCHWT1 (into L2 cache like PREFETCHT1). They prefetch a line into Exclusive MESI state by sending out an RFO (Read-For-Ownership). This invalidates every other copy of the line in every other core. From that state, the store buffer can commit data to the line (and flip it to Modified) without any further off-core traffic. Or, if the line is not modified before eviction, it can simply be dropped.

The PREFETCHW instruction is merely a hint and does not affect program behavior. If executed, this instruction moves data closer to the processor and invalidates other cached copies in anticipation of the line being written to in the future.
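As a rough illustration of where a write-intent prefetch might be used, here is a minimal sketch of a scatter-style update whose access pattern a hardware prefetcher is unlikely to predict. The function name scatter_add and the distance PREFETCH_AHEAD are made up for this example, and the distance is an untuned guess. Whether the rw=1 hint becomes PREFETCHW or falls back to a read prefetch depends on the target options.

```c
#include <stddef.h>

/* Hypothetical, untuned prefetch distance (in loop iterations). */
#define PREFETCH_AHEAD 16

/* Hypothetical helper: adds delta to dst[idx[i]] for each i in [0, n). */
void scatter_add(int *dst, const size_t *idx, size_t n, int delta)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 1: we expect to store to this line soon.
             * locality = 3: keep it in all cache levels. */
            __builtin_prefetch(&dst[idx[i + PREFETCH_AHEAD]], 1, 3);
        }
        dst[idx[i]] += delta;
    }
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing can differ.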

They have nearly the same machine encoding, the same 0F 0D opcode, differing only in /1 or /2 in the ModRM /r field. Just like how read-prefetch PREFETCHT0/T1/T2/NTA share an opcode and are differentiated only by /0 (NTA), /1 (T0), etc. in the ModRM /r field. Using /r bits as extra opcode bits is not unique; other one-operand and immediate instructions also do that.

related: Difference between prefetch for read or write


PREFETCHW originally appeared in AMD's 3DNow!, but has its own feature bit so that CPUs can indicate support for it but not other 3DNow! (packed-float in MMX regs) instructions.

PREFETCHWT1 also has its own CPUID feature bit, but might be associated with AVX512PF. It appears to only be available in Xeon Phi (Knights Landing / Knights Mill), not mainstream Skylake-AVX512, same as AVX512PF (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). Evidence: according to Intel's Future Extensions manual, CPUID with EAX=7/ECX=0 gives a feature bitmap in ECX including "Bit 00: PREFETCHWT1 (Intel® Xeon Phi™ only)". Also see the mailing list.
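The two feature bits can be queried at run time. The sketch below assumes an x86 target and the <cpuid.h> helper header shipped with GCC and clang; the function names has_prefetchw and has_prefetchwt1 are made up for this example. The PREFETCHWT1 bit position comes from the manual text quoted above; the PREFETCHW position (leaf 0x80000001, ECX bit 8) is the one Intel's and AMD's manuals document.

```c
#include <cpuid.h>   /* GCC/clang helper header, x86 only */

/* Hypothetical helper: 1 if CPUID reports PREFETCHW, else 0.
 * Extended leaf 0x80000001, ECX bit 8. */
int has_prefetchw(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf not supported */
    return (int)((ecx >> 8) & 1u);
}

/* Hypothetical helper: 1 if CPUID reports PREFETCHWT1, else 0.
 * Leaf 7, subleaf 0, ECX bit 0. */
int has_prefetchwt1(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf not supported */
    return (int)(ecx & 1u);
}
```

Note that a set PREFETCHW bit only says the instruction will not fault; as discussed further down, some CPUs run it as a NOP.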


__builtin_prefetch(p,1,2); compiles as follows with GCC:

  • PREFETCHT1 with no -m options, or -march=haswell or older Intel.
  • PREFETCHW with an AMD target, like -march=k8 or -march=bdver2 (Piledriver).
  • PREFETCHW with -march=broadwell or newer Intel SnB-family, and/or -mprfchw for any arch.
  • PREFETCHWT1 with -mprefetchwt1. (If PREFETCHW is also available, gcc uses it for locality=3, but PREFETCHWT1 for locality<=2.) GCC for some reason doesn't enable this as part of -march=knl or -march=knm, but clang does. I think this is an oversight in GCC.

  • -mprefetchwt1 implies -mprfchw. See also the x86 options section in the GCC manual for more about -march=native vs. -march=whatever to enable a set of ISA extensions and set -mtune=whatever appropriately.
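The mapping in the list above can be mirrored at compile time. This sketch assumes the predefined macros __PRFCHW__ and __PREFETCHWT1__, which GCC and clang define when the corresponding -m option (or an -march that implies it) is active; the function name is made up for this example, and the returned strings just restate the list for an x86 target rather than inspect the generated code.

```c
/* Hypothetical helper: names the instruction that, per the list above,
 * __builtin_prefetch(p, 1, 2) is expected to compile to on x86
 * with the current compiler options. */
const char *write_prefetch_for_locality2(void)
{
#if defined(__PREFETCHWT1__)
    return "prefetchwt1";   /* -mprefetchwt1 */
#elif defined(__PRFCHW__)
    return "prefetchw";     /* -mprfchw, AMD targets, -march=broadwell or newer */
#else
    return "prefetcht1";    /* no write-prefetch extension enabled */
#endif
}
```

Running `gcc -march=broadwell -dM -E - </dev/null | grep -i prf` is one way to see which of these macros a given set of options defines.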

Check it out on the Godbolt compiler explorer, for -march=haswell vs. -march=broadwell -mprefetchwt1. Or modify the compiler args yourself.

clang -O3 -march=knl, and gcc -O3 -march=broadwell -mprefetchwt1 make the same asm:

pref:
        prefetchwt1     [rdi]    #   __builtin_prefetch(p,1,2);  // KNL only, otherwise we get prefetchw
        prefetchw       [rdi]    #   __builtin_prefetch(p,1,3);

        prefetcht0      [rdi]    #   __builtin_prefetch(p,0,3);
        prefetcht1      [rdi]    #   __builtin_prefetch(p,0,2);
        prefetcht2      [rdi]    #   __builtin_prefetch(p,0,1);
        prefetchnta     [rdi]    #   __builtin_prefetch(p,0,0);
        ret
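For reference, a C function along these lines would produce a listing like the one above. This is reconstructed from the comments in the asm, so the parameter type is an assumption.

```c
/* Reconstructed from the asm comments; the int * parameter type is a guess. */
void pref(int *p)
{
    __builtin_prefetch(p, 1, 2);   /* write intent, moderate locality */
    __builtin_prefetch(p, 1, 3);   /* write intent, high locality */

    __builtin_prefetch(p, 0, 3);   /* read intent, high locality (the defaults) */
    __builtin_prefetch(p, 0, 2);
    __builtin_prefetch(p, 0, 1);
    __builtin_prefetch(p, 0, 0);   /* read intent, non-temporal */
}
```

The rw and locality arguments must be compile-time constants, which is why each combination is written out as a separate call.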

Also note that their 0F 0D r/m8 machine code decodes as a multi-byte NOP on non-ancient CPUs that don't have the PREFETCHW or 3DNow! feature-bit. On early 64-bit Intel CPUs, it's an illegal instruction. (Newer versions of Windows require that PREFETCHW executes without faulting, and in that context people talk about a CPU "supporting PREFETCHW" even if it runs as a NOP).

It's possible that CPUs which support PREFETCHW but not PREFETCHWT1 will actually run PREFETCHWT1 as if it were PREFETCHW, but I haven't tested. (It should be testable by running threads on different cores, one doing repeated stores to a location and the other doing PREFETCHWT1 vs. PREFETCHW vs. read prefetch vs. NOP, and see how the writing thread's throughput is affected.)
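The experiment described above could be sketched roughly as follows. This is an untested outline, not a careful benchmark: all names are made up, it assumes POSIX threads and clock_gettime, and it does not pin the two threads to different cores, which a real test would need to do. Which instruction each hint becomes depends on the compiler options (for example -mprfchw or -mprefetchwt1).

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical hint selector for the competing thread. */
enum hint { HINT_NONE, HINT_READ, HINT_WRITE_L3, HINT_WRITE_L2 };

static _Alignas(64) volatile long shared_line;  /* the contended cache line */
static atomic_int stop_flag;

/* Spins issuing the chosen prefetch on shared_line until told to stop. */
static void *prefetcher(void *arg)
{
    enum hint h = *(enum hint *)arg;
    const void *p = (const void *)&shared_line;
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        switch (h) {
        case HINT_READ:     __builtin_prefetch(p, 0, 3); break;
        case HINT_WRITE_L3: __builtin_prefetch(p, 1, 3); break; /* PREFETCHW if enabled */
        case HINT_WRITE_L2: __builtin_prefetch(p, 1, 2); break; /* PREFETCHWT1 if enabled */
        default:            break;                              /* plain spin loop */
        }
    }
    return NULL;
}

/* Times `stores` stores to shared_line while another thread issues hint h.
 * Returns elapsed seconds, or -1.0 if the thread could not be created. */
double time_stores(enum hint h, long stores)
{
    pthread_t t;
    struct timespec a, b;

    atomic_store(&stop_flag, 0);
    if (pthread_create(&t, NULL, prefetcher, &h) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < stores; i++)
        shared_line = i;                /* volatile store, not optimized away */
    clock_gettime(CLOCK_MONOTONIC, &b);

    atomic_store(&stop_flag, 1);
    pthread_join(t, NULL);
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}
```

Comparing time_stores(HINT_NONE, n) against the other hints for a large n would show whether a given hint steals the line from the writing core.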


It might be preferable to use a read-intent prefetch instead of a NOP, though (like GCC does). But you probably don't want to do a PREFETCHW and a PREFETCHT0, because too many prefetch instructions aren't a good thing. (especially for Intel IvyBridge, which has some kind of performance bug for prefetch-instruction throughput. But IvB would run PREFETCHW as a NOP, so you're only getting one prefetch on that uarch.)

Tuning software-prefetch is hard: too much prefetching means fewer execution resources spent doing real work, if HW prefetch does its job successfully. See Cost of a sub-optimal cacheline prefetch and What Every Programmer Should Know About Memory?

Peter Cordes answered Sep 22 '22 12:09


From the same link you posted:

There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read.

The x86 architecture has no distinction between a read and a write prefetch.
This doesn't mean that you should ignore the second argument, since one reason to write code in C is portability. Even if the second argument is not used on your machine, it can be used when compiling for different architectures.

EDIT As @PeterCordes pointed out in his comment, x86 actually has a prefetch instruction for anticipating a write.
It differs from the other prefetch instructions in that it invalidates other cached instances of the fetched line (and sets it to the exclusive state).

Margaret Bloom answered Sep 22 '22 12:09