I have an object that is 64 bytes in size:
typedef struct _object{
    int value;
    char pad[60];
} object;
In main I initialize an array of these objects:
volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
    array[i].value = 1;
    _mm_clflush(&array[i]);
}
_mm_mfence();
Then I loop again through each element. This is the loop I count events for:
int tmp;
for(int i=0; i < arr_size-105; i++){
    array[i].value = 2;
    //tmp = array[i].value;
    _mm_mfence();
}
Having the mfence does not really make sense here, but I was trying something else and accidentally found that with the store operation, without mfence I get half a million RFO requests (measured by the PAPI L2_RQSTS.ALL_RFO event), which means the other half million were L1 hits, prefetched before the demand access. However, including mfence results in 1 million RFO requests, giving RFO_HITs, which means the cache lines are only prefetched into L2, not into the L1 cache anymore.
Apart from the fact that the Intel documentation somehow indicates otherwise ("data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction"), I checked with load operations. Without mfence I get up to 2000 L1 hits, whereas with mfence I get up to 1 million L1 hits (measured with the PAPI MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched into L1 for the load instruction.
So it should not be the case that including mfence blocks prefetching. Both the store and load versions take almost the same time: without mfence 5-6 msec, with mfence 20 msec. I went through other questions regarding mfence, but none of them mention what the expected behavior with prefetching is, and I don't see a good enough reason or explanation for why it would block prefetching into the L1 cache with store operations only. Or am I missing something in the mfence description?
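For reference, this is roughly how I read the counters around the timed loop with PAPI. It is only a minimal sketch; the native event name string and the exact setup are assumptions that depend on the PAPI/libpfm version and the CPU (check papi_native_avail):

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    /* The native event name is an assumption; the exact string depends on
     * the PAPI/libpfm version and the CPU (see papi_native_avail). */
    const char *event_name = "L2_RQSTS.ALL_RFO";
    int evset = PAPI_NULL;
    long long count = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
    if (PAPI_add_named_event(evset, event_name) != PAPI_OK) exit(1);

    PAPI_start(evset);
    /* ... the measured store/load loop goes here ... */
    PAPI_stop(evset, &count);

    printf("%s = %lld\n", event_name, count);
    return 0;
}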
I am testing on the Skylake microarchitecture; however, I checked with Broadwell as well and got the same result.
It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:
wrmsr -a 0x1a4 "$((2#1110))"
If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 L2.RFO_MISS and L2.RFO_ALL even without the mfence.
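As an aside, here is a minimal sketch of how the prefetcher-control bits in MSR 0x1a4 can be inspected from C, assuming the bit layout Intel documents for these cores (a set bit disables the corresponding prefetcher), the msr kernel module loaded, and root privileges:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    uint64_t val;
    /* Read MSR 0x1a4 on CPU 0 via the msr driver. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0 || pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
        perror("msr");
        return 1;
    }
    printf("MSR 0x1a4 = 0x%llx\n", (unsigned long long)val);
    printf("  L2 streamer:        %s\n", (val & 1) ? "disabled" : "enabled");
    printf("  L2 adjacent-line:   %s\n", (val & 2) ? "disabled" : "enabled");
    printf("  L1 (DCU) streamer:  %s\n", (val & 4) ? "disabled" : "enabled");
    printf("  L1 (DCU) IP-based:  %s\n", (val & 8) ? "disabled" : "enabled");
    close(fd);
    return 0;
}

So the 2#1110 value above sets bits 1-3 and leaves bit 0 clear, i.e. only the L2 streamer stays enabled; writing 2#1111 disables the streamer as well, and writing 0 re-enables everything.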
First, it is important to note that the L2_RQSTS.RFO_* events do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umask for each of the 0x24 RFO events is:
name       umask
RFO_MISS   0x22
RFO_HIT    0x42
ALL_RFO    0xE2
Note that none of the umask values have the 0x10 bit, which indicates that events originating from the L2 streamer should be tracked.
It seems that when the L2 streamer is active, many of the events you might expect to be assigned to one of those events are instead "eaten" by the L2 prefetcher events. What likely happens is that the L2 prefetcher is running ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This only increments the umask |= 0x10 version of the event (indeed, I get 2,000,000 total references when including that bit), which means that RFO_MISS and RFO_HIT and RFO_ALL will all miss it.
It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly but instead hit an in-progress load; the complication here is that the load was initiated by the L2 prefetcher.
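If you want to count the streamer-inclusive version programmatically rather than through perf, a sketch along these lines should work; it assumes the standard raw encoding for core events (config = umask << 8 | event), counts only the calling thread, and leaves out most error handling:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    /* L2_RQSTS (event 0x24) with umask 0xf2 = ALL_RFO plus the 0x10
     * "L2 streamer origin" bit discussed above. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = (0xf2 << 8) | 0x24;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    long long count = 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the measured store loop goes here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));

    printf("raw event 0x%llx = %lld\n", (unsigned long long)attr.config, count);
    return 0;
}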
The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.
I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.
Here are some useful perf commands you can use to see the difference in including the "L2 streamer origin" bit. Here's without the L2 streamer events:
perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/
and with them included:
perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/
I ran these against this code (with the sleep(1) lining up with the --delay=1000 argument passed to perf to exclude the init code):
#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct _object{
    int value;
    char pad[60];
} object;

int main() {
    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));

    /* Init: write each line, then flush it out of the cache hierarchy. */
    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush((const void*)&array[i]);
    }
    _mm_mfence();

    /* Give perf --delay=1000 time to skip past the init code. */
    sleep(1);
    // printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);

    /* Measured loop: one store per 64-byte line. */
    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i].value;
        // _mm_mfence();
    }
}