I have an object that is 64 bytes in size:
typedef struct _object{
    int value;
    char pad[60];
} object;
In main I initialize an array of these objects:
volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
    array[i].value = 1;
    _mm_clflush(&array[i]);
}
_mm_mfence();
Then I loop again through each element. This is the loop I count events for:
int tmp;
for(int i=0; i < arr_size-105; i++){
    array[i].value = 2;
    //tmp = array[i].value;
    _mm_mfence();
}
Having the mfence does not really make sense here, but I was trying something else and accidentally found that with the store operation, without mfence I get half a million RFO requests (measured by the PAPI L2_RQSTS.ALL_RFO event), which means the other half million were L1 hits, prefetched before the demand access. However, including mfence results in 1 million RFO requests, giving RFO_HITs, which means the cache lines are only prefetched into L2, not into the L1 cache anymore.
Apart from the fact that the Intel documentation somehow indicates otherwise ("data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction"), I checked with load operations. Without mfence I get up to 2000 L1 hits, whereas with mfence I get up to 1 million L1 hits (measured with the PAPI MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched into L1 for the load instruction.
So it should not be the case that including mfence blocks prefetching. Both the store and load versions take almost the same time: without mfence 5-6 msec, with mfence 20 msec. I went through other questions regarding mfence, but none of them mention what the expected behavior with prefetching is, and I don't see a good enough reason or explanation for why it would block prefetching into the L1 cache with store operations only. Or am I missing something in the mfence description?
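For reference, this is roughly how I read the counters around the timed loop with PAPI. It is only a minimal sketch; the native event name string and the exact setup are assumptions that depend on the PAPI/libpfm version and the CPU (check papi_native_avail):

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    /* The native event name is an assumption; the exact string depends on
     * the PAPI/libpfm version and the CPU (see papi_native_avail). */
    const char *event_name = "L2_RQSTS.ALL_RFO";
    int evset = PAPI_NULL;
    long long count = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
    if (PAPI_add_named_event(evset, event_name) != PAPI_OK) exit(1);

    PAPI_start(evset);
    /* ... the measured store/load loop goes here ... */
    PAPI_stop(evset, &count);

    printf("%s = %lld\n", event_name, count);
    return 0;
}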
I am testing on the Skylake microarchitecture; however, I checked with Broadwell as well and got the same result.
It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:
wrmsr -a 0x1a4 "$((2#1110))"
If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 L2.RFO_MISS and L2.RFO_ALL even without the mfence.
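As an aside, here is a minimal sketch of how the prefetcher-control bits in MSR 0x1a4 can be inspected from C, assuming the bit layout Intel documents for these cores (a set bit disables the corresponding prefetcher), the msr kernel module loaded, and root privileges:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    uint64_t val;
    /* Read MSR 0x1a4 on CPU 0 via the msr driver. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0 || pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
        perror("msr");
        return 1;
    }
    printf("MSR 0x1a4 = 0x%llx\n", (unsigned long long)val);
    printf("  L2 streamer:        %s\n", (val & 1) ? "disabled" : "enabled");
    printf("  L2 adjacent-line:   %s\n", (val & 2) ? "disabled" : "enabled");
    printf("  L1 (DCU) streamer:  %s\n", (val & 4) ? "disabled" : "enabled");
    printf("  L1 (DCU) IP-based:  %s\n", (val & 8) ? "disabled" : "enabled");
    close(fd);
    return 0;
}

So the 2#1110 value above sets bits 1-3 and leaves bit 0 clear, i.e. only the L2 streamer stays enabled; writing 2#1111 disables the streamer as well, and writing 0 re-enables everything.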
First, it is important to note that the L2_RQSTS.RFO_* events do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umask for each of the 0x24 RFO events is:
name       umask
RFO_MISS   0x22
RFO_HIT    0x42
ALL_RFO    0xE2
Note that none of the umask values have the 0x10 bit, which indicates that events originating from the L2 streamer should be tracked.
It seems that when the L2 streamer is active, many of the events you might expect to be assigned to one of those events are instead "eaten" by the L2 prefetcher events. What likely happens is that the L2 prefetcher is running ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This only increments the umask |= 0x10 version of the event (indeed, I get 2,000,000 total references when including that bit), which means that RFO_MISS and RFO_HIT and RFO_ALL will all miss it.
It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly but instead hit an in-progress load; the complication here is that the load was initiated by the L2 prefetcher.
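If you want to count the streamer-inclusive version programmatically rather than through perf, a sketch along these lines should work; it assumes the standard raw encoding for core events (config = umask << 8 | event), counts only the calling thread, and leaves out most error handling:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    /* L2_RQSTS (event 0x24) with umask 0xf2 = ALL_RFO plus the 0x10
     * "L2 streamer origin" bit discussed above. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = (0xf2 << 8) | 0x24;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    long long count = 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the measured store loop goes here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));

    printf("raw event 0x%llx = %lld\n", (unsigned long long)attr.config, count);
    return 0;
}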
The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.
I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.
Here are some useful perf commands you can use to see the difference in including the "L2 streamer origin" bit. Here's without the L2 streamer events:
perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/
and with them included:
perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/
I ran these against this code (with the sleep(1) lining up with the --delay=1000 argument passed to perf to exclude the init code):
#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct _object{
    int value;
    char pad[60];
} object;

int main() {
    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));

    /* Init: write each line, then flush it out of the cache hierarchy. */
    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush((const void*)&array[i]);
    }
    _mm_mfence();

    /* Give perf --delay=1000 time to skip past the init code. */
    sleep(1);
    // printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);

    /* Measured loop: one store per 64-byte line. */
    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i].value;
        // _mm_mfence();
    }
}