
Why are there so many more demand RFO offcore responses than offcore requests?

Whiskey Lake i7-8565U/Ubuntu 18.04/HT enabled

Consider the following code, which writes whatever garbage happens to be in registers ymm0 and ymm1 into 16 MiB of statically allocated WB memory, in a loop of 6400 iterations (so the impact of page faults is negligible):

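; rdi = buffer base, rdx = 16 MiB >> 3 (buffer size in qwords)
xor rcx, rcx                        ; rcx = qword index into the buffer
store_loop:
vmovdqa [rdi + rcx*8], ymm0         ; store the first 32 bytes
vmovdqa [rdi + rcx*8 + 0x20], ymm1  ; store the next 32 bytes
add rcx, 0x08                       ; advance 8 qwords = 64 bytes
cmp rdx, rcx
ja store_loop                       ; continue while rcx < rdx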
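
As a sanity check on the counter values that follow, the expected number of stores and of cache lines touched can be worked out from the loop itself. This is only a sketch: it assumes the 6400 iterations are an outer loop that repeats the whole 16 MiB pass and that a cache line is 64 bytes, and the names used are illustrative.

```python
BUFFER_BYTES = 16 * 1024 * 1024   # 16 MiB buffer, from the question
LINE_BYTES = 64                   # assumed cache-line size
STORE_BYTES = 32                  # one ymm register per vmovdqa
PASSES = 6400                     # assumed: outer loop repeating the full pass


def expected_counts(buffer_bytes=BUFFER_BYTES, passes=PASSES):
    """Return the expected store and cache-line totals for the whole run."""
    lines_per_pass = buffer_bytes // LINE_BYTES
    stores_per_pass = buffer_bytes // STORE_BYTES
    return {
        "stores": stores_per_pass * passes,
        "lines": lines_per_pass * passes,
    }


counts = expected_counts()
print(counts)
```

Under these assumptions the run issues about 3.36e9 stores and touches about 1.68e9 cache lines, which is in line with the roughly 3.41e9 L1-dcache-stores reported in every run and the roughly 1.67e9 l2_rqsts.rfo_miss reported once the prefetchers are switched off.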

Using taskset -c 3 ./bin, I'm measuring the RFO requests generated by this example. Here are the results:

Performance counter stats for 'taskset -c 3 ./bin':

     1695029000      L1-dcache-load-misses     # 2325,60% of all L1-dcache hits    (24,93%)
        72885527      L1-dcache-loads                                               (24,99%)
     3411237144      L1-dcache-stores                                              (25,05%)
       946374671      l2_rqsts.all_rfo                                              (25,11%)
       451047123      l2_rqsts.rfo_hit                                              (25,15%)
       495868337      l2_rqsts.rfo_miss                                             (25,15%)
     2367931179      l2_rqsts.all_pf                                               (25,14%)
       568168558      l2_rqsts.pf_hit                                               (25,08%)
     1785300075      l2_rqsts.pf_miss                                              (25,02%)
     1217663928      offcore_requests.demand_rfo                                     (24,96%)
     1963262031      offcore_response.demand_rfo.any_response                                     (24,91%)
           108536      dTLB-load-misses          #    0,20% of all dTLB cache hits   (24,91%)
        55540014      dTLB-loads                                                    (24,91%)
        26310618      dTLB-store-misses                                             (24,91%)
     3412849640      dTLB-stores                                                   (24,91%)
    27265942916      cycles                                                        (24,91%)

       6,681218065 seconds time elapsed

       6,584426000 seconds user
       0,096006000 seconds sys

The description of l2_rqsts.all_rfo is

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

which suggests that the DCU can issue some sort of RFO prefetches. That was not clear from the description of the DCU prefetcher in the Intel Optimization Manual, section 2.6.2.4:

Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.

So I guess that the DCU prefetcher follows the "access type": if the access is an RFO, then it issues an RFO prefetch.

All of those RFO prefetches should go to L2 along with the demand RFOs, and only some of them (l2_rqsts.rfo_miss) should go on to the uncore. offcore_requests.demand_rfo counts only demand RFOs, whereas l2_rqsts.rfo_miss counts all RFOs (demand + DCU prefetch), so the inequality offcore_requests.demand_rfo < l2_rqsts.rfo_miss should hold.

QUESTION 1: Why is l2_rqsts.rfo_miss much less than offcore_requests.demand_rfo (even l2_rqsts.all_rfo is less than offcore_requests.demand_rfo)?

I expected every offcore_requests.demand_rfo to be matched by an offcore_response.demand_rfo.any_response, so these two core PMU events should report approximately equal counts.

QUESTION 2: Why is offcore_response.demand_rfo.any_response about 1.6 times larger than offcore_requests.demand_rfo?

I'm guessing that the L2 streamer also issues some RFO prefetches, but those should not be counted in offcore_requests.demand_rfo anyway.
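
Both questions can be checked directly against the counters of the first run. The snippet below is a small sketch that copies the four relevant values from the perf output above; the dictionary layout and the function name are illustrative.

```python
# Counter values copied from the first perf run (all prefetchers enabled).
first_run = {
    "l2_rqsts.all_rfo": 946_374_671,
    "l2_rqsts.rfo_miss": 495_868_337,
    "offcore_requests.demand_rfo": 1_217_663_928,
    "offcore_response.demand_rfo.any_response": 1_963_262_031,
}


def check(run):
    """Evaluate the expected inequality and the response/request ratio."""
    requests = run["offcore_requests.demand_rfo"]
    responses = run["offcore_response.demand_rfo.any_response"]
    return {
        "demand_rfo_below_rfo_miss": requests < run["l2_rqsts.rfo_miss"],
        "demand_rfo_below_all_rfo": requests < run["l2_rqsts.all_rfo"],
        "responses_per_request": responses / requests,
    }


result = check(first_run)
print(result)
```

Neither inequality holds for this run, and the response count is roughly 1.6 times the request count.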


UPD:

$ sudo rdmsr -p 3 0x1A4
1

L2-Streamer off

 Performance counter stats for 'taskset -c 3 ./bin':

     1672633985      L1-dcache-load-misses     # 2272,75% of all L1-dcache hits    (24,96%)
        73595056      L1-dcache-loads                                               (25,00%)
     3409928481      L1-dcache-stores                                              (25,00%)
     1593190436      l2_rqsts.all_rfo                                              (25,04%)
        16582758      l2_rqsts.rfo_hit                                              (25,07%)
     1579107608      l2_rqsts.rfo_miss                                             (25,07%)
       124294129      l2_rqsts.all_pf                                               (25,07%)
        22674837      l2_rqsts.pf_hit                                               (25,07%)
       102019160      l2_rqsts.pf_miss                                              (25,07%)
     1661232864      offcore_requests.demand_rfo                                     (25,02%)
     3287688173      offcore_response.demand_rfo.any_response                                     (24,98%)
           139247      dTLB-load-misses          #    0,25% of all dTLB cache hits   (24,94%)
        56823458      dTLB-loads                                                    (24,90%)
        26343286      dTLB-store-misses                                             (24,90%)
     3384264241      dTLB-stores                                                   (24,94%)
    37782766410      cycles                                                        (24,94%)

       9,320791474 seconds time elapsed

       9,213383000 seconds user
       0,099928000 seconds sys

As can be seen, offcore_requests.demand_rfo got closer to l2_rqsts.rfo_miss, but there is still some difference. In the Intel documentation for OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD I found the following:

Note: A prefetch promoted to Demand is counted from the promotion point.

So my guess is that L2 prefetches were promoted to demand and counted in the demand offcore requests. But that does not explain the difference between offcore_response.demand_rfo.any_response and offcore_requests.demand_rfo, which is now almost a factor of two:

offcore_requests.demand_rfo 1 661 232 864

vs

offcore_response.demand_rfo.any_response 3 287 688 173


UPD:

$ sudo rdmsr -p 3 0x1A4
3

All L2 prefetchers off

 Performance counter stats for 'taskset -c 3 ./bin':

     1686560752      L1-dcache-load-misses     # 2138,14% of all L1-dcache hits    (23,44%)
        78879830      L1-dcache-loads                                               (23,48%)
     3409552015      L1-dcache-stores                                              (23,53%)
     1670187931      l2_rqsts.all_rfo                                              (23,56%)
            15674      l2_rqsts.rfo_hit                                              (23,59%)
     1676538346      l2_rqsts.rfo_miss                                             (23,58%)
           156206      l2_rqsts.all_pf                                               (23,59%)
            14436      l2_rqsts.pf_hit                                               (23,59%)
           173163      l2_rqsts.pf_miss                                              (23,59%)
     1671606174      offcore_requests.demand_rfo                                     (23,59%)
     3301546970      offcore_response.demand_rfo.any_response                                     (23,59%)
           140335      dTLB-load-misses          #    0,21% of all dTLB cache hits   (23,57%)
        68010546      dTLB-loads                                                    (23,53%)
        26329766      dTLB-store-misses                                             (23,49%)
     3429416286      dTLB-stores                                                   (23,45%)
    39462328435      cycles                                                        (23,42%)

       9,699770319 seconds time elapsed

       9,596304000 seconds user
       0,099961000 seconds sys

Now the total number of prefetch requests to L2 (from all prefetchers), l2_rqsts.all_pf, is 156 206.


UPD:

$ sudo rdmsr -p 3 0x1A4
7

Only the IP prefetcher enabled

 Performance counter stats for 'taskset -c 3 ./bin':

     1672643256      L1-dcache-load-misses     # 1893,36% of all L1-dcache hits    (24,92%)
        88342382      L1-dcache-loads                                               (24,96%)
     3411575868      L1-dcache-stores                                              (25,00%)
     1672628218      l2_rqsts.all_rfo                                              (25,04%)
            10585      l2_rqsts.rfo_hit                                              (25,04%)
     1684510576      l2_rqsts.rfo_miss                                             (25,04%)
            10042      l2_rqsts.all_pf                                               (25,04%)
             4368      l2_rqsts.pf_hit                                               (25,05%)
             9135      l2_rqsts.pf_miss                                              (25,05%)
     1684136160      offcore_requests.demand_rfo                                     (25,05%)
     3316673543      offcore_response.demand_rfo.any_response                                     (25,05%)
           133322      dTLB-load-misses          #    0,21% of all dTLB cache hits   (25,03%)
        64283883      dTLB-loads                                                    (24,99%)
        26195527      dTLB-store-misses                                             (24,95%)
     3392779428      dTLB-stores                                                   (24,91%)
    39627346050      cycles                                                        (24,88%)

       9,710779347 seconds time elapsed

       9,610209000 seconds user
       0,099981000 seconds sys

UPD:

$ sudo rdmsr -p 3 0x1A4
f

All prefetchers disabled

 Performance counter stats for 'taskset -c 3 ./bin':

     1695710457      L1-dcache-load-misses     # 2052,21% of all L1-dcache hits    (23,47%)
        82628503      L1-dcache-loads                                               (23,47%)
     3429579614      L1-dcache-stores                                              (23,47%)
     1682110906      l2_rqsts.all_rfo                                              (23,51%)
            12315      l2_rqsts.rfo_hit                                              (23,55%)
     1672591830      l2_rqsts.rfo_miss                                             (23,55%)
                 0      l2_rqsts.all_pf                                               (23,55%)
                 0      l2_rqsts.pf_hit                                               (23,55%)
                12      l2_rqsts.pf_miss                                              (23,55%)
     1662163396      offcore_requests.demand_rfo                                     (23,55%)
     3282743626      offcore_response.demand_rfo.any_response                                     (23,55%)
           126739      dTLB-load-misses          #    0,21% of all dTLB cache hits   (23,55%)
        59790090      dTLB-loads                                                    (23,55%)
        26373257      dTLB-store-misses                                             (23,55%)
     3426860516      dTLB-stores                                                   (23,55%)
    38282401051      cycles                                                        (23,51%)

       9,377335173 seconds time elapsed

       9,281050000 seconds user
       0,096010000 seconds sys
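
The 0x1A4 values read in the updates above can be decoded with a small helper. This is a sketch that assumes the bit layout Intel has published for this MSR on client cores (bit 0 = L2 streamer, bit 1 = L2 adjacent line prefetcher, bit 2 = DCU prefetcher, bit 3 = DCU IP prefetcher, where a set bit disables the prefetcher); the function name is illustrative and the layout should be confirmed for the specific CPU.

```python
# Assumed bit layout of MSR 0x1A4: a set bit DISABLES the named prefetcher.
PREFETCHER_BITS = {
    0: "L2 streamer",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU prefetcher",
    3: "DCU IP prefetcher",
}


def decode_0x1a4(hex_value):
    """Split the prefetchers into (disabled, enabled) for an rdmsr hex string."""
    value = int(hex_value, 16)
    disabled = [n for bit, n in PREFETCHER_BITS.items() if (value >> bit) & 1]
    enabled = [n for bit, n in PREFETCHER_BITS.items() if not (value >> bit) & 1]
    return disabled, enabled


for raw in ("1", "3", "7", "f"):
    disabled, enabled = decode_0x1a4(raw)
    print(raw, "disabled:", disabled, "enabled:", enabled)
```

With this layout, the value 7 leaves only the DCU IP prefetcher enabled and f disables all four, matching the labels used for the runs above.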

Even though the prefetchers are disabled, perf reports 12 for l2_rqsts.pf_miss (reproducible across runs, with different values). This is probably a counting error. Also, l2_rqsts.rfo_miss (1 672 591 830) is slightly larger than offcore_requests.demand_rfo (1 662 163 396), which I also tend to interpret as a counting error.


Hypothesis: DCU RFO prefetches that miss L2 and go off core are counted in offcore_requests.demand_rfo.

The hypothesis works if the L2 streamer is switched off: 102 019 160 l2_rqsts.pf_miss + 1 579 107 608 l2_rqsts.rfo_miss = 1 681 126 768, versus 1 661 232 864 offcore_requests.demand_rfo.

The hypothesis also works with the prefetchers turned off (figures from the 0x1A4 = 7 run, where only the IP prefetcher remains enabled): 1 684 510 576 l2_rqsts.rfo_miss versus 1 684 136 160 offcore_requests.demand_rfo.
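
The same comparison can be repeated for every run in which the L2 streamer is off. The sketch below copies l2_rqsts.pf_miss, l2_rqsts.rfo_miss and offcore_requests.demand_rfo from the four corresponding perf outputs, keyed by the 0x1A4 value; the helper name is illustrative.

```python
# (pf_miss, rfo_miss, demand_rfo) per run, keyed by the MSR 0x1A4 value.
runs = {
    "1": (102_019_160, 1_579_107_608, 1_661_232_864),
    "3": (173_163, 1_676_538_346, 1_671_606_174),
    "7": (9_135, 1_684_510_576, 1_684_136_160),
    "f": (12, 1_672_591_830, 1_662_163_396),
}


def relative_gap(pf_miss, rfo_miss, demand_rfo):
    """Relative difference between pf_miss + rfo_miss and demand_rfo."""
    return (pf_miss + rfo_miss - demand_rfo) / demand_rfo


gaps = {msr: relative_gap(*values) for msr, values in runs.items()}
for msr, gap in gaps.items():
    print(f"0x1A4={msr}: {gap:+.3%}")
```

In all four runs the sum stays within about 1.2% of offcore_requests.demand_rfo. Since each event was multiplexed and counted for only about a quarter of the run, a gap of that size does not by itself contradict the hypothesis.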

With all prefetchers turned off, L1-dcache-load-misses is approximately equal to l2_rqsts.rfo_miss, which in turn is approximately equal to offcore_requests.demand_rfo.

What I still cannot explain is why offcore_response.demand_rfo.any_response is so much larger than offcore_requests.demand_rfo.
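
To put numbers on that gap, the response/request ratio can be computed for all five runs. The sketch below copies offcore_requests.demand_rfo and offcore_response.demand_rfo.any_response from each perf output; the "default" key stands for the first run, where 0x1A4 was not shown and all prefetchers were enabled.

```python
# (demand_rfo requests, demand_rfo any_response) per run.
runs = {
    "default": (1_217_663_928, 1_963_262_031),
    "1": (1_661_232_864, 3_287_688_173),
    "3": (1_671_606_174, 3_301_546_970),
    "7": (1_684_136_160, 3_316_673_543),
    "f": (1_662_163_396, 3_282_743_626),
}

ratios = {key: resp / req for key, (req, resp) in runs.items()}
for key, ratio in ratios.items():
    print(f"{key:>7}: {ratio:.3f} responses per request")
```

The ratio is about 1.61 with all prefetchers enabled and settles at roughly 1.97 to 1.98 in every run where the L2 streamer is off.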

St.Antario asked Mar 02 '23 18:03



1 Answer

For Question 1, the answer (on Skylake at least, but very likely the same on Whiskey Lake) is that the L2 RFO events don't count RFOs that are initiated by a prefetch: not when the prefetch is triggered, and not even when the RFO later hits or misses in the L2. You can count these events by setting the prefetch bit on your event (set 0x10 in the umask), and in this case you'll see double counts as described here.
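
Setting the prefetch bit amounts to OR-ing 0x10 into the umask of the event and passing the result to perf as a raw event. The snippet below is a sketch only: the base encoding for l2_rqsts.rfo_miss (event 0x24, umask 0x22) is an assumption based on the Skylake event tables and should be checked against the event list for the CPU and perf version in use, and the helper and event names are hypothetical.

```python
PREFETCH_BIT = 0x10  # from the answer: "set 0x10 in the umask"


def with_prefetch_bit(umask):
    """Return the umask with the prefetch bit set."""
    return umask | PREFETCH_BIT


def perf_raw_event(event, umask, name):
    """Format a raw core PMU event in perf's cpu/.../ syntax."""
    return f"cpu/event={event:#x},umask={umask:#x},name={name}/"


# Assumed base encoding for l2_rqsts.rfo_miss: event 0x24, umask 0x22.
spec = perf_raw_event(0x24, with_prefetch_bit(0x22), "l2_rfo_miss_incl_pf")
print(spec)
```

The resulting string can be given to perf stat -e alongside the named events to compare the two counts side by side.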

The events you see are a somewhat random subset of RFOs where the L2 prefetcher didn't help. The offcore counters apparently don't have such a problem: even if a request is initiated by a prefetch, it can be promoted to a demand request when a demand hits the request in progress.

You can find additional details here, and you should double check exactly what events your version of perf uses as Intel changed the event definitions as described in that last link.

BeeOnRope answered Apr 28 '23 06:04
