Why are there too many demand RFO offcore responses compared to offcore requests?

Question

Whiskey Lake i7-8565U/Ubuntu 18.04/HT enabled

Consider the following code, which writes whatever garbage data happens to be in registers ymm0 and ymm1 into 16 MiB of statically allocated WB memory in a loop of 6400 iterations (so the impact of page faults is negligible):

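; rdi = base of the buffer (must be 32-byte aligned for vmovdqa)
; rdx = 16MiB >> 3 (buffer size in 8-byte units)
xor rcx, rcx                          ; rcx = index in 8-byte units
store_loop:
vmovdqa [rdi + rcx*8], ymm0           ; store the first 32 bytes
vmovdqa [rdi + rcx*8 + 0x20], ymm1    ; store the next 32 bytes
add rcx, 0x08                         ; advance 8 * 8 = 64 bytes
cmp rdx, rcx
ja store_loop                         ; continue while rcx < rdx (unsigned)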
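To get a rough expectation for the counters, the loop arithmetic can be worked out by hand. The sketch below assumes a 64-byte cache line and that "6400 iterations" means 6400 full passes over the 16 MiB buffer; both are assumptions, and the variable names are only illustrative.

```python
# Expected event counts for the store loop.
# Assumptions (not stated explicitly in the question): 64-byte cache
# lines, and the whole 16 MiB pass is repeated 6400 times.
BUFFER_BYTES = 16 * 1024 * 1024   # 16 MiB buffer
LINE_BYTES = 64                   # assumed cache line size
STORE_BYTES = 32                  # one vmovdqa with a ymm register
PASSES = 6400                     # assumed number of full passes

lines_per_pass = BUFFER_BYTES // LINE_BYTES     # loop iterations per pass
stores_per_pass = BUFFER_BYTES // STORE_BYTES   # two stores per iteration

# One RFO per line if every line has to be fetched for ownership.
expected_lines = lines_per_pass * PASSES
expected_stores = stores_per_pass * PASSES

print(f"lines per pass:  {lines_per_pass:,}")
print(f"expected lines:  {expected_lines:,}")
print(f"expected stores: {expected_stores:,}")
```

Under these assumptions the loop touches 1 677 721 600 cache lines and issues 3 355 443 200 stores, which is in the same range as the L1-dcache-stores and, with prefetchers off, l2_rqsts.rfo_miss counts reported by perf.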

Running it with taskset -c 3 ./bin, I measured the RFO requests generated by this example. Here are the results:

Performance counter stats for 'taskset -c 3 ./bin':

     1 695 029 000      L1-dcache-load-misses     # 2325,60% of all L1-dcache hits    (24,93%)
        72 885 527      L1-dcache-loads                                               (24,99%)
     3 411 237 144      L1-dcache-stores                                              (25,05%)
       946 374 671      l2_rqsts.all_rfo                                              (25,11%)
       451 047 123      l2_rqsts.rfo_hit                                              (25,15%)
       495 868 337      l2_rqsts.rfo_miss                                             (25,15%)
     2 367 931 179      l2_rqsts.all_pf                                               (25,14%)
       568 168 558      l2_rqsts.pf_hit                                               (25,08%)
     1 785 300 075      l2_rqsts.pf_miss                                              (25,02%)
     1 217 663 928      offcore_requests.demand_rfo                                     (24,96%)
     1 963 262 031      offcore_response.demand_rfo.any_response                                     (24,91%)
           108 536      dTLB-load-misses          #    0,20% of all dTLB cache hits   (24,91%)
        55 540 014      dTLB-loads                                                    (24,91%)
        26 310 618      dTLB-store-misses                                             (24,91%)
     3 412 849 640      dTLB-stores                                                   (24,91%)
    27 265 942 916      cycles                                                        (24,91%)

       6,681218065 seconds time elapsed

       6,584426000 seconds user
       0,096006000 seconds sys

The description of l2_rqsts.all_rfo is

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

suggesting that the DCU can issue some sort of RFO prefetches. This was not clear from the description of the DCU prefetcher in the Intel Optimization Manual, section 2.6.2.4:

Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.

So I guess that the DCU prefetcher follows the "access type": if the access is an RFO, the DCU issues an RFO prefetch.

All of those RFO prefetches should go to L2 along with the demand RFOs, and only some of them (l2_rqsts.rfo_miss) should go to the uncore. offcore_requests.demand_rfo counts only demand RFOs, whereas l2_rqsts.rfo_miss counts all RFOs (demand + DCU prefetch), so the inequality offcore_requests.demand_rfo < l2_rqsts.rfo_miss should hold.

QUESTION 1: Why is l2_rqsts.rfo_miss much less than offcore_requests.demand_rfo (even l2_rqsts.all_rfo is less than offcore_requests.demand_rfo)?
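The expectation behind Question 1 can be checked directly against the first run. This is only a sanity check on the reported numbers (copied from the perf output above, which was multiplexed at roughly 25%), not an explanation; the variable names are illustrative.

```python
# Counts from the first perf run (all prefetchers enabled).
all_rfo = 946_374_671             # l2_rqsts.all_rfo
rfo_hit = 451_047_123             # l2_rqsts.rfo_hit
rfo_miss = 495_868_337            # l2_rqsts.rfo_miss
offcore_demand_rfo = 1_217_663_928  # offcore_requests.demand_rfo

# Internal consistency: hits + misses should be close to all_rfo
# (not exact, because the events were multiplexed).
hit_plus_miss = rfo_hit + rfo_miss
rel_gap = abs(hit_plus_miss - all_rfo) / all_rfo

# The expected inequality: demand offcore RFOs < L2 RFO misses.
expectation_holds = offcore_demand_rfo < rfo_miss

print(f"rfo_hit + rfo_miss = {hit_plus_miss:,} (gap to all_rfo: {rel_gap:.3%})")
print(f"offcore demand_rfo < rfo_miss holds: {expectation_holds}")
```

The L2 events are consistent with each other, yet the inequality fails, so the mismatch is between the L2 events and the offcore event rather than a multiplexing artifact within the L2 events.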

I expected each offcore_requests.demand_rfo to be matched by an offcore_response.demand_rfo.any_response, so the two core PMU events should report approximately equal counts.

QUESTION 2: Why is offcore_response.demand_rfo.any_response about 1.6 times offcore_requests.demand_rfo?

I'm guessing that the L2 streamer also issues some RFO prefetches, but those should not be counted in offcore_requests.demand_rfo anyway.

UPD:

$ sudo rdmsr -p 3 0x1A4
1

L2-Streamer off

 Performance counter stats for 'taskset -c 3 ./bin':

     1 672 633 985      L1-dcache-load-misses     # 2272,75% of all L1-dcache hits    (24,96%)
        73 595 056      L1-dcache-loads                                               (25,00%)
     3 409 928 481      L1-dcache-stores                                              (25,00%)
     1 593 190 436      l2_rqsts.all_rfo                                              (25,04%)
        16 582 758      l2_rqsts.rfo_hit                                              (25,07%)
     1 579 107 608      l2_rqsts.rfo_miss                                             (25,07%)
       124 294 129      l2_rqsts.all_pf                                               (25,07%)
        22 674 837      l2_rqsts.pf_hit                                               (25,07%)
       102 019 160      l2_rqsts.pf_miss                                              (25,07%)
     1 661 232 864      offcore_requests.demand_rfo                                     (25,02%)
     3 287 688 173      offcore_response.demand_rfo.any_response                                     (24,98%)
           139 247      dTLB-load-misses          #    0,25% of all dTLB cache hits   (24,94%)
        56 823 458      dTLB-loads                                                    (24,90%)
        26 343 286      dTLB-store-misses                                             (24,90%)
     3 384 264 241      dTLB-stores                                                   (24,94%)
    37 782 766 410      cycles                                                        (24,94%)

       9,320791474 seconds time elapsed

       9,213383000 seconds user
       0,099928000 seconds sys

As can be seen, offcore_requests.demand_rfo got closer to l2_rqsts.rfo_miss, but there is still some difference. In the Intel documentation for OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD I found the following:

Note: A prefetch promoted to Demand is counted from the promotion point.

So my guess is that L2 prefetches were promoted to demand and counted in the demand offcore requests. But that does not explain the difference between offcore_response.demand_rfo.any_response and offcore_requests.demand_rfo, which is now almost a factor of two:

offcore_requests.demand_rfo 1 661 232 864

vs

offcore_response.demand_rfo.any_response 3 287 688 173

UPD:

$ sudo rdmsr -p 3 0x1A4
3

All L2 prefetchers off

 Performance counter stats for 'taskset -c 3 ./bin':

     1 686 560 752      L1-dcache-load-misses     # 2138,14% of all L1-dcache hits    (23,44%)
        78 879 830      L1-dcache-loads                                               (23,48%)
     3 409 552 015      L1-dcache-stores                                              (23,53%)
     1 670 187 931      l2_rqsts.all_rfo                                              (23,56%)
            15 674      l2_rqsts.rfo_hit                                              (23,59%)
     1 676 538 346      l2_rqsts.rfo_miss                                             (23,58%)
           156 206      l2_rqsts.all_pf                                               (23,59%)
            14 436      l2_rqsts.pf_hit                                               (23,59%)
           173 163      l2_rqsts.pf_miss                                              (23,59%)
     1 671 606 174      offcore_requests.demand_rfo                                     (23,59%)
     3 301 546 970      offcore_response.demand_rfo.any_response                                     (23,59%)
           140 335      dTLB-load-misses          #    0,21% of all dTLB cache hits   (23,57%)
        68 010 546      dTLB-loads                                                    (23,53%)
        26 329 766      dTLB-store-misses                                             (23,49%)
     3 429 416 286      dTLB-stores                                                   (23,45%)
    39 462 328 435      cycles                                                        (23,42%)

       9,699770319 seconds time elapsed

       9,596304000 seconds user
       0,099961000 seconds sys

Now the total number of prefetch requests to L2 from all prefetchers (l2_rqsts.all_pf) is 156 206.

UPD:

$ sudo rdmsr -p 3 0x1A4
7

Only the IP prefetcher enabled

 Performance counter stats for 'taskset -c 3 ./bin':

     1 672 643 256      L1-dcache-load-misses     # 1893,36% of all L1-dcache hits    (24,92%)
        88 342 382      L1-dcache-loads                                               (24,96%)
     3 411 575 868      L1-dcache-stores                                              (25,00%)
     1 672 628 218      l2_rqsts.all_rfo                                              (25,04%)
            10 585      l2_rqsts.rfo_hit                                              (25,04%)
     1 684 510 576      l2_rqsts.rfo_miss                                             (25,04%)
            10 042      l2_rqsts.all_pf                                               (25,04%)
             4 368      l2_rqsts.pf_hit                                               (25,05%)
             9 135      l2_rqsts.pf_miss                                              (25,05%)
     1 684 136 160      offcore_requests.demand_rfo                                     (25,05%)
     3 316 673 543      offcore_response.demand_rfo.any_response                                     (25,05%)
           133 322      dTLB-load-misses          #    0,21% of all dTLB cache hits   (25,03%)
        64 283 883      dTLB-loads                                                    (24,99%)
        26 195 527      dTLB-store-misses                                             (24,95%)
     3 392 779 428      dTLB-stores                                                   (24,91%)
    39 627 346 050      cycles                                                        (24,88%)

       9,710779347 seconds time elapsed

       9,610209000 seconds user
       0,099981000 seconds sys

UPD:

$ sudo rdmsr -p 3 0x1A4
f

All prefetchers disabled

 Performance counter stats for 'taskset -c 3 ./bin':

     1 695 710 457      L1-dcache-load-misses     # 2052,21% of all L1-dcache hits    (23,47%)
        82 628 503      L1-dcache-loads                                               (23,47%)
     3 429 579 614      L1-dcache-stores                                              (23,47%)
     1 682 110 906      l2_rqsts.all_rfo                                              (23,51%)
            12 315      l2_rqsts.rfo_hit                                              (23,55%)
     1 672 591 830      l2_rqsts.rfo_miss                                             (23,55%)
                 0      l2_rqsts.all_pf                                               (23,55%)
                 0      l2_rqsts.pf_hit                                               (23,55%)
                12      l2_rqsts.pf_miss                                              (23,55%)
     1 662 163 396      offcore_requests.demand_rfo                                     (23,55%)
     3 282 743 626      offcore_response.demand_rfo.any_response                                     (23,55%)
           126 739      dTLB-load-misses          #    0,21% of all dTLB cache hits   (23,55%)
        59 790 090      dTLB-loads                                                    (23,55%)
        26 373 257      dTLB-store-misses                                             (23,55%)
     3 426 860 516      dTLB-stores                                                   (23,55%)
    38 282 401 051      cycles                                                        (23,51%)

       9,377335173 seconds time elapsed

       9,281050000 seconds user
       0,096010000 seconds sys
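The rdmsr values used in the updates above (1, 3, 7, f) are bit masks in which a set bit disables one prefetcher. The sketch below decodes them using the bit layout Intel has published for MSR 0x1A4 on several Core generations; that this layout applies unchanged to Whiskey Lake is an assumption, although it agrees with the labels given for each run. The function name is illustrative.

```python
# Bit layout of MSR 0x1A4 (a set bit DISABLES the prefetcher).
# Assumption: the layout Intel documents for earlier Core generations
# also applies to the CPU in the question.
PREFETCHER_DISABLE_BITS = {
    0: "L2 hardware (streamer) prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (next-line) prefetcher",
    3: "DCU IP prefetcher",
}

def decode_prefetch_msr(value):
    """Return {prefetcher name: True if enabled, False if disabled}."""
    return {
        name: ((value >> bit) & 1) == 0
        for bit, name in PREFETCHER_DISABLE_BITS.items()
    }

for value in (0x1, 0x3, 0x7, 0xF):
    state = decode_prefetch_msr(value)
    enabled = [name for name, on in state.items() if on]
    print(f"0x1A4 = {value:#x}: enabled -> {enabled or 'none'}")
```

Read this way, the value 7 leaves only the DCU IP prefetcher enabled, and f disables all four.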

Even though the prefetchers are disabled, perf reports 12 for l2_rqsts.pf_miss (a nonzero count is reproducible across runs, though the value differs). This is probably a counting error. Also, l2_rqsts.rfo_miss (1 672 591 830) is slightly larger than offcore_requests.demand_rfo (1 662 163 396), which I also tend to interpret as a counting error.

Hypothesis: DCU RFO prefetches that miss L2 and go offcore are counted in offcore_requests.demand_rfo.

The hypothesis works if the L2 streamer is switched off: 102 019 160 l2_rqsts.pf_miss + 1 579 107 608 l2_rqsts.rfo_miss = 1 681 126 768, versus 1 661 232 864 offcore_requests.demand_rfo.

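The hypothesis also works with only the IP prefetcher enabled (0x1A4 = 7): 1 684 510 576 l2_rqsts.rfo_miss versus 1 684 136 160 offcore_requests.demand_rfo. With all prefetchers disabled (0x1A4 = f) the counts are 1 672 591 830 versus 1 662 163 396.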
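The size of the residual under this hypothesis can be quantified with a few lines. The numbers are copied from the perf runs with 0x1A4 = 1 and 0x1A4 = 7; the helper name is illustrative, and the runs were multiplexed, so an error of a percent or so is within what one might expect.

```python
def relative_gap(predicted, measured):
    """Relative difference between a predicted and a measured count."""
    return abs(predicted - measured) / measured

# Run with 0x1A4 = 1 (L2 streamer off).
pf_miss_1 = 102_019_160        # l2_rqsts.pf_miss
rfo_miss_1 = 1_579_107_608     # l2_rqsts.rfo_miss
demand_rfo_1 = 1_661_232_864   # offcore_requests.demand_rfo
predicted_1 = pf_miss_1 + rfo_miss_1
gap_1 = relative_gap(predicted_1, demand_rfo_1)

# Run with 0x1A4 = 7 (prefetch traffic to L2 is negligible).
pf_miss_7 = 9_135
rfo_miss_7 = 1_684_510_576
demand_rfo_7 = 1_684_136_160
predicted_7 = pf_miss_7 + rfo_miss_7
gap_7 = relative_gap(predicted_7, demand_rfo_7)

print(f"0x1A4 = 1: predicted {predicted_1:,}, measured {demand_rfo_1:,}, gap {gap_1:.2%}")
print(f"0x1A4 = 7: predicted {predicted_7:,}, measured {demand_rfo_7:,}, gap {gap_7:.3%}")
```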

With all prefetchers turned off, L1-dcache-load-misses is approximately equal to l2_rqsts.rfo_miss, which in turn approximately equals offcore_requests.demand_rfo.

What I still cannot explain is why offcore_response.demand_rfo.any_response is so much larger than offcore_requests.demand_rfo.
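To see how the response/request ratio moves across the five configurations, the counts from each perf run can be tabulated. This is a small sketch over the numbers already reported above; the dictionary keys are just labels for the 0x1A4 values.

```python
# (offcore_requests.demand_rfo, offcore_response.demand_rfo.any_response)
# per run; the first run is labelled "default" because its MSR value
# was not shown in the question.
runs = {
    "default":   (1_217_663_928, 1_963_262_031),
    "0x1A4 = 1": (1_661_232_864, 3_287_688_173),
    "0x1A4 = 3": (1_671_606_174, 3_301_546_970),
    "0x1A4 = 7": (1_684_136_160, 3_316_673_543),
    "0x1A4 = f": (1_662_163_396, 3_282_743_626),
}

# Responses per request for every configuration.
ratios = {label: resp / req for label, (req, resp) in runs.items()}

for label, ratio in ratios.items():
    print(f"{label:>10}: responses / requests = {ratio:.3f}")
```

The ratio is noticeably lower in the first run and sits just under 2 in every run where the L2 streamer is disabled.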

BeeOnRope · Accepted Answer

For Question 1, the answer (on Skylake at least, but very likely the same on Whiskey Lake) is that the L2 RFO events don't count RFOs that are initiated by a prefetch: not when the prefetch is triggered, and not even when the RFO later hits or misses in L2. You can count these events by setting the prefetch bit on your event (set 0x10 in the umask), in which case you'll see double counts as described here.
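As a rough illustration of "setting 0x10 in the umask", the sketch below builds raw perf event strings with and without that bit. The event code 0x24 and umask 0x22 for l2_rqsts.rfo_miss are assumptions based on the Skylake event tables, and the event names are made up for this example; confirm the encoding against the event list your perf version ships before relying on it.

```python
PREFETCH_BIT = 0x10       # umask bit named in the answer

# Assumed Skylake-style encoding of l2_rqsts.rfo_miss; verify locally.
L2_RQSTS_EVENT = 0x24
RFO_MISS_UMASK = 0x22

def raw_event(event, umask, name):
    """Format a raw core PMU event in perf's cpu/.../ syntax."""
    return f"cpu/event={event:#x},umask={umask:#x},name={name}/"

demand_only = raw_event(L2_RQSTS_EVENT, RFO_MISS_UMASK, "rfo_miss_demand")
with_prefetch = raw_event(
    L2_RQSTS_EVENT, RFO_MISS_UMASK | PREFETCH_BIT, "rfo_miss_incl_pf"
)

# Command line that would measure both variants side by side.
command = f"perf stat -e {demand_only},{with_prefetch} taskset -c 3 ./bin"
print(command)
```

Measuring both variants in the same run makes it easier to see how many RFOs the plain event leaves out.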

The events you see are a somewhat random subset of RFOs where the L2 prefetcher didn't help. The offcore counters apparently don't have such a problem: even if a request is initiated by a prefetch, it can be promoted to a demand request when a demand hits the request in progress.

You can find additional details here, and you should double check exactly what events your version of perf uses as Intel changed the event definitions as described in that last link.

Tags: x86, assembly, cpu-cache, x86-64, rfo

St.Antario
