Whiskey Lake i7-8565U / Ubuntu 18.04 / HT enabled
Consider the following code, which writes whatever garbage data happens to be in registers ymm0 and ymm1 into 16 MiB of statically allocated WB memory in a loop consisting of 6400 iterations (so page fault impact is negligible):
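; rdx = 16 MiB >> 3 (buffer size in 8-byte units), rdi = buffer base
xor rcx, rcx
store_loop:
vmovdqa [rdi + rcx*8], ymm0          ; store the first 32 bytes
vmovdqa [rdi + rcx*8 + 0x20], ymm1   ; store the next 32 bytes
add rcx, 0x08                        ; advance by 8 qwords = 64 bytes
cmp rdx, rcx
ja store_loop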
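As a sanity check for the counters that follow, the number of 64-byte lines the loop writes can be computed directly. This is only a sketch: it assumes the loop shown above is one pass over the buffer and that the 6400 iterations mentioned in the text are repetitions of that pass. The variable names are illustrative.

```python
BUFFER_BYTES = 16 * 1024 * 1024   # 16 MiB buffer
LINE_BYTES = 64                   # each loop iteration stores 2 x 32 bytes
PASSES = 6400                     # assumption: the pass over the buffer is repeated 6400 times

lines_per_pass = BUFFER_BYTES // LINE_BYTES
total_lines = lines_per_pass * PASSES    # 64-byte lines written in total
total_stores = 2 * total_lines           # two vmovdqa stores per line

print(lines_per_pass)   # 262144
print(total_lines)      # 1677721600
print(total_stores)     # 3355443200
```

Under that assumption, the line count is in the same range as the roughly 1.67 billion l2_rqsts.rfo_miss reported in the runs with the L2 prefetchers disabled, and the store count is in the same range as the reported L1-dcache-stores.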
Running it with taskset -c 3 ./bin, I measured RFO requests. Here are the results:
Performance counter stats for 'taskset -c 3 ./bin':
1 695 029 000 L1-dcache-load-misses # 2325,60% of all L1-dcache hits (24,93%)
72 885 527 L1-dcache-loads (24,99%)
3 411 237 144 L1-dcache-stores (25,05%)
946 374 671 l2_rqsts.all_rfo (25,11%)
451 047 123 l2_rqsts.rfo_hit (25,15%)
495 868 337 l2_rqsts.rfo_miss (25,15%)
2 367 931 179 l2_rqsts.all_pf (25,14%)
568 168 558 l2_rqsts.pf_hit (25,08%)
1 785 300 075 l2_rqsts.pf_miss (25,02%)
1 217 663 928 offcore_requests.demand_rfo (24,96%)
1 963 262 031 offcore_response.demand_rfo.any_response (24,91%)
108 536 dTLB-load-misses # 0,20% of all dTLB cache hits (24,91%)
55 540 014 dTLB-loads (24,91%)
26 310 618 dTLB-store-misses (24,91%)
3 412 849 640 dTLB-stores (24,91%)
27 265 942 916 cycles (24,91%)
6,681218065 seconds time elapsed
6,584426000 seconds user
0,096006000 seconds sys
The description of l2_rqsts.all_rfo is:
Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.
This suggests that the DCU can issue some sort of RFO prefetches. That was not clear from the description of the DCU prefetcher in the Intel Optimization Manual, section 2.6.2.4:
Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
So I guess that the DCU prefetcher follows the "access type": if the triggering access is an RFO, then the DCU issues an RFO prefetch.
All of those RFO prefetches should go to L2 along with the demand RFOs, and only some of them (l2_rqsts.rfo_miss) should go to the uncore. offcore_requests.demand_rfo counts only demand RFOs, but l2_rqsts.rfo_miss counts all RFOs (demand + DCU prefetch), meaning that the inequality offcore_requests.demand_rfo < l2_rqsts.rfo_miss should hold.
QUESTION 1: Why is l2_rqsts.rfo_miss much less than offcore_requests.demand_rfo (even l2_rqsts.all_rfo is less than offcore_requests.demand_rfo)?
I expected that offcore_requests.demand_rfo could be matched up with offcore_response.demand_rfo.any_response, so the counts for those two core PMU events should be approximately equal.
QUESTION 2: Why is offcore_response.demand_rfo.any_response about 1.6 times larger than offcore_requests.demand_rfo?
I'm guessing that the L2 streamer also does some RFO prefetches, but those should not be counted in offcore_requests.demand_rfo anyway.
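The two observations behind the questions can be reproduced from the first perf run with a few lines of arithmetic. This is a minimal sketch that only copies the counter values from the output above; the variable names are illustrative.

```python
# Counter values copied from the first perf run above.
all_rfo = 946_374_671
rfo_hit = 451_047_123
rfo_miss = 495_868_337
demand_rfo = 1_217_663_928      # offcore_requests.demand_rfo
any_response = 1_963_262_031    # offcore_response.demand_rfo.any_response

# Internal consistency of the L2 events: hit + miss should be close to all.
hit_plus_miss = rfo_hit + rfo_miss
print(hit_plus_miss - all_rfo)                # 540789

# Question 1: the expected inequality demand_rfo < rfo_miss does not hold.
print(demand_rfo < rfo_miss)                  # False

# Question 2: responses per request.
print(round(any_response / demand_rfo, 2))    # 1.61
```

The L2 events agree with each other to within a fraction of a percent, so the mismatch is between the L2 events and the offcore events rather than within either group.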
UPD:
$ sudo rdmsr -p 3 0x1A4
1
L2-Streamer off
Performance counter stats for 'taskset -c 3 ./bin':
1 672 633 985 L1-dcache-load-misses # 2272,75% of all L1-dcache hits (24,96%)
73 595 056 L1-dcache-loads (25,00%)
3 409 928 481 L1-dcache-stores (25,00%)
1 593 190 436 l2_rqsts.all_rfo (25,04%)
16 582 758 l2_rqsts.rfo_hit (25,07%)
1 579 107 608 l2_rqsts.rfo_miss (25,07%)
124 294 129 l2_rqsts.all_pf (25,07%)
22 674 837 l2_rqsts.pf_hit (25,07%)
102 019 160 l2_rqsts.pf_miss (25,07%)
1 661 232 864 offcore_requests.demand_rfo (25,02%)
3 287 688 173 offcore_response.demand_rfo.any_response (24,98%)
139 247 dTLB-load-misses # 0,25% of all dTLB cache hits (24,94%)
56 823 458 dTLB-loads (24,90%)
26 343 286 dTLB-store-misses (24,90%)
3 384 264 241 dTLB-stores (24,94%)
37 782 766 410 cycles (24,94%)
9,320791474 seconds time elapsed
9,213383000 seconds user
0,099928000 seconds sys
As can be seen, offcore_requests.demand_rfo got closer to l2_rqsts.rfo_miss, but there is still some difference. In the Intel docs for OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD I found the following:
Note: A prefetch promoted to Demand is counted from the promotion point.
So my guess is that L2 prefetches were promoted to demand and counted in the demand offcore requests. But that does not explain the difference between offcore_response.demand_rfo.any_response and offcore_requests.demand_rfo, which is now almost a factor of two:
offcore_requests.demand_rfo 1 661 232 864
vs
offcore_response.demand_rfo.any_response 3 287 688 173
UPD:
$ sudo rdmsr -p 3 0x1A4
3
All L2 prefetchers off
Performance counter stats for 'taskset -c 3 ./bin':
1 686 560 752 L1-dcache-load-misses # 2138,14% of all L1-dcache hits (23,44%)
78 879 830 L1-dcache-loads (23,48%)
3 409 552 015 L1-dcache-stores (23,53%)
1 670 187 931 l2_rqsts.all_rfo (23,56%)
15 674 l2_rqsts.rfo_hit (23,59%)
1 676 538 346 l2_rqsts.rfo_miss (23,58%)
156 206 l2_rqsts.all_pf (23,59%)
14 436 l2_rqsts.pf_hit (23,59%)
173 163 l2_rqsts.pf_miss (23,59%)
1 671 606 174 offcore_requests.demand_rfo (23,59%)
3 301 546 970 offcore_response.demand_rfo.any_response (23,59%)
140 335 dTLB-load-misses # 0,21% of all dTLB cache hits (23,57%)
68 010 546 dTLB-loads (23,53%)
26 329 766 dTLB-store-misses (23,49%)
3 429 416 286 dTLB-stores (23,45%)
39 462 328 435 cycles (23,42%)
9,699770319 seconds time elapsed
9,596304000 seconds user
0,099961000 seconds sys
Now the total number of prefetch requests to L2 (from all prefetchers) is 156 206 l2_rqsts.all_pf.
UPD:
$ sudo rdmsr -p 3 0x1A4
7
Only IP prefetcher enabled
Performance counter stats for 'taskset -c 3 ./bin':
1 672 643 256 L1-dcache-load-misses # 1893,36% of all L1-dcache hits (24,92%)
88 342 382 L1-dcache-loads (24,96%)
3 411 575 868 L1-dcache-stores (25,00%)
1 672 628 218 l2_rqsts.all_rfo (25,04%)
10 585 l2_rqsts.rfo_hit (25,04%)
1 684 510 576 l2_rqsts.rfo_miss (25,04%)
10 042 l2_rqsts.all_pf (25,04%)
4 368 l2_rqsts.pf_hit (25,05%)
9 135 l2_rqsts.pf_miss (25,05%)
1 684 136 160 offcore_requests.demand_rfo (25,05%)
3 316 673 543 offcore_response.demand_rfo.any_response (25,05%)
133 322 dTLB-load-misses # 0,21% of all dTLB cache hits (25,03%)
64 283 883 dTLB-loads (24,99%)
26 195 527 dTLB-store-misses (24,95%)
3 392 779 428 dTLB-stores (24,91%)
39 627 346 050 cycles (24,88%)
9,710779347 seconds time elapsed
9,610209000 seconds user
0,099981000 seconds sys
UPD:
$ sudo rdmsr -p 3 0x1A4
f
All prefetchers disabled
Performance counter stats for 'taskset -c 3 ./bin':
1 695 710 457 L1-dcache-load-misses # 2052,21% of all L1-dcache hits (23,47%)
82 628 503 L1-dcache-loads (23,47%)
3 429 579 614 L1-dcache-stores (23,47%)
1 682 110 906 l2_rqsts.all_rfo (23,51%)
12 315 l2_rqsts.rfo_hit (23,55%)
1 672 591 830 l2_rqsts.rfo_miss (23,55%)
0 l2_rqsts.all_pf (23,55%)
0 l2_rqsts.pf_hit (23,55%)
12 l2_rqsts.pf_miss (23,55%)
1 662 163 396 offcore_requests.demand_rfo (23,55%)
3 282 743 626 offcore_response.demand_rfo.any_response (23,55%)
126 739 dTLB-load-misses # 0,21% of all dTLB cache hits (23,55%)
59 790 090 dTLB-loads (23,55%)
26 373 257 dTLB-store-misses (23,55%)
3 426 860 516 dTLB-stores (23,55%)
38 282 401 051 cycles (23,51%)
9,377335173 seconds time elapsed
9,281050000 seconds user
0,096010000 seconds sys
Even though the prefetchers are disabled, perf reports 12 as pf_miss (reproducible across different runs, with different values). This is probably a counting error. Also, 1 672 591 830 l2_rqsts.rfo_miss is slightly larger than 1 662 163 396 offcore_requests.demand_rfo, which I also tend to interpret as a counting error.
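The 0x1A4 values used in the updates above (1, 3, 7, f) can be decoded bit by bit. The sketch below uses the bit layout Intel has published for this prefetcher-control MSR and assumes it applies unchanged to Whiskey Lake, which is consistent with the captions given for each run. The function name is illustrative.

```python
# Bit layout of MSR 0x1A4 (assumed to apply to this CPU);
# a SET bit DISABLES the corresponding prefetcher.
PREFETCHER_BITS = {
    0: "L2 hardware prefetcher (streamer)",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1D) prefetcher",
    3: "DCU (L1D) IP prefetcher",
}

def disabled_prefetchers(msr_value):
    """Return the names of the prefetchers disabled by this MSR value."""
    return [name for bit, name in PREFETCHER_BITS.items()
            if (msr_value >> bit) & 1]

for value in (0x1, 0x3, 0x7, 0xF):
    print(hex(value), disabled_prefetchers(value))
```

With this layout, 0x7 leaves only the DCU IP prefetcher enabled and 0xf disables all four.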
Hypothesis: DCU RFO prefetches that miss L2 and go off core are counted in offcore_requests.demand_rfo.
The hypothesis works with the L2 streamer switched off: 102 019 160 l2_rqsts.pf_miss + 1 579 107 608 l2_rqsts.rfo_miss = 1 681 126 768, compared with 1 661 232 864 offcore_requests.demand_rfo.
The hypothesis also works in the run with only the IP prefetcher left enabled (0x1A4 = 7): 1 684 510 576 l2_rqsts.rfo_miss, compared with 1 684 136 160 offcore_requests.demand_rfo.
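How well the hypothesis fits can be put in numbers. The sketch below only repeats the arithmetic from the text on the counter values copied from the corresponding runs; rel_diff is an illustrative helper, not part of any tool.

```python
def rel_diff(a, b):
    """Relative difference between two counts, as a fraction of b."""
    return abs(a - b) / b

# Run with 0x1A4 = 1 (L2 streamer off).
pf_miss, rfo_miss, demand_rfo = 102_019_160, 1_579_107_608, 1_661_232_864
predicted = pf_miss + rfo_miss
print(predicted)                                    # 1681126768
print(f"{100 * rel_diff(predicted, demand_rfo):.2f}%")

# Run with 0x1A4 = 7.
rfo_miss_7, demand_rfo_7 = 1_684_510_576, 1_684_136_160
print(f"{100 * rel_diff(rfo_miss_7, demand_rfo_7):.3f}%")
```

The first comparison is off by a little over 1%, the second by a few hundredths of a percent. Since the percentage column in the perf output shows each event was counted only about a quarter of the time, all values are scaled estimates, and differences of that size may be within the noise.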
With all prefetchers turned off, L1-dcache-load-misses is approximately equal to l2_rqsts.rfo_miss, which in turn is approximately equal to offcore_requests.demand_rfo.
The thing I still cannot explain is why offcore_response.demand_rfo.any_response is so much larger than offcore_requests.demand_rfo.
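To see how the remaining discrepancy behaves across configurations, the response-to-request ratio can be computed for every run above. This is a minimal sketch using only the counter values already listed; the run labels are illustrative.

```python
# (label, offcore_requests.demand_rfo, offcore_response.demand_rfo.any_response)
runs = [
    ("first run",  1_217_663_928, 1_963_262_031),
    ("0x1A4 = 1",  1_661_232_864, 3_287_688_173),
    ("0x1A4 = 3",  1_671_606_174, 3_301_546_970),
    ("0x1A4 = 7",  1_684_136_160, 3_316_673_543),
    ("0x1A4 = f",  1_662_163_396, 3_282_743_626),
]

ratios = {label: responses / requests for label, requests, responses in runs}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}")
```

The ratio is about 1.6 in the first run and just under 2 in every run where the L2 streamer is disabled, so it does not appear to depend on which of the remaining prefetchers are active.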
For Question 1, the answer (on Skylake at least, but very likely the same for Whiskey Lake) is that the L2 RFO events don't count requests that are initiated by a prefetch: not when the prefetch is triggered, and not even when the RFO later hits or misses in the L2. You can count these events by setting the prefetch bit on your event (set 0x10 in the umask), and in this case you'll see double counts as described here.
The events you see are a somewhat random subset of RFOs where the L2 prefetcher didn't help. The offcore counters apparently don't have such a problem: even if a request is initiated by a prefetch, it can be promoted to a demand request when a demand hits the request in progress.
You can find additional details here, and you should double check exactly what events your version of perf uses as Intel changed the event definitions as described in that last link.
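Setting the prefetch bit mentioned above amounts to OR-ing 0x10 into the event's umask. The sketch below builds a raw event string in perf's cpu/event=...,umask=.../ syntax. The event code 0x24 and base umask 0x22 for the L2 RFO miss event are assumptions for illustration and should be checked against the event definitions your perf version actually uses; the helper names are hypothetical.

```python
def with_prefetch_bit(umask):
    """Return the umask with bit 0x10 (prefetch-initiated requests) also set."""
    return umask | 0x10

def raw_event(event, umask):
    """Format a raw event in perf's cpu/event=...,umask=.../ syntax."""
    return f"cpu/event={event:#x},umask={umask:#x}/"

# ASSUMPTION: 0x24 / 0x22 as the encoding of the L2 RFO miss event;
# verify against the event list for your CPU and perf version.
L2_RQSTS_EVENT = 0x24
RFO_MISS_UMASK = 0x22

event_string = raw_event(L2_RQSTS_EVENT, with_prefetch_bit(RFO_MISS_UMASK))
print(event_string)   # cpu/event=0x24,umask=0x32/
```

The resulting string can be passed to perf stat -e alongside the named events to compare the two counts directly.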