L2 instruction fetch misses much higher than L1 instruction fetch misses

Tags:

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:

#!/usr/bin/env python
import tempfile
import random
import sys

if __name__ == '__main__':
    functions = list()

    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write("    double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write("    res = pi*r*r*h;\n")
        sys.stdout.write("    res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)


    sys.stdout.write("int main() {\n")
    sys.stdout.write("unsigned int i;\n")
    sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("{}();\n".format(functions[r]))


    sys.stdout.write("}\n")
    sys.stdout.write("}\n")

What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).

What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:

instructions
L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)

I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.

MLC Streamer Disabled
MLC Spatial Prefetcher Disabled
DCU Data Prefetcher Disabled
DCU Instruction Prefetcher Disabled

and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):

perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code   

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':    

    25,108,610,204      instructions                                               
     2,613,075,664      L1-icache-load-misses                                       
     5,065,167,059      r2424                                                       
                17      rf824                                                       

      33.696954142 seconds time elapsed

Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as much the number of instruction fetch misses in L2 than in L1i? In my (simple) mental processor model, if an instruction is looked up in L2, it must have necessarily been looked up in L1i before. Clearly I am wrong, what am I missing?

I then tried to run the same code with all the hardware prefetchers enabled, i.e.

MLC Streamer Enabled
MLC Spatial Prefetcher Enabled
DCU Data Prefetcher Enabled
DCU Instruction Prefetcher Enabled

and the results are the following:

perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':    

    25,109,877,626      instructions                                               
     2,599,883,072      L1-icache-load-misses                                       
     5,054,883,231      r2424                                                       
           908,494      rf824

Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening and although I expected the prefetcher to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test due to the jump-intensive type of workload and data prefetcher has not much to do with this kind of workload. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.

So, to sum up, my question is:

With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?

811

asked Jan 28 '17 13:01

Marco Guerri

1 Answers

The instruction prefetcher can generate requests are that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss if possible.

In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:

ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT

I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.

In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions each nearly fully spans 2 64-btye cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are being called in a non-sequential and upredictable order, so the L1I and L2 prefetchers won't significantly help. The only noteworthy exception are returns, all of which will be predicted correctly using the RSB mechanism.

Each of the 10,000 functions are being called 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 function * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several millions of other requests will be for lines occupied by the loop body. Your measurements show that there are about 30% more instruction fetches that miss in the L1I. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not be even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within two times the number of L1I misses. This is consistent with your numbers.

In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.

answered Sep 30 '22 16:09

Hadi Brais

Related questions
                            
                                How to speed up OpenGrok indexing
                            
                                How to join a data.table with multiple columns and multiple values
                            
                                Speeding up Matlab Engine calls
                            
                                MySQL Update after N rows
                            
                                Why is processing multiple streams of data slower than processing one?
                            
                                Does PHP silently optimize consecutive fseek-commands into one fseek command?
                            
                                Slow SBT start up with many projects
                            
                                How can I optimize this PostgreSQL query that updates every row?
                            
                                Does allocation performance degrade on a large number of live instances when using G1?
                            
                                (self) join by time intervals
                            
                                How can my variable declarations affect execution time
                            
                                Performance is very bad when set label with Thai text in Unity
                            
                                Rails migration extremely slow even when there are no pending migrations
                            
                                How do you access an image's metadata without causing a leak in native memory?
                            
                                Angular 2 change detection through event emitter consumes massive CPU time
                            
                                File uploads slower than user's network upload speed on Apache (EC2)
                            
                                Timings of cURL request by getting details from curl_getinfo()
                            
                                Avoid VBCSCompiler perf hit on Roslyn powered ASP.NET Razor MVC views?
                            
                                Android studio takes too long to open
                            
                                Tensorflow training becomes slower and slower when iteration is more than 10,000. Why?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

L2 instruction fetch misses much higher than L1 instruction fetch misses

Tags:

performance

cpu-architecture

cpu-cache

intel

perf

Marco Guerri

People also ask

1 Answers

Hadi Brais

Recent Activity

Donate For Us