Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

L2 instruction fetch misses much higher than L1 instruction fetch misses

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:

#!/usr/bin/env python
import tempfile
import random
import sys

if __name__ == '__main__':
    functions = list()

    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write("    double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write("    res = pi*r*r*h;\n")
        sys.stdout.write("    res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)


    sys.stdout.write("int main() {\n")
    sys.stdout.write("unsigned int i;\n")
    sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("{}();\n".format(functions[r]))


    sys.stdout.write("}\n")
    sys.stdout.write("}\n")

What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).

What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:

  • instructions
  • L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
  • r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
  • rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)

I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.

  • MLC Streamer Disabled
  • MLC Spatial Prefetcher Disabled
  • DCU Data Prefetcher Disabled
  • DCU Instruction Prefetcher Disabled

and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):

perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code   

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':    

    25,108,610,204      instructions                                               
     2,613,075,664      L1-icache-load-misses                                       
     5,065,167,059      r2424                                                       
                17      rf824                                                       

      33.696954142 seconds time elapsed 

Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as much the number of instruction fetch misses in L2 than in L1i? In my (simple) mental processor model, if an instruction is looked up in L2, it must have necessarily been looked up in L1i before. Clearly I am wrong, what am I missing?

I then tried to run the same code with all the hardware prefetchers enabled, i.e.

  • MLC Streamer Enabled
  • MLC Spatial Prefetcher Enabled
  • DCU Data Prefetcher Enabled
  • DCU Instruction Prefetcher Enabled

and the results are the following:

perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':    

    25,109,877,626      instructions                                               
     2,599,883,072      L1-icache-load-misses                                       
     5,054,883,231      r2424                                                       
           908,494      rf824

Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening and although I expected the prefetcher to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test due to the jump-intensive type of workload and data prefetcher has not much to do with this kind of workload. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.

So, to sum up, my question is:

With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?

like image 811
Marco Guerri Avatar asked Jan 28 '17 13:01

Marco Guerri


People also ask

What is the difference between L1 cache and L2 cache?

L1 is "level-1" cache memory, usually built onto the microprocessor chip itself. For example, the Intel MMX microprocessor comes with 32 thousand bytes of L1. L2 (that is, level-2) cache memory is on a separate chip (possibly on an expansion card) that can be accessed more quickly than the larger "main" memory.

Why is L2 cache slower than L1?

Multiple-Level Caches Modern systems often use at least two levels of caches, as shown in Figure 8.16. The first-level (L1) cache is small enough to provide a one- or two-cycle access time. The second-level (L2) cache is also built from SRAM but is larger, and therefore slower, than the L1 cache.

Is L1 cache faster than L2?

When it comes to speed, the L2 cache lags behind the L1 cache but is still much faster than your system RAM. The L1 memory cache is typically 100 times faster than your RAM, while the L2 cache is around 25 times faster.

What are the hit rates for L1 and L2 caches in the CPU?

A larger, slower, cheaper L2 can provide all the benefits of a large L1, but without the die size and power consumption penalty. Most modern L1 cache rates have hit rates far above the theoretical 50 percent shown here — Intel and AMD both typically field cache hit rates of 95 percent or higher.


1 Answers

The instruction prefetcher can generate requests are that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss if possible.

In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:

ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT

I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.

In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions each nearly fully spans 2 64-btye cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are being called in a non-sequential and upredictable order, so the L1I and L2 prefetchers won't significantly help. The only noteworthy exception are returns, all of which will be predicted correctly using the RSB mechanism.

Each of the 10,000 functions are being called 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 function * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several millions of other requests will be for lines occupied by the loop body. Your measurements show that there are about 30% more instruction fetches that miss in the L1I. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not be even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within two times the number of L1I misses. This is consistent with your numbers.

In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.

like image 70
Hadi Brais Avatar answered Sep 30 '22 16:09

Hadi Brais