 

Why does modifying an instruction cause huge i-cache and i-TLB misses on x86?

The following code fragment creates a function (fun) with just one RET instruction. The loop repeatedly calls the function and overwrites the contents of the RET instruction after returning.

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main(){
    // Allocate an executable page (fd is -1, as portable MAP_ANONYMOUS code should pass)
    char * ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0);
    if (ins == MAP_FAILED) return 1;
    // Just write a RET instruction
    *ins = RET;
    // make fun point to the function with just RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for(long i = 0; i < 0xfffffff; i++){
        fun();
        *ins = RET;
    }
    return 0;
}

On an x86 Broadwell machine, Linux perf reports the following i-cache and iTLB statistics:

perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out

Performance counter stats for './a.out':

   805,516,067      L1-icache-load-misses                                       
         4,857      iTLB-load-misses                                            

  32.052301220 seconds time elapsed

Now, look at the same code without overwriting the RET instruction.

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main(){
    // Allocate an executable page (fd is -1, as portable MAP_ANONYMOUS code should pass)
    char * ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0);
    if (ins == MAP_FAILED) return 1;
    // Just write a RET instruction
    *ins = RET;
    // make fun point to the function with just RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for(long i = 0; i < 0xfffffff; i++){
        fun();
        // *ins = RET;   (the overwrite is commented out in this variant)
    }
    return 0;
}

And here are the perf statistics on the same machine:

perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out

Performance counter stats for './a.out':

        11,738      L1-icache-load-misses                                       
           425      iTLB-load-misses                                            

   0.773433500 seconds time elapsed

Notice that overwriting the instruction makes L1-icache-load-misses grow from 11,738 to 805,516,067, a roughly 68,600-fold increase. iTLB-load-misses also grow, from 425 to 4,857, which is a much smaller increase of about 11x. The running time grows from 0.773433500 seconds to 32.052301220 seconds, a 41x growth!

It is unclear why the CPU should incur i-cache misses when the instruction footprint is so small. The only difference between the two examples is that the instruction is modified. Granted the L1 iCache and dCache are separate, isn't there a way to install code into iCache so that the i-cache misses can be avoided?

Furthermore, why is there a roughly 11x growth in the iTLB misses?

user1205476 asked Sep 16 '18 17:09


1 Answer

Granted the L1 iCache and dCache are separate, isn't there a way to install code into iCache so that the i-cache misses can be avoided?

No.

If you want to modify code, the store can only take the following path:

  1. Store Data Execution Engine
  2. Store Buffer & Forwarding
  3. L1 Data Cache
  4. Unified L2 Cache
  5. L1 Instruction Cache

Note that the modified code also loses the benefit of the μOP Cache.

This is illustrated by this diagram (see footnote 1), which I believe is sufficiently accurate.

I suspect the iTLB misses are due to regular TLB flushes. Without the modification you are not affected by iTLB misses, because your instructions actually come from the μOP Cache.

If they don't, I'm not quite sure. I would think the L1 Instruction Cache is virtually addressed, so there is no need to access the TLB on a hit.

1: Unfortunately, the image has a very restrictive copyright, so I refrain from highlighting the path or inlining the image.

Zulan answered Sep 18 '22 16:09