The following code fragment creates a function (fun) consisting of just one RET instruction. The loop repeatedly calls the function and overwrites the RET instruction after each call returns.
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main() {
    // Allocate an executable page
    char *ins = (char *) mmap(0, 4096, PROT_EXEC | PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    // Just write a RET instruction
    *ins = RET;
    // Make fun point to the function with just the RET instruction
    foo fun = (foo) ins;
    // Repeat 0xfffffff times
    for (long i = 0; i < 0xfffffff; i++) {
        fun();
        *ins = RET;
    }
    return 0;
}
Linux perf on an x86 Broadwell machine reports the following i-cache and iTLB statistics:
perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out
Performance counter stats for './a.out':
805,516,067 L1-icache-load-misses
4,857 iTLB-load-misses
32.052301220 seconds time elapsed
Now, look at the same code without overwriting the RET instruction.
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main() {
    // Allocate an executable page
    char *ins = (char *) mmap(0, 4096, PROT_EXEC | PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    // Just write a RET instruction
    *ins = RET;
    // Make fun point to the function with just the RET instruction
    foo fun = (foo) ins;
    // Repeat 0xfffffff times
    for (long i = 0; i < 0xfffffff; i++) {
        fun();
        // Commented out: *ins = RET;
    }
    return 0;
}
And here are the perf statistics on the same machine.
perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out
Performance counter stats for './a.out':
11,738 L1-icache-load-misses
425 iTLB-load-misses
0.773433500 seconds time elapsed
Notice that overwriting the instruction causes L1-icache-load-misses to grow from 11,738 to 805,516,067 -- an increase of roughly 68,000x. Also notice that iTLB-load-misses grows from 425 to 4,857 -- a more than tenfold increase, though far smaller than the growth in i-cache misses. The running time grows from 0.773433500 seconds to 32.052301220 seconds -- a 41x slowdown!
It is unclear why the CPU should incur i-cache misses when the instruction footprint is so small. The only difference between the two examples is that the instruction is modified. Granted that the L1 iCache and dCache are separate, isn't there a way to install code into the iCache so that the i-cache misses can be avoided?
Furthermore, why is there a 10x growth in the iTLB misses?
Granted that the L1 iCache and dCache are separate, isn't there a way to install code into the iCache so that the i-cache misses can be avoided?
No.
If you want to modify code, the only path this can go is the following: the store first lands in the L1 data cache, the modified line then has to travel through the unified L2 cache, and only from there can it be fetched back into the L1 instruction cache -- the L1 iCache is never written directly. In addition, the CPU detects the store to code it may already have fetched and flushes the pipeline (a machine clear).
Note that you are also missing out on the μOP Cache.
This is illustrated by this diagram [1], which I believe is sufficiently accurate.
I would suspect the iTLB misses could be due to regular TLB flushes. In the no-modification case you are not affected by iTLB misses because your instructions actually come from the μOP Cache.
If they don't, I'm not quite sure. I would think the L1 Instruction Cache is virtually addressed, so there would be no need to access the TLB on a hit.
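One way to check this (not measured on the machine above, and the event names are an assumption -- they vary by CPU model and kernel version) is to also count self-modifying-code machine clears and μOP-cache-delivered uops:
perf stat -e machine_clears.smc -e idq.dsb_uops -e idq.mite_uops ./a.out
A large machine_clears.smc count in the self-modifying version would indicate that the stores to the code page keep flushing the pipeline (and with it the μOP cache), while a dominant idq.dsb_uops share in the unmodified version would confirm that its instructions are indeed delivered from the μOP cache.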
[1]: Unfortunately the image has a very restrictive copyright, so I refrain from highlighting the path / inlining the image.
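As a rough follow-up sketch (assuming Linux on x86-64 and the same setup as in the question; not benchmarked here), the cost can be amortized by writing to the code page rarely and executing it many times between writes, since every call after the first reuses the L1 iCache and the μOP cache:
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>

typedef void (*foo)();
#define RET (0xC3)

int main() {
    char *ins = (char *) mmap(0, 4096, PROT_EXEC | PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ins == MAP_FAILED)
        return 1;
    foo fun = (foo) ins;
    // Hypothetical split of the original 0xfffffff calls: modify the code
    // once per outer iteration and run it many times in between, so the
    // re-fetch through L2 (and the pipeline flush) happens only rarely.
    for (long i = 0; i < 0xfff; i++) {
        *ins = RET;                      // one store to the code page
        for (long j = 0; j < 0xffff; j++)
            fun();                       // these calls reuse L1i / μOP cache
    }
    return 0;
}
The total number of calls is roughly the same as in the original loop, but only 0xfff of them follow a store to the code page, so the self-modification penalty should be paid correspondingly less often.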