I want to write a program to measure my cache sizes (L1, L2, L3). I know the general idea of it.
So I wrote a little program. Here's my code:
#include <cstdio>
#include <time.h>
#include <sys/mman.h>

const int KB = 1024;
const int MB = 1024 * KB;
const int data_size = 32 * MB;
const int repeats = 64 * MB;
const int steps = 8 * MB;
const int times = 8;

long long clock_time() {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
}

int main() {
    // allocate memory and lock
    void* map = mmap(NULL, (size_t)data_size, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
    if (map == MAP_FAILED) {
        return 0;
    }
    int* data = (int*)map;

    // write all to avoid paging on demand
    for (int i = 0; i < data_size / sizeof(int); i++) {
        data[i]++;
    }

    int steps[] = { 1*KB, 4*KB, 8*KB, 16*KB, 24*KB, 32*KB, 64*KB, 128*KB,
                    128*KB*2, 128*KB*3, 512*KB, 1*MB, 2*MB, 3*MB, 4*MB,
                    5*MB, 6*MB, 7*MB, 8*MB, 9*MB };
    for (int i = 0; i <= sizeof(steps) / sizeof(int) - 1; i++) {
        double totalTime = 0;
        for (int k = 0; k < times; k++) {
            int size_mask = steps[i] / sizeof(int) - 1;
            long long start = clock_time();
            for (int j = 0; j < repeats; j++) {
                ++data[ (j * 16) & size_mask ];
            }
            long long end = clock_time();
            totalTime += (end - start) / 1000000000.0;
        }
        printf("%d time: %lf\n", steps[i] / KB, totalTime);
    }
    munmap(map, (size_t)data_size);
    return 0;
}
However, the results are quite weird:
1 time: 1.989998
4 time: 1.992945
8 time: 1.997071
16 time: 1.993442
24 time: 1.994212
32 time: 2.002103
64 time: 1.959601
128 time: 1.957994
256 time: 1.975517
384 time: 1.975143
512 time: 2.209696
1024 time: 2.437783
2048 time: 7.006168
3072 time: 5.306975
4096 time: 5.943510
5120 time: 2.396078
6144 time: 4.404022
7168 time: 4.900366
8192 time: 8.998624
9216 time: 6.574195
My CPU is an Intel(R) Core(TM) i3-2350M: 32K L1 data cache, 256K L2 cache, 3072K L3 cache. The timings don't seem to follow any pattern, and I can't read the cache sizes or cache levels from them. Could anybody give some help? Thanks in advance.
Update:
Following @Leeor's advice, I used j*64 instead of j*16. New results:
1 time: 1.996282
4 time: 2.002579
8 time: 2.002240
16 time: 1.993198
24 time: 1.995733
32 time: 2.000463
64 time: 1.968637
128 time: 1.956138
256 time: 1.978266
384 time: 1.991912
512 time: 2.192371
1024 time: 2.262387
2048 time: 3.019435
3072 time: 2.359423
4096 time: 5.874426
5120 time: 2.324901
6144 time: 4.135550
7168 time: 3.851972
8192 time: 7.417762
9216 time: 2.272929
10240 time: 3.441985
11264 time: 3.094753
Two peaks, at 4096K and 8192K. Still weird.
I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, making you almost always hit at L1 or L2 latencies.
More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers
For your benchmark you should either disable the prefetchers (through the BIOS or any other means), or at least make your strides longer by replacing j*16 (16 ints * 4 bytes per int = 64B, one cache line - a classic unit stride for the stream detector) with j*64 (4 cache lines). The reason: the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code jumps over 2 lines, but becomes mostly useless with longer jumps (3 lines isn't good because of your modulo - you need a divisor of step_size).
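To make the change concrete, here is a sketch of the inner timing loop with the wider stride; only the stride constant changes, and data, size_mask and repeats are the variables from your program (64 ints * 4 bytes = 256B, i.e. 4 cache lines):

// Stride of 4 cache lines per access, so the stream prefetcher
// can no longer simply run ahead of the access pattern.
for (int j = 0; j < repeats; j++) {
    ++data[(j * 64) & size_mask];
}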
Update the question with the new results and we can figure out if there's anything else here.
EDIT1: Ok, I ran the fixed code and got -
1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817
It makes much more sense if you ignore a few data points - you jump after 32k (L1 size), but instead of jumping after 256k (L2 size) we get too good a result for 384k and only jump at 512k. The last jump is at 8M (my LLC size), but 9M is broken again.
This allows us to spot the next error - ANDing with the size mask only makes sense when the size is a power of 2; otherwise you don't wrap around, but instead repeat some of the last addresses again (which ends up giving optimistic results since they're fresh in the cache).
Try replacing the ... & size_mask with % (steps[i]/sizeof(int)); the modulo is more expensive, but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size - see the sketch below).
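As an illustration of the running-index option, here is a sketch of what the measured loop could look like; limit, stride and idx are names I made up, and data, repeats and clock_time are taken from your program. It avoids both the mask and the modulo, so it also works for sizes that are not powers of two (384K, 3M, ...):

int limit = (int)(steps[i] / sizeof(int));  // working-set size in ints
int stride = 64;                            // 64 ints = 4 cache lines of 4-byte ints
int idx = 0;
long long start = clock_time();
for (int j = 0; j < repeats; j++) {
    ++data[idx];
    idx += stride;
    if (idx >= limit)
        idx -= limit;                       // wrap around instead of masking
}
long long end = clock_time();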