Why does sequential array access have a high cache miss rate?

I have the following C code that I am testing to understand perf and caching. It sequentially accesses an array of doubles.

// test.c

#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>

int main(int argc, char **argv)
{
    size_t n = 10000000;
    double* arr;
    if (posix_memalign((void**)&arr, 64, n * sizeof(double)) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        exit(1);
    }

    for (size_t i = 0; i < n; i += 64) {
        _mm_clflush(&arr[i]);
    }

    for (size_t i = 0; i < n; i++) arr[i] = (double)i;

    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i];

    printf("Sum: %f\n", sum);
    free(arr);

    return 0;
}

I compile the code without optimization, and run perf.

$ gcc -o test test.c
$ perf stat -e LLC-loads,LLC-loads-misses -- ./test
Sum: 49999995000000.000000

 Performance counter stats for 'test':

            94,765      LLC-loads:u
            91,979      LLC-loads-misses:u        #   97.06% of all LL-cache accesses

       0.082974254 seconds time elapsed

       0.047802000 seconds user
       0.034893000 seconds sys

A few questions puzzle me.

  • First, LLC-loads is much smaller than I expected. Shouldn't it be close to 10000000/8 (a 64-byte cache line holds 8 doubles)? Or is it possible that perf doesn't profile the whole duration of the run?

  • Second, and more importantly, the miss rate seems far too high. I expect each read miss to bring in a cache line holding 8 doubles, so the overall miss rate should be close to 1/8. Actually, it should be even smaller, since hardware prefetching should help a lot with a sequential scan, right? The small snippet below spells out the arithmetic I have in mind.
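
Just to make that concrete, here is the arithmetic as a tiny standalone snippet (my own illustration, not part of test.c):

#include <stdio.h>

int main(void)
{
    size_t n = 10000000;                  /* same n as in test.c */
    size_t bytes = n * sizeof(double);    /* 80,000,000 bytes */
    size_t lines = bytes / 64;            /* 1,250,000 64-byte cache lines */
    printf("%zu bytes -> %zu cache lines\n", bytes, lines);
    printf("naive expectation: at most 1 miss per 8 loads = %.1f%%\n", 100.0 / 8);
    return 0;
}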

BTW, below is the basic info.

$ gcc --version
gcc (GCC) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.

$ perf --version
perf version 4.18.0-553.84.1.el8_10.x86_64

$ lscpu | grep "Model name"
Model name:          Intel(R) Xeon(R) Gold 6430

$ lscpu | grep "L3 cache"
L3 cache:            61440K

asked Dec 15 '25 by user180574

1 Answer

While it's impossible to say for certain without knowing your exact CPU model and seeing the actual machine code that runs on it, your CPU most likely doesn't have enough cache.

Given

size_t n = 10000000

and 8 bytes per double, that's 80,000,000 bytes of memory, i.e. 80 MB (about 76 MiB).

That almost certainly doesn't fit into whatever cache your CPU has.

A top-end CPU, such as an AMD Ryzen Threadripper PRO 9995WX, has the following cache:

Cache per CPU Package:

L1 Instruction Cache: 96 x 32 KB

L1 Data Cache: 96 x 48 KB

L2 Cache: 96 x 1024 KB

L3 Cache: 384 MB

Most likely, you're not running on a CPU like that.

A more typical CPU, perhaps something like an Intel Xeon Gold 6230R from a few years ago, probably has cache somewhat like this:

Cache per CPU Package:

L1 Instruction Cache: 26 x 32 KB

L1 Data Cache: 26 x 32 KB

L2 Cache: 26 x 1024 KB

L3 Cache: 36 MB

80 MB isn't going to fit into that, no matter how the doubles are packed into cache lines.

After this loop

for (size_t i = 0; i < n; i++) arr[i] = (double)i;

the elements at the beginning of the array have almost certainly been evicted from all of the L1, L2, and L3 caches, so when this runs

double sum = 0;
for (size_t i = 0; i < n; i++) sum += arr[i];

the data has to be retrieved all the way from RAM.
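
One rough way to see that (a sketch of my own, not something from the original test): sum the array twice back to back and time each pass. If the array fit in some level of cache, the second pass would be noticeably faster; at 80 MB it doesn't, so both passes should run at roughly DRAM speed.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t n = 10000000;                      /* same size as the original test */
    double *arr = malloc(n * sizeof(double));
    if (!arr) return 1;
    for (size_t i = 0; i < n; i++) arr[i] = (double)i;

    for (int pass = 0; pass < 2; pass++) {    /* pass 1 starts with whatever pass 0 left in cache */
        double t0 = now_sec();
        double sum = 0;
        for (size_t i = 0; i < n; i++) sum += arr[i];
        double t1 = now_sec();
        printf("pass %d: sum=%f, %.1f ms\n", pass, sum, (t1 - t0) * 1e3);
    }
    free(arr);
    return 0;
}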

EDIT:

The CPU is an Intel(R) Xeon(R) Gold 6430, with the following cache:

Cache per CPU Package:

L1 Instruction Cache: 32 x 32 KB

L1 Data Cache: 32 x 48 KB

L2 Cache: 32 x 2048 KB

L3 Cache: 60 MB

It'd be interesting to see the cache hit rate for a smaller value of n, perhaps

size_t n = 5000000
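
For what it's worth, here is a minimal sketch of how to try that without recompiling for every size (the command-line handling is my addition, not part of the original test.c): take n from argv and rerun perf on either side of the 60 MB L3 boundary.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* default to the original 10,000,000; override with ./test2 <n> */
    size_t n = (argc > 1) ? (size_t)strtoull(argv[1], NULL, 10) : 10000000;
    double *arr = malloc(n * sizeof(double));
    if (!arr) return 1;

    for (size_t i = 0; i < n; i++) arr[i] = (double)i;

    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i];

    printf("n=%zu (%.1f MB), sum=%f\n", n, n * sizeof(double) / 1e6, sum);
    free(arr);
    return 0;
}

$ gcc -o test2 test2.c
$ perf stat -e LLC-loads,LLC-loads-misses -- ./test2 5000000     # 40 MB, should largely fit in the 60 MB L3
$ perf stat -e LLC-loads,LLC-loads-misses -- ./test2 10000000    # 80 MB, clearly does not
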
answered Dec 17 '25 by Andrew Henle
