Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Measuring Cache Latencies

People also ask

How is memory latency measured?

To calculate a module's latency, multiply clock cycle duration by the total number of clock cycles. These numbers will be noted in official engineering documentation on a module's data sheet.

How do I check my CPU cache speed?

On the Command Prompt screen, type wmic cpu get L2CacheSize, L3CacheSize and press the Enter key on the keyboard of your computer. 3. Once the command is executed, you will find L2, L3 Cache size information displayed on the screen.

What is memory latency in computers?

Memory latency is the time (the latency) between initiating a request for a byte or word in memory until it is retrieved by a processor. If the data are not in the processor's cache, it takes longer to obtain them, as the processor will have to communicate with the external memory cells.


I would rather try to use the hardware clock as a measure. The rdtsc instruction will tell you the current cycle count since the CPU was powered up. Also it is better to use asm to make sure always the same instructions are used in both measured and dry runs. Using that and some clever statistics I have made this a long time ago:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>


int i386_cpuid_caches (size_t * data_caches) {
    int i;
    int num_data_caches = 0;
    for (i = 0; i < 32; i++) {

        // Variables to hold the contents of the 4 i386 legacy registers
        uint32_t eax, ebx, ecx, edx; 

        eax = 4; // get cache info
        ecx = i; // cache id

        asm (
            "cpuid" // call i386 cpuid instruction
            : "+a" (eax) // contains the cpuid command code, 4 for cache query
            , "=b" (ebx)
            , "+c" (ecx) // contains the cache id
            , "=d" (edx)
        ); // generates output in 4 registers eax, ebx, ecx and edx 

        // taken from http://download.intel.com/products/processor/manual/325462.pdf Vol. 2A 3-149
        int cache_type = eax & 0x1F; 

        if (cache_type == 0) // end of valid cache identifiers
            break;

        char * cache_type_string;
        switch (cache_type) {
            case 1: cache_type_string = "Data Cache"; break;
            case 2: cache_type_string = "Instruction Cache"; break;
            case 3: cache_type_string = "Unified Cache"; break;
            default: cache_type_string = "Unknown Type Cache"; break;
        }

        int cache_level = (eax >>= 5) & 0x7;

        int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
        int cache_is_fully_associative = (eax >>= 1) & 0x1;


        // taken from http://download.intel.com/products/processor/manual/325462.pdf 3-166 Vol. 2A
        // ebx contains 3 integers of 10, 10 and 12 bits respectively
        unsigned int cache_sets = ecx + 1;
        unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
        unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
        unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;

        // Total cache size is the product
        size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;

        if (cache_type == 1 || cache_type == 3) {
            data_caches[num_data_caches++] = cache_total_size;
        }

        printf(
            "Cache ID %d:\n"
            "- Level: %d\n"
            "- Type: %s\n"
            "- Sets: %d\n"
            "- System Coherency Line Size: %d bytes\n"
            "- Physical Line partitions: %d\n"
            "- Ways of associativity: %d\n"
            "- Total Size: %zu bytes (%zu kb)\n"
            "- Is fully associative: %s\n"
            "- Is Self Initializing: %s\n"
            "\n"
            , i
            , cache_level
            , cache_type_string
            , cache_sets
            , cache_coherency_line_size
            , cache_physical_line_partitions
            , cache_ways_of_associativity
            , cache_total_size, cache_total_size >> 10
            , cache_is_fully_associative ? "true" : "false"
            , cache_is_self_initializing ? "true" : "false"
        );
    }

    return num_data_caches;
}

int test_cache(size_t attempts, size_t lower_cache_size, int * latencies, size_t max_latency) {
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        perror("open");
        abort();
    }
    char * random_data = mmap(
          NULL
        , lower_cache_size
        , PROT_READ | PROT_WRITE
        , MAP_PRIVATE | MAP_ANON // | MAP_POPULATE
        , -1
        , 0
        ); // get some random data
    if (random_data == MAP_FAILED) {
        perror("mmap");
        abort();
    }

    size_t i;
    for (i = 0; i < lower_cache_size; i += sysconf(_SC_PAGESIZE)) {
        random_data[i] = 1;
    }


    int64_t random_offset = 0;
    while (attempts--) {
        // use processor clock timer for exact measurement
        random_offset += rand();
        random_offset %= lower_cache_size;
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mov %4, %%al\n\t"  // load data
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // substract cycle count
            "sbb %3, %%eax"     // substract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            : "m" (random_data[random_offset])
            );
        // printf("%d\n", cycles_used);
        if (cycles_used < max_latency)
            latencies[cycles_used]++;
        else 
            latencies[max_latency - 1]++;
    }

    munmap(random_data, lower_cache_size);

    return 0;
} 

int main() {
    size_t cache_sizes[32];
    int num_data_caches = i386_cpuid_caches(cache_sizes);

    int latencies[0x400];
    memset(latencies, 0, sizeof(latencies));

    int empty_cycles = 0;

    int i;
    int attempts = 1000000;
    for (i = 0; i < attempts; i++) { // measure how much overhead we have for counting cyscles
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // substract cycle count
            "sbb %3, %%eax"     // substract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            :
            );
        if (cycles_used < sizeof(latencies) / sizeof(*latencies))
            latencies[cycles_used]++;
        else 
            latencies[sizeof(latencies) / sizeof(*latencies) - 1]++;

    }

    {
        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                empty_cycles = j;
                fprintf(stderr, "Empty counting takes %d cycles\n", empty_cycles);
                break;
            }
        }
    }

    for (i = 0; i < num_data_caches; i++) {
        test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));

        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                fprintf(stderr, "Cache ID %i has latency %d cycles\n", i, j - empty_cycles);
                break;
            }
        }

    }

    return 0;

}

Output on my Core2Duo:

Cache ID 0:
- Level: 1
- Type: Data Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 2:
- Level: 2
- Type: Unified Cache
- Total Size: 262144 bytes (256 kb)

Cache ID 3:
- Level: 3
- Type: Unified Cache
- Total Size: 3145728 bytes (3072 kb)

Empty counting takes 90 cycles
Cache ID 0 has latency 6 cycles
Cache ID 2 has latency 21 cycles
Cache ID 3 has latency 168 cycles

Widely used classic test for cache latency is iterating over the linked list. It works on modern superscalar/superpipelined CPU and even on Out-of-order cores like ARM Cortex-A9+ and Intel Core 2/ix. This method is used by open-source lmbench - in the test lat_mem_rd (man page) and in CPU-Z latency measurement tool: http://cpuid.com/medias/files/softwares/misc/latency.zip (native Windows binary)

There are sources of lat_mem_rd test from lmbench: https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c

And the main test is

#define ONE p = (char **)*p;
#define FIVE    ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY   TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY

void
benchmark_loads(iter_t iterations, void *cookie)
{
    struct mem_state* state = (struct mem_state*)cookie;
    register char **p = (char**)state->p[0];
    register size_t i;
    register size_t count = state->len / (state->line * 100) + 1;

    while (iterations-- > 0) {
        for (i = 0; i < count; ++i) {
            HUNDRED;
        }
    }

    use_pointer((void *)p);
    state->p[0] = (char*)p;
}

So, after deciphering the macro we do a lot of linear operations like:

 p = (char**) *p;  // (in intel syntax) == mov eax, [eax]
 p = (char**) *p;
 p = (char**) *p;
 ....   // 100 times total
 p = (char**) *p;

over the memory, filled with pointers, every pointing stride elements forward.

As says the man page http://www.bitmover.com/lmbench/lat_mem_rd.8.html

The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point forward one stride. Traversing the array is done by

 p = (char **)*p;

in a for loop (the over head of the for loop is not significant; the loop is an unrolled loop 1000 loads long). The loop stops after doing a million loads. The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.

More detailed description with examples on POWERs is available from IBM's wiki: Untangling memory access measurements - lat_mem_rd - by Jenifer Hopper 2013

The lat_mem_rd test (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) takes two arguments, an array size in MB and a stride size. The benchmark uses two loops to traverse through the array, using the stride as the increment by creating a ring of pointers that point forward one stride. The test measures memory read latency in nanoseconds for the range of memory sizes. The output consists of two columns: the first is the array size in MB (the floating point value) and the second is the load latency over all the points of the array. When the results are graphed, you can clearly see the relative latencies of the entire memory hierarchy, including the faster latency of each cache level, and the main memory latency.

PS: There is paper from Intel (thanks to Eldar Abusalimov) with examples of running lat_mem_rd: ftp://download.intel.com/design/intarch/PAPERS/321074.pdf - sorry right url is http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-cache-latency-bandwidth-paper.pdf "Measuring Cache and Memory Latency and CPU to Memory Bandwidth - For use with Intel Architecture" by Joshua Ruggiero from December 2008:


Ok, several issues with your code:

  1. As you mentioned, your measurement are taking a long time. In fact, they're very likely to take way longer than the single access itself, so they're not measuring anything useful. To mitigate that, access multiple elements, and amortize (divide the overall time by the number of accesses. Note that to measure latency, you want these accesses to be serialized, otherwise they can be performed in parallel and you'll only measure the throughput of unrelated accesses. To achieve that you could just add a false dependency between accesses.

    For e.g., initialize the array to zeros, and do:

    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    for (int i = 0; i < NUM_ACCESSES; ++i) {
        int tmp = arrayAccess[index];                             //Access Value from Main Memory
        index = (index + i + tmp) & 1023;   
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    

    .. and of course remember to divide the time by NUM_ACCESSES.
    Now, i've made the index intentionally complicated so that you avoid a fixed stride which might trigger a prefetcher (a bit of an overkill, you're not likely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32, which would give you strides of 128k (two cache lines), and avoid the "benefit" of most simple adjacent line/ simple stream prefetchers. I've also replaced the % 1000 with & 1023 since & is faster, but it needs to be power of 2 to work the same way - so just increase ACCESS_SIZE to 1024 and it should work.

  2. Invalidating the L1 by loading something else is good, but the sizes look funny. You didn't specify your system but 256000 seems pretty big for an L1. An L2 is usually 256k on many common modern x86 CPUs for e.g. Also note that 256k is not 256000, but rather 256*1024=262144. Same goes for the second size: 1M is not 1024000, it's 1024*1024=1048576. Assuming that's indeed your L2 size (more likely an L3, but probably too small for that).

  3. Your invalidating arrays are of type int, so each element is longer than a single byte (most likely 4 byte, depending on system). You're actually invalidating L1_CACHE_SIZE*sizeof(int) worth of bytes (and the same goes for the L2 invalidation loop)

Update:

  1. memset receives the size in bytes, your sizes are divided by sizeof(int)

  2. Your invalidation reads are never used, and may be optimized out. Try to accumulate the reads in some value and print it in the end, to avoid this possibility.

  3. The memset at the beginning is accessing the data as well, therefor your first loop is accessing data from the L3 (since the other 2 memsets were still effective in evicting it from L1+L2, although only partially due to the size error.

  4. The strides may be too small so you get two access to the same cacheline (L1 hit). Make sure they're spread enough by adding 32 elements (x4 bytes) - that's 2 cacheline, so you also won't get any adjacent cacheline prefetch benefits.

  5. Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the avg time shifts in favor of L1 access latency). Instead try using the L1 size so you access the entire L1 (except for the skips) exactly once. For e.g. like this -

    index = 0;
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index];               //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
        count++;                                           //divide overall time by this 
    }
    

don't forget to increase arrayAccess to L1 size.

Now, with the changes above (more or less), I get something like this:

L1 Cache Access 7.812500
L2 Cache Acces 15.625000
L3 Cache Access 23.437500

Which still seems a bit long, but possibly because it includes an additional dependency on arithmetic operations


Not really an answer but read anyway some thing has already been mentioned in other answers and comments here

well just the other day I answer this question:

  • Cache size estimation on your system?

it is about measurement of L1/L2/.../L?/MEMORY transfer rates take a look at it for better start point of your problem

[Notes]

  1. I strongly recommend to use RDTSC instruction for time measurement

    especially for L1 as anything else is too slow. Do not forget to set process affinity to single CPU because all cores have their own counter and their count differs a lot even on the same input Clock !!!

    Adjust the CPU clock to Maximum for variable clock computers and do not forget to account for RDTSC overflow if you use just 32bit part (modern CPU overflow 32bit counter in a second). For time computation use CPU clock (measure it or use registry value)

    t0 <- RDTSC
    Sleep(250);
    t1 <- RDTSC
    CPU f=(t1-t0)<<2 [Hz]
    
  2. set process affinity to single CPU

    all CPU cores have usually their own L1,L2 caches so on multi-task OS you can measure confusing things if you do not do this

  3. do graphical output (diagram)

    then you will see what actually happens in that link above I posted quite a few plots

  4. use highest process priority available by OS