Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Perf and Papi give different values for L3 cache references and misses?

Tags:

I am working on a project where we have to implement an algorithm that is proven in theory to be cache friendly. In simple terms, if N is the input and B is the number of elements that get transferred between the cache and the RAM every time we have a cache miss, the algorithm will require O(N/B) accesses to the RAM.

I would like to show that this is indeed the behavior in practice. To better understand how one can measure various cache related hardware counters, I decided to use different tools. One is Perf and the other is the PAPI library. Unfortunately, the more I work with these tools, the less I understand what they do exactly.

I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB. The cache line size is 64 bytes. I guess that must be the size of the block B.

Let's look at the following example:

#include <iostream>

using namespace std;

struct node{
    int l, r;
};

int main(int argc, char* argv[]){

    int n = 1000000;

    node* A = new node[n];

    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    return 0;
}

Each node requires 8 bytes, which means that a cache line can fit 8 nodes, so I should be expecting approximately 1000000/8 = 125000 L3 cache misses.

Without optimization (no -O3), this is the output from perf:

 perf stat -B -e cache-references,cache-misses ./cachetests 

 Performance counter stats for './cachetests':

       162,813      cache-references                                            
       142,247      cache-misses              #   87.368 % of all cache refs    

   0.007163021 seconds time elapsed

It is pretty close to what we are expecting. Now suppose that we use the PAPI library.

#include <iostream>
#include <papi.h>

using namespace std;

struct node{
    int l, r;
};

void handle_error(int err){
    std::cerr << "PAPI error: " << err << std::endl;
}

int main(int argc, char* argv[]){

    int numEvents = 2;
    long long values[2];
    int events[2] = {PAPI_L3_TCA,PAPI_L3_TCM};

    if (PAPI_start_counters(events, numEvents) != PAPI_OK)
        handle_error(1);

    int n = 1000000;
    node* A = new node[n];
    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
        handle_error(1);

    cout<<"L3 accesses: "<<values[0]<<endl;
    cout<<"L3 misses: "<<values[1]<<endl;
    cout<<"L3 miss/access ratio: "<<(double)values[1]/values[0]<<endl;

    return 0;
}

This is the output that I get:

L3 accesses: 3335
L3 misses: 848
L3 miss/access ratio: 0.254273

Why such a big difference between the two tools?