 

Impacts of CPU cache on speed

I just wrote a program to test the impact of the CPU cache on performance.

void* foo(void* ptr)
{
    int* r = (int*) ptr;    /* r points at this thread's slot in the global array s */

    /* interleaved scan: thread (r - s) reads every NUM_THREADS-th element of num[] */
    for (int i = (r - s); i < N; i += NUM_THREADS)
        *r += num[i];
    return NULL;
}

void* bar(void* ptr)
{
    int* r = (int*) ptr;    /* r points at this thread's slot in the global array s */
    int idx = r - s;        /* thread index */
    int block = N/NUM_THREADS;
    int start = idx * block, end = start + block;

    /* block scan: each thread reads one contiguous chunk of num[] */
    for (int i = start; i < end; ++i)
        *r += num[i];
    return NULL;
}

Basically, foo() scans the array in an interleaved fashion, while bar() scans it block by block.
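
Both functions rely on a few globals (num, s, N, NUM_THREADS) and a pthread driver; here is a minimal sketch of that setup with illustrative sizes, not the exact code (that is in the gist linked below):

/* Assumed setup (illustrative only - see the gist for the real code).
   num[] is the shared input array, s[] holds one accumulator per thread,
   and each thread receives a pointer to its own slot in s, so (r - s)
   is its thread index. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define N (1 << 16)          /* illustrative size */
#define NUM_THREADS 4

static int num[N];
static int s[NUM_THREADS];

void* foo(void* ptr);        /* as defined above */
void* bar(void* ptr);        /* as defined above */

static double run(void* (*fn)(void*))
{
    pthread_t tid[NUM_THREADS];
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_create(&tid[t], NULL, fn, &s[t]);
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_join(tid[t], NULL);
    gettimeofday(&t1, NULL);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        num[i] = i;
    printf("%fs\n", run(foo));
    printf("%fs\n", run(bar));
    return 0;
}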

The test result indicates that bar() is much faster:

gcc ping-pong.c -std=gnu99 -lpthread -O2  ; ./a.out
1.077037s
0.395525s

So how should I interpret this result?

The full source code is at: https://gist.github.com/4617935

Update: all if-statements removed

asked Jan 24 '13 by Zifei Tong

2 Answers

It turns out not to be mysterious at all.

I used valgrind to profile the cache misses; here is the result:

$ valgrind --tool=cachegrind --cachegrind-out-file=profile ./a.out
....
$ cg_annotate profile --auto=yes --show=D1mr --context=1
....
-- line 63 ----------------------------------------
    .  void* foo(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     .  
     0      for (int i = (r - s); i < N; i += NUM_THREADS)
16,388          *r += num[i];
     0      return NULL;
     0  }
     .  
-- line 71 ----------------------------------------

-- line 72 ----------------------------------------
     .  void* bar(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     0      int idx = r - s;
     0      int block = N/NUM_THREADS;
     0      int start = idx * block, end = start + block;
     .  
     0      for (int i = start; i < end; ++i)
 4,098          *r += num[i];
     0      return NULL;
     0  }

As you can see, foo() has 4 times more L1 cache read misses than bar(), and 4 is exactly NUM_THREADS.

So, as answered by @Mysticial:

Sequential memory access is almost always going to out-perform non-sequential access.

More non-sequential memory access simply means more cache misses.
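
Those counts also line up with simple cache-line arithmetic, assuming 64-byte cache lines, 4-byte ints and N = 65536 (which is what the numbers above suggest; the real sizes are in the gist):

/* Rough sanity check of the cachegrind counts above.
   Assumptions: 64-byte cache lines, 4-byte ints, N = 65536, 4 threads,
   and a 256 KB array that is much bigger than a 32 KB L1D. */
#include <stdio.h>

int main(void)
{
    int N = 65536, num_threads = 4;
    int ints_per_line = 64 / sizeof(int);   /* 16 ints per cache line */
    int lines = N / ints_per_line;          /* 4096 lines in num[]    */

    /* bar(): each line is brought in once, by the thread that owns
       that block -> ~4096 misses (cachegrind reported 4,098).        */
    printf("bar: ~%d misses\n", lines);

    /* foo(): every thread strides across the whole array, which does
       not fit in L1, so each thread misses on every line again
       -> ~4 * 4096 = 16384 misses (cachegrind reported 16,388).      */
    printf("foo: ~%d misses\n", num_threads * lines);
    return 0;
}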

answered Oct 03 '22 by Zifei Tong


The reason sequential accesses are much faster is not the cache structure itself, but rather HW prefetching (which is related, but not the same thing). There are several "streaming" prefetchers that can recognize a stream or a stride-based access pattern and run ahead, prefetching your data for you.

Some examples (Intel CPUs, but similar principles are commonly used on other CPUs as well): http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

The valgrind profiling suggested here demonstrates it. I would also suggest looking at the L2/L3 levels - on large data sets the useful prefetches are more likely to reside there (rule of thumb: the farther away from the core you are, the more time and storage you have available for aggressive prefetching).
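
For example, the same cachegrind run can be annotated with last-level miss counts as well - something along these lines (DLmr is the last-level data read miss count; check the event names against your valgrind version):

$ valgrind --tool=cachegrind --cachegrind-out-file=profile ./a.out
$ cg_annotate profile --auto=yes --show=D1mr,DLmr --context=1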

answered Oct 03 '22 by Leeor