 

Impacts of CPU cache on speed

I just wrote a program to test the impact of the CPU cache on performance.

void* foo(void* ptr)
{
    int* r = (int*) ptr;    /* r points at this thread's slot in the global array s */

    /* interleaved scan: thread (r - s) reads every NUM_THREADS-th element of num[] */
    for (int i = (r - s); i < N; i += NUM_THREADS)
        *r += num[i];
    return NULL;
}

void* bar(void* ptr)
{
    int* r = (int*) ptr;    /* r points at this thread's slot in the global array s */
    int idx = r - s;        /* thread index */
    int block = N/NUM_THREADS;
    int start = idx * block, end = start + block;

    /* block scan: each thread reads one contiguous chunk of num[] */
    for (int i = start; i < end; ++i)
        *r += num[i];
    return NULL;
}

Basically, foo() scans the array in an interleaved fashion, while bar() scans it block by block.
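
Both functions rely on a few globals (num, s, N, NUM_THREADS) and a pthread driver; here is a minimal sketch of that setup with illustrative sizes, not the exact code (that is in the gist linked below):

/* Assumed setup (illustrative only - see the gist for the real code).
   num[] is the shared input array, s[] holds one accumulator per thread,
   and each thread receives a pointer to its own slot in s, so (r - s)
   is its thread index. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define N (1 << 16)          /* illustrative size */
#define NUM_THREADS 4

static int num[N];
static int s[NUM_THREADS];

void* foo(void* ptr);        /* as defined above */
void* bar(void* ptr);        /* as defined above */

static double run(void* (*fn)(void*))
{
    pthread_t tid[NUM_THREADS];
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_create(&tid[t], NULL, fn, &s[t]);
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_join(tid[t], NULL);
    gettimeofday(&t1, NULL);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        num[i] = i;
    printf("%fs\n", run(foo));
    printf("%fs\n", run(bar));
    return 0;
}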

The test result indicates that bar() is much faster:

gcc ping-pong.c -std=gnu99 -lpthread -O2  ; ./a.out
1.077037s
0.395525s

So how should I interpret this result?

The full source code is at: https://gist.github.com/4617935

Update: all if-statements removed

asked Jan 24 '13 by Zifei Tong

2 Answers

It turns out not to be mysterious at all.

I used valgrind to profile the cache misses; here is the result:

$ valgrind --tool=cachegrind --cachegrind-out-file=profile ./a.out
....
$ cg_annotate profile --auto=yes --show=D1mr --context=1
....
-- line 63 ----------------------------------------
    .  void* foo(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     .  
     0      for (int i = (r - s); i < N; i += NUM_THREADS)
16,388          *r += num[i];
     0      return NULL;
     0  }
     .  
-- line 71 ----------------------------------------

-- line 72 ----------------------------------------
     .  void* bar(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     0      int idx = r - s;
     0      int block = N/NUM_THREADS;
     0      int start = idx * block, end = start + block;
     .  
     0      for (int i = start; i < end; ++i)
 4,098          *r += num[i];
     0      return NULL;
     0  }

As you can see, foo() has 4 times more L1 cache read misses than bar(), and 4 is exactly NUM_THREADS.

So, as answered by @Mysticial:

Sequential memory access is almost always going to out-perform non-sequential access.

More non-sequential memory access simply means more cache misses.
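
Those counts also line up with simple cache-line arithmetic, assuming 64-byte cache lines, 4-byte ints and N = 65536 (which is what the numbers above suggest; the real sizes are in the gist):

/* Rough sanity check of the cachegrind counts above.
   Assumptions: 64-byte cache lines, 4-byte ints, N = 65536, 4 threads,
   and a 256 KB array that is much bigger than a 32 KB L1D. */
#include <stdio.h>

int main(void)
{
    int N = 65536, num_threads = 4;
    int ints_per_line = 64 / sizeof(int);   /* 16 ints per cache line */
    int lines = N / ints_per_line;          /* 4096 lines in num[]    */

    /* bar(): each line is brought in once, by the thread that owns
       that block -> ~4096 misses (cachegrind reported 4,098).        */
    printf("bar: ~%d misses\n", lines);

    /* foo(): every thread strides across the whole array, which does
       not fit in L1, so each thread misses on every line again
       -> ~4 * 4096 = 16384 misses (cachegrind reported 16,388).      */
    printf("foo: ~%d misses\n", num_threads * lines);
    return 0;
}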

answered Oct 03 '22 by Zifei Tong


The reason sequential accesses are much faster is not the cache structure itself, but rather HW prefetching (which is related, but not the same thing). There are several "streaming" prefetchers that can recognize a stream or a stride-based access pattern and run ahead, prefetching your data for you.

Some examples (Intel CPUs, but similar principles are commonly used on other CPUs as well): http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

The valgrind profiling suggested here demonstrates it. I would also suggest looking at the L2/L3 levels - on large data sets the useful prefetches are more likely to reside there (rule of thumb: the farther away from the core you are, the more time and storage you have available for aggressive prefetching).
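
For example, the same cachegrind run can be annotated with last-level miss counts as well - something along these lines (DLmr is the last-level data read miss count; check the event names against your valgrind version):

$ valgrind --tool=cachegrind --cachegrind-out-file=profile ./a.out
$ cg_annotate profile --auto=yes --show=D1mr,DLmr --context=1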

answered Oct 03 '22 by Leeor