I'm trying to come up with an example program which would have a high cache-miss rate. I thought I could try accessing a matrix column by column like so:
#include <stdlib.h>

int main(void)
{
    int i, j, k;
    int w = 1000;
    int h = 1000;

    /* Allocate w rows of h ints each (each row is a separate allocation). */
    int **block = malloc(w * sizeof(int *));
    for (i = 0; i < w; i++) {
        block[i] = malloc(h * sizeof(int));
    }

    /* Walk the matrix column by column: the inner loop varies the row index,
       so consecutive writes land h ints apart and touch different cache lines.
       (Note that the swapped indexing block[j][i] assumes w == h.) */
    for (k = 0; k < 10; k++) {
        for (i = 0; i < w; i++) {
            for (j = 0; j < h; j++) {
                block[j][i] = 0;
            }
        }
    }
    return 0;
}
When I compile this with the -O0 flag and run it using perf stat -r 5 -B -e cache-references,cache-misses ./a.out it gives me:
Performance counter stats for './a.out' (5 runs):
715,463 cache-references ( +- 0.42% )
527,634 cache-misses # 73.747 % of all cache refs ( +- 2.53% )
0.112001160 seconds time elapsed ( +- 1.58% )
which is good enough for my purposes. However, if I go ahead and change the matrix size to 2000x2000, it gives:
Performance counter stats for './a.out' (5 runs):
6,364,995 cache-references ( +- 2.32% )
2,534,989 cache-misses # 39.827 % of all cache refs ( +- 0.02% )
0.461104903 seconds time elapsed ( +- 0.92% )
and if I increase it even further to 3000x3000, I get:
Performance counter stats for './a.out' (5 runs):
59,204,028 cache-references ( +- 1.36% )
5,662,629 cache-misses # 9.565 % of all cache refs ( +- 0.11% )
1.116573625 seconds time elapsed ( +- 0.32% )
which is strange, because I would expect the cache-miss rate to go up as the size increases. I need something that is as platform-independent as possible. My computer architecture class was a long time ago, so any insight would be welcome.
Notes
I said I need something relatively platform-independent, but still, these are my specs:
Beware of automatic prefetch in modern CPUs - it can often detect strided accesses. Perhaps try a random access pattern, e.g.:
#include <stdlib.h>

int main(void)
{
    int i;
    int n = 1000 * 1000;
    int *block = malloc(n * sizeof(int));

    /* Touch random elements so there is no stride for the prefetcher to lock onto. */
    for (i = 0; i < n / 10; i++) {
        int ri = rand() % n;
        block[ri] = 0;
    }
    return 0;
}
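If the random writes still don't produce enough misses for you (rand() overhead and the relatively small number of touches can hide them), another common way to defeat the prefetcher is pointer chasing through a random permutation, where every load depends on the result of the previous one. Here is a minimal sketch of that idea; the sizes, variable names, and iteration count are only illustrative:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i, cur;
    int n = 1000 * 1000;                /* ~4 MB working set, same order of magnitude as above */
    int *next = malloc(n * sizeof(int));

    if (next == NULL)
        return 1;

    /* Sattolo's algorithm: shuffle so that next[] forms one single cycle. */
    for (i = 0; i < n; i++)
        next[i] = i;
    for (i = n - 1; i > 0; i--) {
        int j = rand() % i;             /* j < i keeps the permutation a single cycle */
        int tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    /* Chase the cycle: each load depends on the previous result, so the
       prefetcher cannot run ahead, and once the array is larger than the
       last-level cache most of these loads should miss. */
    cur = 0;
    for (i = 0; i < n; i++)
        cur = next[cur];

    printf("%d\n", cur);                /* keep the result live */
    free(next);
    return 0;
}

Using Sattolo's shuffle (swapping only with an index strictly below i) guarantees the permutation is one big cycle, so the chase visits all n elements instead of possibly getting stuck in a short cycle that fits in cache.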