Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++: Improving cache performance in a 3d array

Tags:

c++

caching

I don't know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That's not something you can learn overnight. Considering my program will run on many different systems and architectures, I don't think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.

Here is a description of my problem:

I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it's like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that "touch" it in space. I'm not worried about the edges and corners of the cube for now.

Here is what I mean in C++ code:

void process3DArray (int input[LENGTH][LENGTH][LENGTH], 
                     int output[LENGTH][LENGTH][LENGTH])
{
    for(int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            //The for loops start at 1 and stop before LENGTH-1
            //or other-wise I'll get out-of-bounds errors
            //I'm not concerned with the edges and corners of the 
            //3d array "cube" at the moment.
            {
                int value = input[i][j][k];

                //I am expecting crazy cache misses here:
                int posX = input[i+1] [j]   [k];
                int negX = input[i-1] [j]   [k];
                int posY = input[i]   [j+1] [k];
                int negY = input[i]   [j-1] [k];
                int posZ = input[i]   [j]   [k+1];
                int negZ = input[i]   [j]   [k-1];

                output [i][j][k] = 
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}

However, it seems like if LENGTH is large enough, I'll get tons of cache misses when I'm fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?

And if you have the time to answer these extra questions, do I have to consider the value of LENGTH? Like it's different whether LENGTH is 20 vs 100 vs 10000. Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?

@ ildjarn:

Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.

int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for(int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}
like image 364
newprogrammer Avatar asked Mar 30 '12 23:03

newprogrammer


2 Answers

There's an answer to a similar question here: https://stackoverflow.com/a/7735362/6210 (by me!)

The main goal of optimizing a multi-dimensional array traversal is to make sure you visit the array such that you tend to reuse the cache lines accessed from the previous iteration step. For visiting each element of an array once and only once, you can do this just by visiting in memory order (as you are doing in your loop).

Since you are doing something more complicated than a simple element traversal (visiting an element plus 6 neighbors), you need to break up your traversal such that you don't access too many cache lines at once. Since the cache thrashing is dominated by traversing along j and k, you just need to modify the traversal such that you visit blocks at a time rather than rows at a time.

E.g.:

const int CACHE_LINE_STEP= 8;

void process3DArray (int input[LENGTH][LENGTH][LENGTH], 
                     int output[LENGTH][LENGTH][LENGTH])
{
    for(int i = 1; i < LENGTH-1; i++)
        for (int k_start = 1, k_next= CACHE_LINE_STEP; k_start < LENGTH-1; k_start= k_next; k_next+= CACHE_LINE_STEP)
        {
            int k_end= min(k_next, LENGTH - 1);

            for (int j = 1; j < LENGTH-1; j++)
                //The for loops start at 1 and stop before LENGTH-1
                //or other-wise I'll get out-of-bounds errors
                //I'm not concerned with the edges and corners of the 
                //3d array "cube" at the moment.
            {
                for (int k= k_start; k<k_end; ++k)
                {
                    int value = input[i][j][k];

                    //I am expecting crazy cache misses here:
                    int posX = input[i+1] [j]   [k];
                    int negX = input[i-1] [j]   [k];
                    int posY = input[i]   [j+1] [k];
                    int negY = input[i]   [j-1] [k];
                    int posZ = input[i]   [j]   [k+1];
                    int negZ = input[i]   [j]   [k-1];

                    output [i][j][k] = 
                        process(value, posX, negX, posY, negY, posZ, negZ);
                }
            }
        }
}

What this does in ensure that you don't thrash the cache by visiting the grid in a block oriented fashion (actually, more like a fat column oriented fashion bounded by the cache line size). It's not perfect as there are overlaps that cross cache lines between columns, but you can tweak it to make it better.

like image 100
MSN Avatar answered Nov 03 '22 12:11

MSN


The most important thing you already have right. If you were using Fortran, you'd be doing it exactly wrong, but that's another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from i,j; a similar number from i+/-1, j and from i,j+/-1. So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be 1 or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.

I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.

Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.

input[i+1] [j]   [k];

is asking the compiler to generate 3 adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.

*inputIplusOneJK++ 

is asking for one add and a memory reference.

like image 22
DRVic Avatar answered Nov 03 '22 13:11

DRVic