What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient outer-product-type code that maximally exploits data locality?

Consider a block of code for processing all pairs of elements of two arrays such as:

for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];

This is the standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1-D arrays, but the same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the i-th row of one matrix and the j-th column of the other.

For matrix multiplication, I know optimized BLAS implementations like OpenBLAS and MKL can get much higher performance than the double-loop-style code above, because they process the elements in chunks in such a way as to exploit the CPU caches much more. Unfortunately, the OpenBLAS kernels are written in assembly, so it's pretty difficult to figure out what's going on.

Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?

Since each element of out is only written once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since it doesn't exploit any temporal locality in Y (each Y[j] is reused on every outer iteration, but a full pass over Y happens between reuses).

I'm especially interested in the setting where M and N are large and the size of each element (X[i] and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product, or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M, where D is small).
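For concreteness, here is a minimal C sketch of the tall-and-skinny case I mean (the names A, B, and the function signature are just for illustration, not from any library). Each out[i*M + j] is a length-D dot product, i.e. the binary-op in the loop above; D = 1 recovers the plain outer product.

    /* Illustrative sketch only: A (N x D) and B (D x M) are row-major.
     * For D = 1 this is exactly the scalar outer product above. */
    void small_d_matmul(int N, int M, int D,
                        const double *A,   /* N x D */
                        const double *B,   /* D x M */
                        double *out)       /* N x M */
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                double acc = 0.0;
                for (int d = 0; d < D; d++)       /* length-D dot product */
                    acc += A[i*D + d] * B[d*M + j];
                out[i*M + j] = acc;
            }
    }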

asked Jul 05 '14 by Robert T. McGibbon

1 Answer

For large enough M, the Y vector will exceed the L1 cache size.* (For example, with 4-byte floats and a 32 KB L1 data cache, Y stops fitting once M exceeds about 8192 elements.) Thus on every new outer iteration, you'll be reloading Y from main memory (or at least from a slower cache level). In other words, you won't be exploiting temporal locality in Y.

You should block up your accesses to Y; something like this:

// CACHE_SIZE counts elements of Y (chosen so one block fits in L1), not bytes
for (int jj = 0; jj < M; jj += CACHE_SIZE) {             // Iterate over blocks of Y
    int jend = (jj + CACHE_SIZE < M) ? jj + CACHE_SIZE : M;  // Clamp the last block at M
    for (int i = 0; i < N; i++) {
        for (int j = jj; j < jend; j++) {                // Iterate within a block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}

The above doesn't do anything smart with the accesses to X, but a new X value is only needed once per CACHE_SIZE inner iterations, so the cost of those loads is probably negligible.
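If the accesses to X did start to matter (for instance, if each element were larger than O(1) bytes, as in the matrix case), the same idea extends to tiling both loops at once. The following 2-D tiling sketch is an editorial illustration of that standard technique, not part of the original answer; BI and BJ are placeholder block sizes to be tuned so that a BI-element slice of X and a BJ-element slice of Y fit in L1 together.

    /* Illustrative 2-D tiling; BI and BJ are hypothetical tuning knobs */
    for (int ii = 0; ii < N; ii += BI) {
        int iend = (ii + BI < N) ? ii + BI : N;      // Clamp the last X block
        for (int jj = 0; jj < M; jj += BJ) {
            int jend = (jj + BJ < M) ? jj + BJ : M;  // Clamp the last Y block
            for (int i = ii; i < iend; i++)
                for (int j = jj; j < jend; j++)
                    out[i*M + j] = X[i] * Y[j];
        }
    }

For the scalar case here the X slice is tiny anyway, so the 1-D blocking above is usually enough; 2-D tiling starts to pay off when both operands carry real volume, as in full matrix multiplication.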


* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).
answered Sep 20 '22 by Oliver Charlesworth