What access patterns are most efficient for writing cache-efficient outer-product-style code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1-d, but the same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the i-th row of one and the j-th column of the other.
For matrix multiplication, I know optimized BLAS implementations like OpenBLAS and MKL can get much higher performance than you get from the double-loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly, so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large and the size of each element (X[i] and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product, or the multiplication of a tall, skinny matrix by a short, fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size. Thus on every new outer iteration, you'll be reloading Y from main memory (or at least from a slower cache level). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this (note the inner bound is clamped to M so the last block doesn't run past the end of the array when M isn't a multiple of CACHE_SIZE):

for (int jj = 0; jj < M; jj += CACHE_SIZE) {          // iterate over blocks of Y
    int jmax = (jj + CACHE_SIZE < M) ? jj + CACHE_SIZE : M;
    for (int i = 0; i < N; i++) {
        for (int j = jj; j < jmax; j++) {             // iterate within a block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}

Here CACHE_SIZE is a block size measured in elements, chosen so a block of Y fits comfortably in L1.
The above doesn't do anything smart with accesses to X, but new values of X are only being loaded 1/CACHE_SIZE as often, so the impact is probably negligible.