Consider the following functions, which fill a matrix row-wise and column-wise.
#include <Rcpp.h>
using namespace Rcpp;

const int m = 10000;
const int n = 3;

// [[Rcpp::export]]
SEXP rowWise() {
  SEXP A = Rf_allocMatrix(INTSXP, m, n);
  int* p = INTEGER(A);
  int i, j;
  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      p[m * j + i] = j;
    }
  }
  return A;
}

// [[Rcpp::export]]
SEXP columnWise() {
  SEXP A = Rf_allocMatrix(INTSXP, n, m);
  int* p = INTEGER(A);
  int i, j;
  for (j = 0; j < m; j++) {
    for (i = 0; i < n; i++) {
      p[n * j + i] = i;
    }
  }
  return A;
}
/*** R
library(microbenchmark)
gc()
microbenchmark(
  rowWise(),
  columnWise(),
  times = 1000
)
*/
The above code yields
Unit: microseconds
expr min lq mean median uq max neval
rowWise() 12.524 18.631 64.24991 20.4540 24.8385 10894.353 1000
columnWise() 11.803 19.434 40.08047 20.9005 24.1585 8590.663 1000
Assigning values row-wise is faster (or at least no slower) than assigning them column-wise, which is counter-intuitive to me. However, the result depends, mysteriously, on the values of m and n. So I guess my question is: why is columnWise not much faster than rowWise?
The dimension (shape) of the matrix has an impact.
When we do a row-wise scan of a 10000 x 3 integer matrix A, we can still cache effectively. For simplicity of illustration, I assume that each column of A is aligned to a cache line boundary.
--------------------------------------
A[1, 1] A[1, 2] A[1, 3] M M M
A[2, 1] A[2, 2] A[2, 3] H H H
. . . . . .
. . . . . .
A[16,1] A[16,2] A[16,3] H H H
--------------------------------------
A[17,1] A[17,2] A[17,3] M M M
A[18,1] A[18,2] A[18,3] H H H
. . . . . .
. . . . . .
A[32,1] A[32,2] A[32,3] H H H
--------------------------------------
A[33,1] A[33,2] A[33,3] M M M
A[34,1] A[34,2] A[34,3] H H H
. . . . . .
. . . . . .
A 64-byte cache line can hold 16 four-byte integers. When we access A[1, 1], a full cache line is filled; that is, A[1, 1] through A[16, 1] are all loaded into cache. When we scan the row A[1, 1], A[1, 2], A[1, 3], a 16 x 3 block of the matrix is loaded into cache, which is much smaller than the cache capacity (32 KB). While we have a cache miss (M) for each element in the 1st row, by the time we start to scan the 2nd row, every element is a cache hit (H). So we have a periodic pattern:
[3 Misses] -> [45 Hits] -> [3 Misses] -> [45 Hits] -> ...
That is, we have on average a cache miss ratio of 3 / 48 = 1 / 16 = 6.25%. In fact, this equals the cache miss ratio of a column-wise scan of A, where we have the following periodic pattern:
[1 Miss] -> [15 Hits] -> [1 Miss] -> [15 Hits] -> ...
Now try a 5000 x 5000 matrix instead. In that case, after reading the first row, 16 x 5000 elements have been fetched into cache, but that far exceeds the cache capacity, so eviction has already kicked out the lines holding A[1, 1] through A[16, 1] (most caches use a "least recently used" (LRU) line replacement policy). When we come back to scan the 2nd row, we have to fetch A[2, 1] from RAM again. So a row-wise scan gives a cache miss ratio of 100%. In contrast, a column-wise scan has a cache miss ratio of only 1 / 16 = 6.25%. In this example, we will observe that the column-wise scan is much faster.
In summary, with a 10000 x 3 matrix, we have the same cache performance whether we scan it by row or by column. I don't see that rowWise is faster than columnWise from the median times reported by microbenchmark. Their execution times may not be exactly equal, but the difference is too small to be of concern. For a 5000 x 5000 matrix, rowWise is much slower than columnWise.
I leave the verification to you.
The "golden rule" that we should ensure sequential memory access in the innermost loop is a general guideline for efficiency, but don't interpret it too narrowly.
In fact, if you treat the three columns of A as three vectors x, y, z and consider the element-wise addition (i.e., the row-wise sum of A): z[i] = x[i] + y[i], are we not making sequential access to all three vectors? Doesn't this fall under the "golden rule"? Scanning a 10000 x 3 matrix by row is no different from alternately reading three vectors sequentially, and this is very efficient.