Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why sampling matrix row is very slow?

I tried to do some bootstrapping and calculate colMeans, naturally I chose matrix to store data, however, it is very slow in sampling:

m[sample(n,replace=TRUE),]

It turns out data.table is the fastest.

require(microbenchmark)
require(data.table)
n = 2000
nc = 8000
m = matrix(1:(n*nc) ,nrow = n)
DF = as.data.frame(m)
DT = as.data.table(m)

s=sample(n, replace=TRUE)
microbenchmark(m[s,], DF[s,],DT[s,])

# Unit: milliseconds
    # expr      min       lq     mean   median       uq      max neval
  # m[s, ] 371.9271 402.3542 421.7907 420.8446 437.8251 506.1788   100
 # DF[s, ] 182.3189 199.0865 218.0746 213.9451 231.1518 409.8625   100
 # DT[s, ] 129.8225 139.1977 156.9506 150.4321 164.3104 254.2048   100

Why is sampling matrix much slower than the other two?

like image 802
user3226167 Avatar asked Mar 06 '17 05:03

user3226167


1 Answers

Two possibilities spring to mind on first glance, both in R's function MatrixSubset on line 265.

It might be neither of these. Just guessing.

1. It appears to loop in a cache inefficient direction.

for (i = 0; i < nrs; i++) {    // rows
  ...
  for (j = 0; j < ncs; j++) {  // columns
    ...

Your example has a lot of columns (8,000). Each time the inner loop fetches a new column, it needs to fetch the page of RAM holding that value from RAM into cache (most likely L2). The next fetch is a different column and so it's less likely to be able to reuse a page already in L2. A matrix is internally one huge contiguous vector: all of column 1 followed by all of column 2, etc. A page fetch is relatively expensive. Going in the "wrong" direction incurs too many page fetches. More about CPU cache here.

A good compiler should perform Loop interchange automatically such as gcc -floop-interchange which is on by default. More here. This optimization might not be happening in this case due to the complexity of what's inside the for loops; perhaps in this case the switch statements. Or perhaps the version of R you're using on your OS wasn't compiled with a compiler with that option, or wasn't turned on.

2. The switch() is too deep

The switch on type is happening on each and every item in the matrix. Even though a matrix is a single type! So this is wasteful. Even if the switch is being optimized with a jump table that jump table is probably still happening for every item in the matrix ('probably' because the CPU might predict the switch). Since your example matrix is tiny at 61MB, I'm leaning more towards this being the culprit rather than going in the wrong direction.

Proposed fix for both above (untested)

// Check the row numbers once up front rather than 8,000 times.
// This is a contiguous sweep and therefore almost instant
// Declare variables i and ii locally for safety and maximum compiler optimizations
for (int i = 0; i < nrs; i++) {
  int ii = INTEGER(sr)[i];
  if (ii != NA_INTEGER && (ii < 1 || ii > nr))
    errorcall(call, R_MSG_subs_o_b);
}

// Check the column numbers up front once rather than 2,000 times
for (int j = 0; j < ncs; j++) {
  int jj = INTEGER(sc)[j];
  if (jj != NA_INTEGER && (jj < 1 || jj > nc))
    errorcall(call, R_MSG_subs_o_b);
}

// Now switch once on type rather than 8,000 * 2,000 times
// Loop column-by-column not row-by-row

int resi=0;  // contiguous write to result (for page efficiency)
int ii, jj;  // the current row and column, bounds checked above
switch (TYPEOF(x)) {
  case LGLSXP:  // the INTSXP will work for LGLSXP too, currently
  case INTSXP:
    for (int j=0; j<ncs; j++) {  // column-by-column
      jj = INTEGER(sc)[j];
      for (int i=0; i<nrs; i++) {  // within-this-column
        ii = INTEGER(sr)[i];
        INTEGER(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_INTEGER : INTEGER(x)[ii + jj * nr];
      }
    }
    break;
  case REALSXP:
    for (int j=0; j<ncs; j++) {
      jj = INTEGER(sc)[j];
      for (int i=0; i<nrs; i++) {
        ii = INTEGER(sr)[i];
        REAL(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_REAL : REAL(x)[ii + jj * nr];
      }
    }
    break;
  case ...

As you can see, there is more code this way because the same for loops have to be repeated over and over within the switch() cases. Code readability and robustness reasons may be why the original code is the way it is: there is less chance of a typo in R's implementation. That's already demonstrated because I was lazy in not implementing the LGLSXP case specially for LOGICAL. I know LOGICAL is exactly the same as INTEGER currently in base R. But that might change in future so my laziness (due to code bloat) could well cause a bug in R in future if LOGICAL does change (to say char rather than int for RAM efficiency).

One possible option to solve the code bloat problem, notice that all that's really happening is moving memory around. So all the types (other than STRSXP, VECSXP and EXPRSXP) can be done with a single double-for-loop using memcpy with the type's size. SET_STRING_ELT and SET_VECTOR_ELT still must be used to maintain reference counts on those objects. So that should be just 3 repetitions of the double for loops to maintain. Alternatively, that idiom can be wrapped into a #define which is done in other parts of R.

Finally, whether there are any NAs in the row or columns passed in (a very common case to not request the NA'th row or NA'th column!) can be detected in the first bounds checking loop. If there are no NAs then the deepest ternary ((ii == NA_INTEGER || jj == NA_INTEGER) ? :) (2000 * 8000 calls to that branch) can be saved by raising that branch outside. But with the cost of more complex repeated code. However, perhaps branch prediction would kick in reliably on all architectures and we shouldn't worry about that.

data.table does both the memcpy trick and deep branch saving in some but not all places. It has also started to subset in parallel, column by column. But not in this case yet just because it's new and still being rolled out (setkey is very similar and is already parallel). The master thread handles the character and list columns one by one though (not in parallel) since SET_STRING_ELT and SET_VECTOR_ELT are not thread-safe in R. The other threads handle all the integer, real, complex and raw columns in parallel. It then goes as fast as memory io can go.

I don't really see the difference that you see on 61MB but scaling up to (still small) 610MB by increasing number of columns 10x to 80,000 I do see a difference.

n = 2000
nc = 8000    # same size as your example (61MB), on my laptop
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
    expr       min        lq      mean    median        uq      max neval
  m[s, ] 108.75182 112.11678 118.60111 114.58090 120.07952 168.6079   100
 DF[s, ] 100.95019 105.88253 116.04507 110.84693 118.08092 163.9666   100
 DT[s, ]  63.78959  69.07341  80.72039  72.69873  96.51802 136.2016   100

n = 2000
nc = 80000     # 10x bigger (610MB)
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
    expr       min        lq      mean    median        uq      max neval
  m[s, ] 1990.3343 2010.1759 2055.9847 2032.9506 2057.2498 2733.278   100
 DF[s, ] 1083.0373 1212.6633 1265.5346 1234.1558 1300.7502 2105.177   100
 DT[s, ]  698.1295  830.3428  865.5918  862.5773  907.7225 1053.393   100

I have 128MB of L4 cache, though. I guess you have less cache. The entire 61MB fits in my L4 cache so I don't really notice the cache inefficiency at that size.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 70
Model name:            Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Stepping:              1
CPU MHz:               3345.343
CPU max MHz:           4000.0000
CPU min MHz:           800.0000
BogoMIPS:              5587.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
L4 cache:              131072K
NUMA node0 CPU(s):     0-7
like image 104
Matt Dowle Avatar answered Nov 02 '22 14:11

Matt Dowle