Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)

TL;DR

What is the fastest method in R for reading and writing a subset of columns from a very large matrix. I attempt a solution with data.table but need a fast way to extract a sequence of columns?

Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples.[for those applying to other problems be sure to read the disclaimers in the solutions!]. Scroll to the bottom of the question for some more lessons learned.


This is my first Stack Overflow question- I greatly appreciate your time in taking a look and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast but the sheer number of them requires it to be as fast as possible. The simplest version of the idea is a matrix of dimension K by V where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.

set.seed(94253) K <- 100 V <- 100000 mat <-  matrix(runif(K*V),nrow=K,ncol=V) 

we then end up performing a calculation on a subset of the columns and adding this into the full matrix. thus naively it looks like

Vsub <- sample(1:V, 20) toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub)) mat[,Vsub] <- mat[,Vsub] + toinsert library(microbenchmark) microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert) 

because this is done so many times it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below, modification can actually happen in place in some cricumstances).

For my problem the object need not be a matrix (and I'm sensitive to the difference as outlined here Assign a matrix to a subset of a data.table). I always want the full column and so the list structure of a data frame is fine. My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately getting the value is somewhat more complicated. We have to call the variables setting with=FALSE which dramatically slows things down.

library(data.table) DT <- as.data.table(mat)   set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)) 

Within the set() function using i=NULL to reference all rows is incredibly fast but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, col number, value) and use data.tables binary search to speed retrieval. I tested this manually and while it is quick, it does approximately triple the memory requirements for the matrix. I would like to avoid this if possible.

Following the question here: Time in getting single elemets from data.table and data.frame objects. Hadley Wickham gave an incredibly fast solution for a single index

Vone <- Vsub[1] toinsert.one <- toinsert[,1] set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)) 

however since the .subset2(DT,i) is just DT[[i]] without the methods dispatch there is no way (to my knowledge) to grab several columns at once although it certainly seems like it should be possible. As in the previous question, it seems like since we can overwrite the values quickly we should be able to read them quickly.

Any suggestions? Also please let me know if there is a better solution than data.table for this problem. I realized its not really the intended use case in many respects but I'm trying to avoid porting the whole series of operations to C.

Here are a sequence of timings of elements discussed- the first two are all columns, the second two are just one column.

microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,               set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),               mat[,Vone] <- mat[,Vone] + toinsert.one,               set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),               times=1000L)  Unit: microseconds                   expr      min       lq   median       uq       max neval                 Matrix   51.970   53.895   61.754   77.313   135.698  1000             Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000      Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000  Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000 

Answer and Lessons Learned:

Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code which modify the matrix in place breaking R convention of not modifying the argument to a function but providing a much faster result.

Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This points to an interesting and subtle point. I was performing my evaluations in RStudio. RStudio as Hadley notes in his book creates an additional reference for every object not within a function. Thus while in the performance case of a function the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and addresses of memory.

like image 444
bstewart Avatar asked Nov 21 '13 16:11

bstewart


2 Answers

Fun with Rcpp:

You can use Eigen's Map class to modify an R object in place.

library(RcppEigen) library(inline)  incl <- ' using  Eigen::Map; using  Eigen::MatrixXd; using  Eigen::VectorXi; typedef  Map<MatrixXd>  MapMatd; typedef  Map<VectorXi>  MapVeci; '  body <- ' MapMatd              A(as<MapMatd>(AA)); const MapMatd        B(as<MapMatd>(BB)); const MapVeci        ix(as<MapVeci>(ind)); const int            mB(B.cols()); for (int i = 0; i < mB; ++i)  { A.col(ix.coeff(i)-1) += B.col(i); } '  funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"),                         body, "RcppEigen", incl)  set.seed(94253) K <- 100 V <- 100000 mat2 <-  mat <-  matrix(runif(K*V),nrow=K,ncol=V)  Vsub <- sample(1:V, 20) toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub)) mat[,Vsub] <- mat[,Vsub] + toinsert  invisible(funRcpp(mat2, toinsert, Vsub)) all.equal(mat, mat2) #[1] TRUE  library(microbenchmark) microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,                funRcpp(mat2, toinsert, Vsub)) # Unit: microseconds #                                  expr    min     lq  median      uq       max neval # mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100 #         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100 

I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.

I do the addition in C++, but it is trivial to change the function to only do assignment.

Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.

like image 74
Roland Avatar answered Oct 09 '22 07:10

Roland


Here's what I had in mind. This could probably be much sexier with Rcpp and friends, but I'm not as familiar with those tools.

#include <R.h> #include <Rinternals.h> #include <Rdefines.h> SEXP addCol(SEXP mat, SEXP loc, SEXP matAdd) {   int i, nr = nrows(mat), nc = ncols(matAdd), ll = length(loc);   if(ll != nc)     error("length(loc) must equal ncol(matAdd)");   if(TYPEOF(mat) != TYPEOF(matAdd))     error("mat and matAdd must be the same type");   if(nr != nrows(matAdd))     error("mat and matAdd must have the same number of rows");   if(TYPEOF(loc) != INTSXP)     error("loc must be integer");   int *iloc = INTEGER(loc);    switch(TYPEOF(mat)) {     case REALSXP:       for(i=0; i < ll; i++)         memcpy(&(REAL(mat)[(iloc[i]-1)*nr]),                &(REAL(matAdd)[i*nr]), nr*sizeof(double));       break;     case INTSXP:       for(i=0; i < ll; i++)         memcpy(&(INTEGER(mat)[(iloc[i]-1)*nr]),                &(INTEGER(matAdd)[i*nr]), nr*sizeof(int));       break;     default:       error("unsupported type");   }   return R_NilValue; } 

Put the above function in addCol.c, then run R CMD SHLIB addCol.c. Then in R:

addColC <- dyn.load("addCol.so")$addCol .Call(addColC, mat, Vsub, mat[,Vsub]+toinsert) 

The slight advantage to this approach over Roland's is that this only does the assignment. His function does the addition for you, which is faster, but also means you need a separate C/C++ function for every operation you need to do.

like image 27
Joshua Ulrich Avatar answered Oct 09 '22 07:10

Joshua Ulrich