What is the fastest method in R for reading and writing a subset of columns from a very large matrix. I attempt a solution with data.table but need a fast way to extract a sequence of columns?
Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples.[for those applying to other problems be sure to read the disclaimers in the solutions!]. Scroll to the bottom of the question for some more lessons learned.
This is my first Stack Overflow question- I greatly appreciate your time in taking a look and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast but the sheer number of them requires it to be as fast as possible. The simplest version of the idea is a matrix of dimension K by V where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.
set.seed(94253) K <- 100 V <- 100000 mat <- matrix(runif(K*V),nrow=K,ncol=V)
we then end up performing a calculation on a subset of the columns and adding this into the full matrix. thus naively it looks like
Vsub <- sample(1:V, 20) toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub)) mat[,Vsub] <- mat[,Vsub] + toinsert library(microbenchmark) microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)
because this is done so many times it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below, modification can actually happen in place in some cricumstances).
For my problem the object need not be a matrix (and I'm sensitive to the difference as outlined here Assign a matrix to a subset of a data.table). I always want the full column and so the list structure of a data frame is fine. My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately getting the value is somewhat more complicated. We have to call the variables setting with=FALSE which dramatically slows things down.
library(data.table) DT <- as.data.table(mat) set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))
Within the set() function using i=NULL to reference all rows is incredibly fast but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, col number, value) and use data.tables binary search to speed retrieval. I tested this manually and while it is quick, it does approximately triple the memory requirements for the matrix. I would like to avoid this if possible.
Following the question here: Time in getting single elemets from data.table and data.frame objects. Hadley Wickham gave an incredibly fast solution for a single index
Vone <- Vsub[1] toinsert.one <- toinsert[,1] set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one))
however since the .subset2(DT,i) is just DT[[i]] without the methods dispatch there is no way (to my knowledge) to grab several columns at once although it certainly seems like it should be possible. As in the previous question, it seems like since we can overwrite the values quickly we should be able to read them quickly.
Any suggestions? Also please let me know if there is a better solution than data.table for this problem. I realized its not really the intended use case in many respects but I'm trying to avoid porting the whole series of operations to C.
Here are a sequence of timings of elements discussed- the first two are all columns, the second two are just one column.
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert, set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)), mat[,Vone] <- mat[,Vone] + toinsert.one, set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)), times=1000L) Unit: microseconds expr min lq median uq max neval Matrix 51.970 53.895 61.754 77.313 135.698 1000 Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826 1000 Matrix Single Col 8.021 9.304 10.427 19.570 55303.659 1000 Data.Table Single Col 6.737 7.700 9.304 11.549 89.824 1000
Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code which modify the matrix in place breaking R convention of not modifying the argument to a function but providing a much faster result.
Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This points to an interesting and subtle point. I was performing my evaluations in RStudio. RStudio as Hadley notes in his book creates an additional reference for every object not within a function. Thus while in the performance case of a function the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and addresses of memory.
Fun with Rcpp:
You can use Eigen's Map class to modify an R object in place.
library(RcppEigen) library(inline) incl <- ' using Eigen::Map; using Eigen::MatrixXd; using Eigen::VectorXi; typedef Map<MatrixXd> MapMatd; typedef Map<VectorXi> MapVeci; ' body <- ' MapMatd A(as<MapMatd>(AA)); const MapMatd B(as<MapMatd>(BB)); const MapVeci ix(as<MapVeci>(ind)); const int mB(B.cols()); for (int i = 0; i < mB; ++i) { A.col(ix.coeff(i)-1) += B.col(i); } ' funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"), body, "RcppEigen", incl) set.seed(94253) K <- 100 V <- 100000 mat2 <- mat <- matrix(runif(K*V),nrow=K,ncol=V) Vsub <- sample(1:V, 20) toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub)) mat[,Vsub] <- mat[,Vsub] + toinsert invisible(funRcpp(mat2, toinsert, Vsub)) all.equal(mat, mat2) #[1] TRUE library(microbenchmark) microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert, funRcpp(mat2, toinsert, Vsub)) # Unit: microseconds # expr min lq median uq max neval # mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400 100 # funRcpp(mat2, toinsert, Vsub) 6.450 6.805 7.6605 7.9215 25.914 100
I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.
I do the addition in C++, but it is trivial to change the function to only do assignment.
Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.
Here's what I had in mind. This could probably be much sexier with Rcpp and friends, but I'm not as familiar with those tools.
#include <R.h> #include <Rinternals.h> #include <Rdefines.h> SEXP addCol(SEXP mat, SEXP loc, SEXP matAdd) { int i, nr = nrows(mat), nc = ncols(matAdd), ll = length(loc); if(ll != nc) error("length(loc) must equal ncol(matAdd)"); if(TYPEOF(mat) != TYPEOF(matAdd)) error("mat and matAdd must be the same type"); if(nr != nrows(matAdd)) error("mat and matAdd must have the same number of rows"); if(TYPEOF(loc) != INTSXP) error("loc must be integer"); int *iloc = INTEGER(loc); switch(TYPEOF(mat)) { case REALSXP: for(i=0; i < ll; i++) memcpy(&(REAL(mat)[(iloc[i]-1)*nr]), &(REAL(matAdd)[i*nr]), nr*sizeof(double)); break; case INTSXP: for(i=0; i < ll; i++) memcpy(&(INTEGER(mat)[(iloc[i]-1)*nr]), &(INTEGER(matAdd)[i*nr]), nr*sizeof(int)); break; default: error("unsupported type"); } return R_NilValue; }
Put the above function in addCol.c
, then run R CMD SHLIB addCol.c. Then in R:
addColC <- dyn.load("addCol.so")$addCol .Call(addColC, mat, Vsub, mat[,Vsub]+toinsert)
The slight advantage to this approach over Roland's is that this only does the assignment. His function does the addition for you, which is faster, but also means you need a separate C/C++ function for every operation you need to do.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With