
How to speed up a for loop in R for a nested matrix matching and colSums

I have an apparently simple problem for which I need a faster R implementation than the one I developed.

I set the random seed and the dimensions for this example:

set.seed(1)
d1<-400
d2<-20000
d3<-50

I have a matrix X with dimensions d1 x d2, stored as a data frame:

X<-as.data.frame(matrix(rnorm(d1*d2),nrow=d1,ncol=d2))
rownames(X)<-paste0("row",1:nrow(X))
colnames(X)<-paste0("col",1:ncol(X))

And a vector u of d1 row names, sampled with replacement from the row names of X:

u<-sample(rownames(X),nrow(X),replace=TRUE)

I also have a matrix C with named rows and dimensions d3 x d2:

C<-matrix(rnorm(d3*d2),nrow=d3,ncol=d2)
rownames(C)<-sample(rownames(X),nrow(C),replace=FALSE)

Now, with the following very slow loop, I fill each row of C with the column sums of the X rows whose entry in u matches that row's name:

system.time(
    for(i in 1:nrow(C)){
        indexes<-which(u==rownames(C)[i])
        C[i,] <- colSums(X[indexes,])
    }
)

This operation takes approximately 11.5 seconds on my PC, but I am sure it could be sped up by avoiding the for loop. Any ideas? Thanks a lot!

Federico Giorgi asked Jul 04 '19 15:07


1 Answer

Just use matrixStats::colSums2, which accepts the row indexes directly through its rows argument, and move the rownames() call outside the loop. X needs to be converted to a matrix first:

Xm <- as.matrix(X)
names_of_rows <- rownames(C)
system.time(for (i in 1:nrow(C)) {
  indexes <- which(u == names_of_rows[i])
  C[i, ] <-  matrixStats::colSums2(Xm, rows = indexes)
})
# 0.03 sec
minem answered Sep 19 '22 22:09

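A loop-free variant may also be worth trying. This is not part of the answer above and has not been timed in this thread; it is only a sketch. Base R's rowsum() sums the rows of a matrix by group, so grouping the rows of X by u gives every per-name sum in one call, and the rows needed for C can then be picked out by name. The example uses smaller dimensions than the question so that it runs quickly, and the names Xm, C_loop, C_vec, sums and hit are illustrative. Names in rownames(C) that never occur in u are assumed to get a row of zeros, which is what colSums() returns for an empty selection.

```r
set.seed(1)
d1 <- 40; d2 <- 200; d3 <- 5   # small sizes so the sketch runs quickly

# Same setup as the question, but X is kept as a plain matrix
Xm <- matrix(rnorm(d1 * d2), nrow = d1, ncol = d2,
             dimnames = list(paste0("row", 1:d1), paste0("col", 1:d2)))
u <- sample(rownames(Xm), d1, replace = TRUE)
C <- matrix(rnorm(d3 * d2), nrow = d3, ncol = d2)
rownames(C) <- sample(rownames(Xm), d3, replace = FALSE)

# Reference result: the loop from the question, run on the matrix
C_loop <- C
for (i in seq_len(nrow(C))) {
  indexes <- which(u == rownames(C)[i])
  C_loop[i, ] <- colSums(Xm[indexes, , drop = FALSE])
}

# Loop-free version: one grouped sum, then a lookup by row name
sums <- rowsum(Xm, group = u)            # one row per distinct value in u
C_vec <- matrix(0, nrow = nrow(C), ncol = ncol(C), dimnames = dimnames(C))
hit <- rownames(C) %in% rownames(sums)   # names that occur at least once in u
C_vec[hit, ] <- sums[rownames(C)[hit], , drop = FALSE]

# Both approaches should agree up to floating-point tolerance
print(all.equal(C_loop, C_vec))
```

Note that rowsum() computes sums for every distinct value in u, not only for the d3 names in C, so whether it beats the colSums2 loop on the full-size data would need to be measured.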