Discrepancy between R and Matlab speed [duplicate]

Possible Duplicate:
Why are loops slow in R?

Consider the following task. A dataset has 40 variables for 20,000 users, and each user has between 1 and 150 observations. All users are stacked in a matrix called data, whose first column is the id identifying the user. All ids are stored in a 20,000 x 1 matrix called userid.

Consider the following R code

useridl = length(userid)
itime = proc.time()[3]
for (i in 1:useridl) {
    temp = data[data[, 1] == userid[i], ]
}
etime = proc.time()[3]
etime - itime

This code just loops over the 20,000 users, creating the temp matrix on each iteration with the subset of observations belonging to userid[i]. It takes about 6 minutes on a Mac Pro.

In MATLAB, the same task

tic
for i = 1:useridl
    temp = data(data(:, 1) == userid(i), :);
end
toc

takes 1 minute.

Why is R so much slower? This is a standard task, and I am using matrices in both cases. Any ideas?

Hernan asked Nov 21 '25 01:11

1 Answer

As @joran commented, that's bad R practice. Instead of repeatedly subsetting your original matrix, just put the subsets in a list once and then iterate over the list with lapply or similar.

# make example data
set.seed(21)
userid <- 1:1e4
obs <- sample(150, length(userid), TRUE)
users <- rep(userid, obs)
Data <- cbind(users,matrix(rnorm(40*sum(obs)),sum(obs),40))

# reorder so Data isn't sorted by userid
Data <- Data[order(Data[,2]),]
# note that you have to call the data.frame method explicitly,
# the default method returns a vector
system.time(temp <- split.data.frame(Data, Data[,1])) ## Returns times in seconds
#    user  system elapsed 
#    2.84    0.08    2.92 
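Once the data is split, any per-user computation becomes a single pass over the list with lapply, instead of a fresh scan of the full matrix per user. A minimal sketch along the same lines (smaller sizes and colMeans chosen here purely for illustration):

```r
set.seed(21)
userid <- 1:100
obs <- sample(150, length(userid), TRUE)
users <- rep(userid, obs)
Data <- cbind(users, matrix(rnorm(40 * sum(obs)), sum(obs), 40))

# split once into one sub-matrix per user...
bys <- split.data.frame(Data, Data[, 1])

# ...then iterate over the list, e.g. per-user column means
# (one 41-element vector per user, named by id)
user_means <- lapply(bys, colMeans)
```

Each element of user_means is keyed by the user's id, so user_means[["1"]] holds the column means for user 1.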

My guess is that the garbage collector is slowing down your R code, since you're continually overwriting the temp object.
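To see the two approaches side by side, here is a sketch on a reduced dataset (sizes shrunk so it runs quickly): the loop compares the entire first column of Data on every iteration, while split.data.frame does the grouping in one pass, and both yield the same per-user sub-matrices.

```r
set.seed(21)
userid <- 1:500
obs <- sample(50, length(userid), TRUE)
Data <- cbind(rep(userid, obs), matrix(rnorm(5 * sum(obs)), sum(obs), 5))

# repeated subsetting: every iteration rescans all rows of Data
t_loop <- system.time({
  for (i in seq_along(userid)) {
    temp <- Data[Data[, 1] == userid[i], , drop = FALSE]
  }
})["elapsed"]

# splitting once: a single grouping pass over Data
t_split <- system.time({
  bys <- split.data.frame(Data, Data[, 1])
})["elapsed"]

# the list element for user 7 matches the loop's subset for user 7
identical(bys[["7"]], Data[Data[, 1] == 7, , drop = FALSE])
```

Note the drop = FALSE in the loop: split.data.frame always returns matrices, whereas plain subsetting would silently collapse a single-observation user to a vector.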

Joshua Ulrich answered Nov 23 '25 17:11