Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the higher-performance alternative to for-loops for subsetting data by group-id?

A recurring analysis paradigm I encounter in my research is the need to subset based on all different group id values, performing statistical analysis on each group in turn, and putting the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")  
groupids <- unique(data.mat$ID)  #Assume there are then 100 unique groups
  
results <- matrix(rep("NA",300),ncol=3,nrow=100)  

for(i in 1:100) {  
  tempmat <- subset(data.mat,ID==groupids[i])  

  # Run various stats on tempmat (correlations, regressions, etc), checking to  
  # make sure this specific group doesn't have NAs in the variables I'm using  
  # and assign results to x, y, and z, for example.  

  results[i,1] <- x  
  results[i,2] <- y  
  results[i,3] <- z  
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?

Edit:

Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain which surprised me. I have some code I used and output below using the same object names as above:

system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})  
   user  system elapsed  
 361.41   92.62  458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})  
   user  system elapsed   
 378.44  102.03  485.94

Update:

In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):

data.mat <- read.csv("//home....")  
groupids <- unique(data.mat$ID)
  
results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))  

for(i in 1:length(groupids)) {  
  tempmat <- data.mat[data.mat$ID==groupids[i],] 

  results[i,1] <- groupids[i]  
  results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")  
  results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")    

}  

I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in mid-afternoon. Figure 5-7 hours.

Restructuring my code to use data.table....

data.mat <- read.csv("//home....")  
data.mat <- data.table(data.mat)  
  
testfunc <- function(x,y,z) {  
  temp1 <- cor(x,y,use="pairwise.complete.obs")  
  temp2 <- cor(x,z,use="pairwise.complete.obs")  
  res <- list(temp1,temp2)  
  res  
}  

system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])  
 user  system  elapsed  
16.41    0.05    17.44  

Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to have given me the same answers(though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.

Update 2:

Running the code using subsets finally finished up again:

   user     system   elapsed  
17575.79  4247.41   23477.00

Update 3:

I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.

Using the same variables and setup as before...

data.mat <- read.csv("//home....")  
system.time(hmm <- ddply(data.mat,"ID",function(df)c(cor(df$Var1,df$Var2,  use="pairwise.complete.obs"),cor(df$Var1,df$Var3,use="pairwise.complete.obs"))))  
  user  system elapsed  
250.25    7.35  272.09  
like image 655
Adam Avatar asked Mar 27 '10 00:03

Adam


1 Answers

This is pretty much exactly what the plyr package is designed to make easier. However it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.

like image 97
hadley Avatar answered Sep 26 '22 02:09

hadley