Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

replacing for loops in a function with vector calculations to speed up R

Tags:

r

Say I have some data in data frame d1, that describes how frequently different sample individuals eat different foods, and a final column describing whether or not those foods are cool to eat. The data are structured like this.

OTU.ID<- c('pizza','taco','pizza.taco','dirt')
s1<-c(5,20,14,70)
s2<-c(99,2,29,5)
s3<-c(44,44,33,22)
cool<-c(1,1,1,0)

d1<-data.frame(OTU.ID,s1,s2,s3,cool)
print(d1)
      OTU.ID s1 s2 s3 cool
1      pizza  5 99 44    1
2       taco 20  2 44    1
3 pizza.taco 14 29 33    1
4       dirt 70  5 22    0

I have written a function that, for each sample, s1:s3, the number of cool foods that were consumed, and the total number of foods that were consumed. It runs as a for loop on each line of the data table (which is extremely slow).

cool.food.abundance<- function(food.table){
samps<-colnames(food.table)
#remove column names that are not sample names
samps<-samps[!samps %in% c("OTU.ID","cool")]

#create output vectors for for loop
    id<-c()
    cool.foods<-c()
    all.foods<-c()
    #run a loop that stores output ids and results as vectors
    for(i in 1:length(samps)){
        x<- samps[i]
        y1<-sum(food.table[samps[i]]*food.table$cool)
        y2<-sum(food.table[samps[i]])
        id<-c(id,x)
        cool.foods<-c(cool.foods,y1)
        all.foods<-c(all.foods,y2)
    }
    #save results as a data frame and return the data frame object
    results<-data.frame(id,cool.foods,all.foods)
    return(results)
}

So, if you run this function, you will get a new table of sample IDs, the number of cool foods that sample ate, and the total number of foods that sample ate.

cool.food.abundance(d1)
  id cool.foods all.foods
1 s1         39       109
2 s2        130       135
3 s3        121       143

How can I replace this for-loop with vector calculations to speed this up? I would really like to be able for the function to operate on dataframes loaded with the fread function in the data.table package.

like image 214
colin Avatar asked Feb 09 '23 16:02

colin


2 Answers

You can try

library(data.table)#v1.9.5+
dcast(melt(setDT(d1), id.var=c('OTU.ID', 'cool'))[,
         sum(value) ,.(cool, variable)], variable~c('notcool.foods',
       'cool.foods')[cool+1L], value.var='V1')[,
    all.foods:= cool.foods+notcool.foods][, notcool.foods:=NULL]
#      variable cool.foods all.foods
#1:       s1         39       109
#2:       s2        130       135
#3:       s3        121       143

Or instead of using dcast we can summarise the result (as in @jeremycg's post) as there are only two groups

 melt(setDT(d1), id.var=c('OTU.ID', 'cool'), variable.name='id')[,
     list(all.foods=sum(value), cool.foods=sum(value[cool==1])) , id]
 #   id all.foods cool.foods
 #1: s1       109         39
 #2: s2       135        130
 #3: s3       143        121

Or you can use base R

nm1 <- paste0('s', 1:3)
res <- t(addmargins(rowsum(as.matrix(d1[nm1]), group=d1$cool),1)[-1,])

colnames(res) <- c('cool.foods', 'all.foods')
res
 #   cool.foods all.foods
 #s1         39       109
 #s2        130       135
 #s3        121       143
like image 92
akrun Avatar answered Feb 12 '23 05:02

akrun


Here's how I would do it, with reshape2 and dplyr:

library(reshape2)
library(dplyr)
d1 <- melt(d1, id = c("OTU.ID", "cool"))
d1 %>% group_by(variable) %>% 
       summarise(all.foods = sum(value), cool.foods = sum(value[cool == 1]))
like image 38
jeremycg Avatar answered Feb 12 '23 06:02

jeremycg