lapply and do.call running very slowly?

I have a data frame of some 35,000 rows by 8 columns. It looks like this:

head(nuc)

  chr feature    start      end   gene_id    pctAT    pctGC length
1   1     CDS 67000042 67000051 NM_032291 0.600000 0.400000     10
2   1     CDS 67091530 67091593 NM_032291 0.609375 0.390625     64
3   1     CDS 67098753 67098777 NM_032291 0.600000 0.400000     25
4   1     CDS 67101627 67101698 NM_032291 0.472222 0.527778     72
5   1     CDS 67105460 67105516 NM_032291 0.631579 0.368421     57
6   1     CDS 67108493 67108547 NM_032291 0.436364 0.563636     55

gene_id is a factor with about 3,500 unique levels. For each level of gene_id, I want to get min(start), max(end), mean(pctAT), mean(pctGC), and sum(length).

I tried using lapply and do.call for this, but it's taking forever (30+ minutes) to run. The code I'm using is:

nuc_prof = lapply(levels(nuc$gene_id), function(gene){
  # rows belonging to this gene
  t = nuc[nuc$gene_id==gene, ]
  return(list(gene_id=gene, start=min(t$start), end=max(t$end),
              pctGC=mean(t$pctGC), pctAT=mean(t$pctAT), cdslength=sum(t$length)))
})
# stack the per-gene lists into one object
nuc_prof = do.call(rbind, nuc_prof)

I'm certain I'm doing something wrong to slow this down. I haven't waited for it to finish as I'm sure it can be faster. Any ideas?

Davy Kavanagh asked Jun 15 '12 15:06


1 Answer

Since I'm in an evangelizing mood ... here's what the fast data.table solution would look like:

library(data.table)
dt <- data.table(nuc, key="gene_id")

dt[,list(A=min(start),
         B=max(end),
         C=mean(pctAT),
         D=mean(pctGC),
         E=sum(length)), by=key(dt)]
#      gene_id        A        B         C         D   E
# 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283
# 2:       ZZZ 67000042 67108547 0.5582567 0.4417433 283
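
If data.table isn't an option, the same per-gene summary can be sketched in base R with tapply(). To be clear, this is a swapped-in technique rather than the data.table approach above, and I haven't timed it against the real 35,000-row data. The toy data is the six rows from the question, duplicated under a made-up second gene_id ("ZZZ") to mirror the sample output above; the helper names (start6, end6, g, and so on) are purely illustrative.

```r
# Six rows copied from head(nuc) in the question.
start6  <- c(67000042, 67091530, 67098753, 67101627, 67105460, 67108493)
end6    <- c(67000051, 67091593, 67098777, 67101698, 67105516, 67108547)
pctAT6  <- c(0.600000, 0.609375, 0.600000, 0.472222, 0.631579, 0.436364)
pctGC6  <- c(0.400000, 0.390625, 0.400000, 0.527778, 0.368421, 0.563636)
length6 <- c(10, 64, 25, 72, 57, 55)

# Assumption: repeat the rows under a hypothetical second gene ("ZZZ")
# so there is more than one group to summarise.
nuc <- data.frame(
  chr     = 1,
  feature = "CDS",
  start   = rep(start6, 2),
  end     = rep(end6, 2),
  gene_id = factor(rep(c("NM_032291", "ZZZ"), each = 6)),
  pctAT   = rep(pctAT6, 2),
  pctGC   = rep(pctGC6, 2),
  length  = rep(length6, 2)
)

# One tapply() call per summary column. Each call returns one value per
# factor level, in the order of levels(g), so the columns line up.
g <- nuc$gene_id
nuc_prof <- data.frame(
  gene_id   = levels(g),
  start     = as.vector(tapply(nuc$start,  g, min)),
  end       = as.vector(tapply(nuc$end,    g, max)),
  pctAT     = as.vector(tapply(nuc$pctAT,  g, mean)),
  pctGC     = as.vector(tapply(nuc$pctGC,  g, mean)),
  cdslength = as.vector(tapply(nuc$length, g, sum))
)
print(nuc_prof)
```

On this toy input it gives one row per gene_id, with the same values as the data.table output above. Each tapply() call makes a single pass over one column instead of subsetting the whole data frame once per gene.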
Josh O'Brien answered Oct 17 '22 02:10