
Can I make this dplyr + data.table task faster?

I guess this is more of a dplyr than plyr question. For the sake of speed I am using data.table in some code I have written. During an intermediate step I have a table with some genomics data with ~32,000 rows:

> bedbin.dt
Source: local data table [32,138 x 4]
Groups: chr

   bin   start           site chr
1    2 3500000         ssCTCF   1
2    3 4000000 ssCTCF+Cohesin   1
3    3 4000000         ssCTCF   1
4    4 4500000         ucCTCF   1
5    4 4500000 ssCTCF+Cohesin   1
6    4 4500000 ssCTCF+Cohesin   1
7    4 4500000 ssCTCF+Cohesin   1
8    4 4500000         ssCTCF   1
9    4 4500000         ssCTCF   1
10   5 5000000         ssCTCF   1
.. ...     ...            ... ...

EDIT

Or the first hundred lines of data like so (thx to Ricardo Saporta for instructions)

bedbin.dt <- data.table(structure(list(bin = c("2", "3", "3", "4", "4", "4", "4", "4","4", "5", "5", "7", "7", "7", "7", "7", "7", "8", "8", "9", "9","11", "12", "14", "14", "14", "14", "14", "14", "14", "14", "15","15", "15", "15", "15", "15", "15", "15", "15", "15", "16", "16","17", "17", "17", "18", "20", "20", "20", "21", "21", "21", "21","21", "21", "21", "21", "21", "21", "22", "22", "5057", "5057","5057", "5057", "5059", "5059", "5059", "5059", "5059", "5060","5060", "5060", "5060", "5060", "5060", "5061", "5063", "5063","5064", "5064", "5064", "5064", "5064", "5064", "5064", "5064","5064", "5064", "5064", "5064", "5064", "5064", "5064", "5064","5064", "5064", "5064", "5064"), start = c(3500000L, 4000000L,4000000L, 4500000L, 4500000L, 4500000L, 4500000L, 4500000L, 4500000L,5000000L, 5000000L, 6000000L, 6000000L, 6000000L, 6000000L, 6000000L,6000000L, 6500000L, 6500000L, 7000000L, 7000000L, 8000000L, 8500000L,9500000L, 9500000L, 9500000L, 9500000L, 9500000L, 9500000L, 9500000L,9500000L, 10000000L, 10000000L, 10000000L, 10000000L, 10000000L,10000000L, 10000000L, 10000000L, 10000000L, 10000000L, 10500000L,10500000L, 11000000L, 11000000L, 11000000L, 11500000L, 12500000L,12500000L, 12500000L, 13000000L, 13000000L, 13000000L, 13000000L,13000000L, 13000000L, 13000000L, 13000000L, 13000000L, 13000000L,13500000L, 13500000L, 162500000L, 162500000L, 162500000L, 162500000L,163500000L, 163500000L, 163500000L, 163500000L, 163500000L, 164000000L,164000000L, 164000000L, 164000000L, 164000000L, 164000000L, 164500000L,165500000L, 165500000L, 166000000L, 166000000L, 166000000L, 166000000L,166000000L, 166000000L, 166000000L, 166000000L, 166000000L, 166000000L,166000000L, 166000000L, 166000000L, 166000000L, 166000000L, 166000000L,166000000L, 166000000L, 166000000L, 166000000L), site = c("ssCTCF","ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ssCTCF+Cohesin", "ssCTCF+Cohesin","ssCTCF+Cohesin", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF+Cohesin","ssCTCF", "ssCTCF+Cohesin", "ssCTCF+Cohesin", 
"ssCTCF", "ucCTCF","ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF", "ssCTCF+Cohesin", "ssCTCF","ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF+Cohesin","ssCTCF", "ssCTCF+Cohesin", "ssCTCF+Cohesin", "ssCTCF+Cohesin","ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ssCTCF+Cohesin", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF+Cohesin", "ssCTCF", "ssCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF+Cohesin", "ucCTCF", "ssCTCF","ssCTCF+Cohesin", "ssCTCF+Cohesin", "ssCTCF", "ucCTCF", "ssCTCF","ssCTCF+Cohesin", "ssCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF","ucCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF", "ssCTCF", "ucCTCF","ucCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ucCTCF", "ucCTCF", "ssCTCF","ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ssCTCF", "ucCTCF","ucCTCF", "ssCTCF+Cohesin", "ucCTCF", "ucCTCF", "ucCTCF"), chr = structure(c(1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 20L, 20L,20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L,20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L,20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L), .Label = c("1","10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "2","3", "4", "5", "6", "7", "8", "9", "X"), class = "factor")), .Names = c("bin","start", "site", "chr"), sorted = "chr", class = c("data.table","data.frame"), row.names = c(NA, -100L)), key='chr')

END EDIT

Next I want to create all possible combinations of each row with every other row (grouped by chr). The result will be joined onto some other data, so I think it is best (and simplest) to precompute it:

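library(data.table)
library(dplyr)

# grouped by chr column
bedbin.dt = group_by(bedbin.dt, chr)

# an outer()-like function: pairs every row of dt with every row of dt
outerFun = function(dt)
  {
   n = nrow(dt)
   unique(data.table(
    x=dt[rep(seq_len(n), each=n),],        # row 1 n times, then row 2 n times, ...
    y=dt[rep.int(seq_len(n), times=n),]))  # rows 1..n, repeated n times over
  }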
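To make the indexing in outerFun concrete, here is a minimal base R sketch of the two index vectors for a hypothetical group of 3 rows (the value n = 3 is an assumption for illustration only). Pairing `rep(..., each = n)` with `rep.int(..., times = n)` enumerates every (i, j) combination, so the output has n^2 rows per group and grows quadratically with group size.

```r
# Hypothetical group size, chosen only for illustration
n <- 3

# Row index for the "x" side: each row repeated n times in a run
i <- rep(seq_len(n), each = n)        # 1 1 1 2 2 2 3 3 3

# Row index for the "y" side: the whole sequence repeated n times
j <- rep.int(seq_len(n), times = n)   # 1 2 3 1 2 3 1 2 3

# Side by side, the two vectors list all n^2 ordered pairs
pairs <- cbind(i, j)
print(pairs)
```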

> system.time((outer.bedbin.dt = do(bedbin.dt, outerFun)))
   user  system elapsed 
 90.607  13.993 105.536

To my mind this is slow, although it is quite a lot quicker than using data.frame or base functions like by() or lapply(). However, this is actually a smallish dataset that I am testing it on.

So I'm wondering if anyone has ideas for a faster version of outerFun. Is there a faster way than rep() or rep.int()?

Stephen Henderson asked Oct 23 '13 20:10


1 Answer

As Ricardo pointed out, it sounds like you simply want this:

bedbin.dt[, CJ(1:.N, 1:.N), by = chr]
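Note that the CJ() call returns pairs of row indices within each chr group, not the row contents. As a rough sketch of how those indices might be mapped back to columns, the example below uses a small made-up table (`toy`, with invented values standing in for bedbin.dt) and hypothetical output column names (`x.bin`, `y.bin`, and so on). It is one possible way to do this under those assumptions, not necessarily what the answer intended.

```r
library(data.table)

# Toy stand-in for bedbin.dt; the values are made up for illustration
toy <- data.table(
  chr  = c("1", "1", "1", "20", "20"),
  bin  = c("2", "3", "4", "5057", "5059"),
  site = c("ssCTCF", "ucCTCF", "ssCTCF", "ucCTCF", "ssCTCF")
)

# All (i, j) row-index pairs within each chromosome, as in the answer
idx <- toy[, CJ(i = seq_len(.N), j = seq_len(.N)), by = chr]

# Use the same index pairs to pull the column values for both sides
pairs <- toy[, {
  ij <- CJ(i = seq_len(.N), j = seq_len(.N))
  data.table(x.bin = bin[ij$i], x.site = site[ij$i],
             y.bin = bin[ij$j], y.site = site[ij$j])
}, by = chr]

print(pairs)
```

Unlike outerFun, this does not drop duplicate combinations; if the input contains repeated rows, wrapping the result in unique() would be needed to match the original behaviour.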
eddi answered Oct 16 '22 10:10