Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient method to convert set of ranges in a dataframe to frequency of individual elements?

Tags:

dataframe

r

I am working in R. I have a data-frame which contains the start and end positions on a chromosome (where integer represents a coordinate on the chromosome) Ex:

start     end
1         5
3         7
4         10
12        7            (inverted is also allowed)
8         15

What I want is to count how many times a coordinate is present in all these ranges. So, for the above example, the output would be this:

position     count
1            1
2            1
3            2
4            3
5            3
6            2
7            3
8            3
9            3
10           3
11           2
12           2
13           1
14           1
15           1

I have 62000+ such ranges, where each range is at least 1000 positions long. I know how to do this conversion but I don't know how to do this efficiently, that is with in couple of seconds.

Current (inefficient code)

positions <- c()
for(i in seq(nrow(a))){
  positions <- c(positions, seq(a[i,3], a[i,4]))
}
table(positions)

"a" is my data-frame and the start and end coordinates are in the third and forth column respectively.

One of the columns in the data-frame contains characters, so for using apply I would either need to create a new data-frame (consuming extra space) or would need to convert to integers inside the apply function (extra time). Sorry, for not informing about this earlier.

like image 562
Manish Goel Avatar asked May 04 '17 12:05

Manish Goel


Video Answer


1 Answers

For a very fast code with data.table see the answer from docendo discimus
(+ benchmark)

Here is the benchmark of some other solutions:

set.seed(42)
N <- 1000
df <- data.frame(start=sample.int(10*N, N))
df$end <- df$start + sample(3:20, N, rep=TRUE) 

library("microbenchmark")
microbenchmark(unit = "relative",
ori =  { positions <- c()
  for(i in seq(nrow(df))){
    positions <- c(positions, seq(df[i,1], df[i,2]))
  }
  table(positions) },
a  = table(unlist(apply(df, 1, function(x) x[1]:x[2]))),  # my solution, similar: KenS, EricSchutte
m1 = table(unlist(mapply(seq, df$start, df$end))),        # my variant of Sotos' solution
m2 = table(unlist(mapply(':', df$start, df$end))),        # my variant of Sotos' solution
M1 = table(unlist(Map(seq, df$start, df$end))),           # my variant of Sotos' solution
M2 = table(unlist(Map(':', df$start, df$end))),           # Sotos
l  = table(unlist(lapply(seq_len(nrow(df)), function(i) seq(df$start[i], df$end[i])))),    # lmo
t  = { temp <- unlist(lapply(seq_len(nrow(df)), function(i) seq(df$start[i], df$end[i])))  # lmo tabulate()
cbind(sort(unique(temp)), tabulate(temp)) },
d  = table(do.call(c, mapply(seq, df$start, df$end))),     # @989 (comment to the answer from Sotos)
dd = table(do.call(c, mapply(seq.int, df$start, df$end))), # docendo discimus (comment to this answer)
f  = {  pos <- data.frame(x=(min(df):max(df)),n=0)         # Andrew Gustar

for(i in seq_along(df$start)){
  low=min(df$start[i])-pos$x[1]+1
  high=max(df$end[i])-pos$x[1]+1
  pos$n[low:high] <- pos$n[low:high]+1
} }
)
# Unit: relative
# expr      min       lq     mean   median       uq       max neval    cld
#  ori 7.163767 7.219099 7.573688 7.379160 7.912435  7.899586   100     e 
#    a 1.194627 1.194855 1.211432 1.209485 1.213118  1.711994   100 a     
#   m1 1.645659 1.660294 1.711141 1.686973 1.710461  2.217141   100  b    
#   m2 1.005302 1.007125 1.017115 1.009618 1.017207  1.576201   100 a     
#   M1 1.642688 1.645174 1.733173 1.673924 1.686253  2.218028   100  b    
#   M2 1.000000 1.000000 1.000000 1.000000 1.000000  1.000000   100 a     
#    l 3.487924 3.512732 3.801530 3.665725 4.188701  4.216375   100    d  
#    t 2.670636 2.711345 2.961449 2.869190 3.066150  3.745984   100   c   
#    d 1.652376 1.650798 1.721377 1.665901 1.712064  2.187129   100  b    
#   dd 1.040941 1.045652 1.060601 1.047534 1.053305  1.592163   100 a     
#    f 8.287098 8.486854 9.052884 9.046376 9.126318 25.210722   100      f

The solution with tabulate() produces warnings.

like image 200
jogo Avatar answered Sep 21 '22 15:09

jogo