I am working in R. I have a data-frame which contains the start and end positions on a chromosome (where integer represents a coordinate on the chromosome) Ex:
start end
1 5
3 7
4 10
12 7 (inverted is also allowed)
8 15
What I want is to count how many times a coordinate is present in all these ranges. So, for the above example, the output would be this:
position count
1 1
2 1
3 2
4 3
5 3
6 2
7 3
8 3
9 3
10 3
11 2
12 2
13 1
14 1
15 1
I have 62000+ such ranges, where each range is at least 1000 positions long. I know how to do this conversion but I don't know how to do this efficiently, that is with in couple of seconds.
Current (inefficient code)
positions <- c()
for(i in seq(nrow(a))){
positions <- c(positions, seq(a[i,3], a[i,4]))
}
table(positions)
"a" is my data-frame and the start and end coordinates are in the third and forth column respectively.
One of the columns in the data-frame contains characters, so for using apply
I would either need to create a new data-frame (consuming extra space) or would need to convert to integers inside the apply function (extra time). Sorry, for not informing about this earlier.
For a very fast code with data.table
see the answer from docendo discimus
(+ benchmark)
Here is the benchmark of some other solutions:
set.seed(42)
N <- 1000
df <- data.frame(start=sample.int(10*N, N))
df$end <- df$start + sample(3:20, N, rep=TRUE)
library("microbenchmark")
microbenchmark(unit = "relative",
ori = { positions <- c()
for(i in seq(nrow(df))){
positions <- c(positions, seq(df[i,1], df[i,2]))
}
table(positions) },
a = table(unlist(apply(df, 1, function(x) x[1]:x[2]))), # my solution, similar: KenS, EricSchutte
m1 = table(unlist(mapply(seq, df$start, df$end))), # my variant of Sotos' solution
m2 = table(unlist(mapply(':', df$start, df$end))), # my variant of Sotos' solution
M1 = table(unlist(Map(seq, df$start, df$end))), # my variant of Sotos' solution
M2 = table(unlist(Map(':', df$start, df$end))), # Sotos
l = table(unlist(lapply(seq_len(nrow(df)), function(i) seq(df$start[i], df$end[i])))), # lmo
t = { temp <- unlist(lapply(seq_len(nrow(df)), function(i) seq(df$start[i], df$end[i]))) # lmo tabulate()
cbind(sort(unique(temp)), tabulate(temp)) },
d = table(do.call(c, mapply(seq, df$start, df$end))), # @989 (comment to the answer from Sotos)
dd = table(do.call(c, mapply(seq.int, df$start, df$end))), # docendo discimus (comment to this answer)
f = { pos <- data.frame(x=(min(df):max(df)),n=0) # Andrew Gustar
for(i in seq_along(df$start)){
low=min(df$start[i])-pos$x[1]+1
high=max(df$end[i])-pos$x[1]+1
pos$n[low:high] <- pos$n[low:high]+1
} }
)
# Unit: relative
# expr min lq mean median uq max neval cld
# ori 7.163767 7.219099 7.573688 7.379160 7.912435 7.899586 100 e
# a 1.194627 1.194855 1.211432 1.209485 1.213118 1.711994 100 a
# m1 1.645659 1.660294 1.711141 1.686973 1.710461 2.217141 100 b
# m2 1.005302 1.007125 1.017115 1.009618 1.017207 1.576201 100 a
# M1 1.642688 1.645174 1.733173 1.673924 1.686253 2.218028 100 b
# M2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# l 3.487924 3.512732 3.801530 3.665725 4.188701 4.216375 100 d
# t 2.670636 2.711345 2.961449 2.869190 3.066150 3.745984 100 c
# d 1.652376 1.650798 1.721377 1.665901 1.712064 2.187129 100 b
# dd 1.040941 1.045652 1.060601 1.047534 1.053305 1.592163 100 a
# f 8.287098 8.486854 9.052884 9.046376 9.126318 25.210722 100 f
The solution with tabulate()
produces warnings.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With