Horrible title question, but this is what I am trying to achieve. For Table1 I want to add the Column "BETWEEN", verifying if the "POSITION" falls between any of the "START" and "STOP" values for the corresponding "BIN" in Table2.
Table1. BIN names (character) and POSITION in BIN (numeric):
BIN POSITION
1 12
1 52
1 86
7 6
7 22
X 112
X 139
MT 3
MT 26
Table2: BIN name (character) and START and STOP positions (numeric)
BIN START STOP
1 2 64
1 90 110
7 20 100
7 105 200
X 1 5
MT 1 1000
And the desired result - Table 1 with "BETWEEN":
CHROM POSITION BETWEEN
1 12 TRUE
1 52 TRUE
1 86 FALSE
7 6 FALSE
7 22 TRUE
X 112 FALSE
X 139 FALSE
MT 3 TRUE
MT 26 TRUE
My Table 1 has about 4,000,000 rows, and Table 2 about 500,000 rows, and anything I came up with was very slow.
As an example of bigger tables, use the following:
positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50), stop = start+25)
The desired output would be:
tab1
bin pos between
1: A 1 FALSE
2: A 11 TRUE
3: A 21 TRUE
4: A 31 FALSE
5: A 41 FALSE
The following method requires that for a given bin, the bins are mutually exclusive. (e.g. you cant have bin A with bounds 1-5 and another bin A with bounds 4-8.) Also, I modified your example a bit.
positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
setkey(tab1,"bin","pos")
tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50))
tab2[, end := start+25]
tab2[,pos:=start]
setkey(tab2,"bin","pos")
x<-tab2[tab1, roll=TRUE, nomatch=0]
tab2[,pos:=end]
setkey(tab2,"bin","pos")
y<-tab2[tab1, roll=-Inf, nomatch=0]
setkey(x,"bin","pos","start")
setkey(y,"bin","pos","start")
inBin<-x[y,nomatch=0]
inBin[, between:=TRUE]
setkey(tab1,"bin","pos")
setkey(inBin,"bin","pos")
result<-inBin[,list(bin,pos,between)][tab1]
result[is.na(between), between:=FALSE]
I don't have the time to explain my solution in depth right now. Instead I'll take the cheap way out and refer you to research the roll
parameter of data.table. The basic methodology above is that I'm joining tab1 and tab2, rolling pos forward to the nearest end bound. Then I join tab1 and tab2, rolling pos backward to the nearest start bound. Then I do an inner join on the those two sets, giving me all rows in tab1 which fall inside the bounds of a bin. From that point, it's just grunt work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With