Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check if a value in one table (X) is between the values in two columns in another table (Y) with R data.table

Tags:

r

data.table

Horrible title question, but this is what I am trying to achieve. For Table1 I want to add the Column "BETWEEN", verifying if the "POSITION" falls between any of the "START" and "STOP" values for the corresponding "BIN" in Table2.

Table1. BIN names (character) and POSITION in BIN (numeric):

  BIN    POSITION
    1          12
    1          52
    1          86
    7           6
    7          22
    X         112
    X         139
   MT           3
   MT          26

Table2: BIN name (character) and START and STOP positions (numeric)

  BIN    START    STOP
    1        2      64
    1       90     110
    7       20     100
    7      105     200
    X        1       5
   MT        1    1000

And the desired result - Table 1 with "BETWEEN":

CHROM    POSITION      BETWEEN
    1          12         TRUE
    1          52         TRUE
    1          86        FALSE
    7           6        FALSE
    7          22         TRUE
    X         112        FALSE
    X         139        FALSE
   MT           3         TRUE
   MT          26         TRUE

My Table 1 has about 4,000,000 rows, and Table 2 about 500,000 rows, and anything I came up with was very slow.

As an example of bigger tables, use the following:

positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")

tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))

tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50), stop = start+25)

The desired output would be:

tab1
        bin   pos    between
     1:   A     1    FALSE
     2:   A    11    TRUE
     3:   A    21    TRUE
     4:   A    31    FALSE
     5:   A    41    FALSE
like image 883
Pete Avatar asked Nov 01 '22 20:11

Pete


1 Answers

The following method requires that for a given bin, the bins are mutually exclusive. (e.g. you cant have bin A with bounds 1-5 and another bin A with bounds 4-8.) Also, I modified your example a bit.

positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
setkey(tab1,"bin","pos")

tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50))
tab2[, end := start+25]

tab2[,pos:=start]
setkey(tab2,"bin","pos")
x<-tab2[tab1, roll=TRUE, nomatch=0]

tab2[,pos:=end]
setkey(tab2,"bin","pos")
y<-tab2[tab1, roll=-Inf, nomatch=0]

setkey(x,"bin","pos","start")
setkey(y,"bin","pos","start")
inBin<-x[y,nomatch=0]
inBin[, between:=TRUE]

setkey(tab1,"bin","pos")
setkey(inBin,"bin","pos")

result<-inBin[,list(bin,pos,between)][tab1]
result[is.na(between), between:=FALSE]

I don't have the time to explain my solution in depth right now. Instead I'll take the cheap way out and refer you to research the roll parameter of data.table. The basic methodology above is that I'm joining tab1 and tab2, rolling pos forward to the nearest end bound. Then I join tab1 and tab2, rolling pos backward to the nearest start bound. Then I do an inner join on the those two sets, giving me all rows in tab1 which fall inside the bounds of a bin. From that point, it's just grunt work.

like image 67
Ben Avatar answered Nov 08 '22 05:11

Ben