I have 2 tables (data & reference; a toy example is below). These tables have START and END positions that I'd like to check for overlaps (using something like foverlaps from the data.table package) and then split the values as shown below.
>data <- data.table(ID=c(1,2,3), Chrom=c(1,1,2), Start=c(1,500,1000), End=c(900,5000,5000), Probes=c(899,4500,4000))
>Ref.table <- data.table(Chrom=c(1,2), Split=c(1000,2000))
>Ref.table
Chrom Split
1 1000
2 2000
>data
ID Chrom Start End Probes
1 1 1 900 899
2 1 500 5000 4500
3 2 1000 5000 4000
As you can see, ID 1 has no overlap with the reference table, so it would be left alone. However, I'd like to split IDs 2 and 3 based on Ref.table.
The resulting table I'd like to get is:
>result
ID Chrom Start End Probes
1 1 1 900 899
2 1 500 1000 500
2 1 1001 5000 4000
3 2 1000 2000 1000
3 2 2001 5000 3000
As I'm sure you can see, there are two parts to this:
1. Split the range in two based on a separate table.
2. Split the # probes proportionally between the two parts.
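For part 2, the arithmetic I have in mind would be something like the following (split_probes is just an illustrative helper, not from any package; it assumes probes are spread uniformly over the interval):

```r
# Illustrative helper: split a probe count proportionally at a
# breakpoint, assuming probes are uniform over [start, end]
split_probes <- function(start, end, split, probes) {
  left <- round(probes * (split - start) / (end - start))
  c(left, probes - left)
}

split_probes(500, 5000, 1000, 4500)   # ID 2: 500 left of the split, 4000 right
split_probes(1000, 5000, 2000, 4000)  # ID 3: 1000 left, 3000 right
```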
I've been searching for an R package that can do this (split ranges by chromosome arm), but haven't been able to find one that does what's shown above. Any links to functions/packages would be appreciated, but I'm also willing to code this myself...with a little help.
So far, I've only been able to use foverlaps to determine whether there are overlaps. Example:
>foverlaps(Ref.table[data[14]$Chrom], data[14], which=TRUE)
xid yid
1: 1 1
Here's a possible foverlaps solution (as mentioned in the Q). The first two steps are simple and pretty much idiomatic: add an End column to Ref.table so we will have overlapping intervals, then key both data sets by Chrom and the interval columns (in v1.9.5+ you can now specify by.x and by.y instead), and simply run foverlaps:
library(data.table)
setDT(Ref.table)[, End := Split]
setkey(Ref.table)
setkey(setDT(data), Chrom, Start, End)
res <- foverlaps(data, Ref.table)
res
# Chrom Split End ID Start i.End Probes
# 1: 1 NA NA 1 1 900 899
# 2: 1 1000 1000 2 500 5000 4500
# 3: 2 2000 2000 3 1000 5000 4000
Now that we have the overlaps, we need to increase the data set size according to our matches. We can condition this on is.na(Split) (which means no overlap was found). I'm not sure if this part could be done more efficiently:
res2 <- res[, if(is.na(Split)) .SD else rbind(.SD, .SD), by = .(ID, Chrom)]
## Or, if you only have one row per group, maybe
## res2 <- res[, if(is.na(Split)) .SD else .SD[c(1L,1L)], by = .(ID, Chrom)]
Now, the last two steps will update the End and Start columns, and then the Probes column, according to the new column values:
res2[!is.na(Split), `:=`(i.End = c(Split[1L], i.End[-1L]),
Start = c(Start[-1L], Split[1L] + 1L)),
by = .(ID, Chrom)]
res2[!is.na(Split), Probes := i.End - Start]
res2
# ID Chrom Split End Start i.End Probes
# 1: 1 1 NA NA 1 900 899
# 2: 2 1 1000 1000 500 1000 500
# 3: 2 1 1000 1000 1001 5000 3999
# 4: 3 2 2000 2000 1000 2000 1000
# 5: 3 2 2000 2000 2001 5000 2999
(You can remove unwanted columns if you wish)
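For instance, one possible cleanup (column names taken from the res2 printout above; the small table here just reproduces that output so the snippet runs standalone):

```r
library(data.table)

# Reproduce the res2 output from above so this snippet is self-contained
res2 <- data.table(ID     = c(1, 2, 2, 3, 3),
                   Chrom  = c(1, 1, 1, 2, 2),
                   Split  = c(NA, 1000, 1000, 2000, 2000),
                   End    = c(NA, 1000, 1000, 2000, 2000),
                   Start  = c(1, 500, 1001, 1000, 2001),
                   i.End  = c(900, 1000, 5000, 2000, 5000),
                   Probes = c(899, 500, 3999, 1000, 2999))

# Drop the helper columns, rename i.End back to End, and restore
# the original column order
res2[, c("Split", "End") := NULL]
setnames(res2, "i.End", "End")
setcolorder(res2, c("ID", "Chrom", "Start", "End", "Probes"))
res2
```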
First define a splitting function:
splitter <- function(data, reftable) {
  # Find the split position for this row's chromosome
  splitsite <- which(reftable$Chrom == data$Chrom)
  if (reftable$Split[splitsite] > data$Start && reftable$Split[splitsite] <= data$End) {
    return(data.frame(ID    = data$ID,
                      Chrom = data$Chrom,
                      Start = c(data$Start, reftable$Split[splitsite] + 1),
                      End   = c(reftable$Split[splitsite], data$End),
                      Probes = c((reftable$Split[splitsite] - data$Start) * data$Probes / (data$End - data$Start),
                                 (data$End - (reftable$Split[splitsite] + 1)) * data$Probes / (data$End - data$Start))))
  } else {
    return(data)
  }
}
then we can run it on each line using dplyr:
library(dplyr)
data %>% group_by(ID) %>%
  do(splitter(., Ref.table))
giving the result below. You can see it has 3999 and 2999 rather than your 4000 and 3000; I'm not sure which you want, based on your row 1. You can fix it by taking out the + 1 in (data$End - (reftable$Split[splitsite] + 1)).
ID Chrom Start End Probes
1 1 1 1 900 899
2 2 1 500 1000 500
3 2 1 1001 5000 3999
4 3 2 1000 2000 1000
5 3 2 2001 5000 2999
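To illustrate that fix, here's a sketch (splitter2 is just a hypothetical name; it's the same function with the + 1 dropped from the second Probes term, so the two parts sum back to the original probe count and reproduce your 4000 and 3000):

```r
# Variant of splitter with the "+ 1" removed from the second Probes term
splitter2 <- function(data, reftable) {
  s <- reftable$Split[reftable$Chrom == data$Chrom]
  if (length(s) == 1 && s > data$Start && s <= data$End) {
    frac <- (s - data$Start) / (data$End - data$Start)
    return(data.frame(ID     = data$ID,
                      Chrom  = data$Chrom,
                      Start  = c(data$Start, s + 1),
                      End    = c(s, data$End),
                      Probes = c(data$Probes * frac,
                                 data$Probes * (1 - frac))))
  }
  data
}

splitter2(data.frame(ID = 2, Chrom = 1, Start = 500, End = 5000, Probes = 4500),
          data.frame(Chrom = c(1, 2), Split = c(1000, 2000)))
#   ID Chrom Start  End Probes
# 1  2     1   500 1000    500
# 2  2     1  1001 5000   4000
```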