I have a genetic dataset where I am looking to order samples/genes, grouping by those which are in a certain distance to each other in the genome.
So for example my dataset looks like:
#dt1
Gene chromosome position CP
Gene1 1 70000200 1:70000200
Gene2 5 10000476 5:10000476
Gene3 1 70000201 1:70000201
Gene4 5 10000475 5:10000475
I also have an origin position dataset:
#dt2
chromosome position CP
1 70005000 1:70005000
5 10005000 5:10005000
I am trying to group genes in my 1st dataset if they are within +/- 500000 distance of any position in my second dt2 dataset and are on the same chromosome. I have an issue in my actual data where this might be true for a gene against multiple origin dt2 positions, so I'm also trying to sort to the one it is closest to.
Output aims to give ordered groups:
Gene chromosome position Group
Gene1 1 70000200 1
Gene3 1 70000201 1
Gene4 5 10000475 2
Gene2 5 10000476 2
Gene1 and Gene3 are within the 500000 of an origin dt2 position and all are on the same chromosome so grouped together and the same for Genes 4 and 2
Currently I am trying to do this with:
dt2[, c("low", "high") := .(position - 500000, position + 500000)]
#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.CP,
on = .(chromosome, position > low, position < high ) ]
#outputs:
Gene chromosome position CP match
1 Gene1 1 70000200 1:70000200 1:70005000
2 Gene2 5 10000476 5:10000476 5:10005000
3 Gene3 1 70000201 1:70000201 1:70005000
4 Gene4 5 10000475 5:10000475 5:10005000
I am having problems with this on 2 levels with seemingly not getting this output for the match column on my actual data, so I am wondering if there are other ways to code this that I can try. I am also struggling to convert the match column into grouping matches and identifying the groups as I want in my expected output, I have a biology background so I'm unsure how to change this - any help would be appreciated.
Input data:
#dt1:
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L,
70000201L, 10000475L), CP = c("1:70000200", "5:10000476",
"1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000",
"1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
#dt2:
structure(list(chromosome = c(1L, 5L), position = c(70005000L,
10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000,
9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"))
There are packages geared for doing this in Bioconductor. So one thing you can use is distanceToNearest()
from GenomicRanges
. First we convert them to GRanges objects:
library(GenomicRanges)
gr1=makeGRangesFromDataFrame(dt1,seqnames.field="chromosome",start.field="position",end.field="position")
values(gr1) = dt1[,c("Gene","CP")]
Give a group for dt2:
dt2$Group = 1:nrow(dt2)
gr2=makeGRangesFromDataFrame(dt2,seqnames.field="chromosome",start.field="position",end.field="position")
This step will match every row in gr1 (dt1 GRanges) to its nearest Range in gr2 (dt2):
matches = distanceToNearest(gr1,gr2)
Hits object with 4 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 1 | 4799
[2] 2 2 | 4523
[3] 3 1 | 4798
[4] 4 2 | 4524
-------
queryLength: 4 / subjectLength: 2
We assign this result back:
dt1$group = NA
dt1$group[queryHits(matches)] = dt2$Group[subjectHits(matches)]
dt1$distance = NA
dt1$distance[queryHits(matches)] = values(matches)$distance[subjectHits(matches)]
dt1
Gene chromosome position CP match group distance
1: Gene1 1 70000200 1:70000200 1:70005000 1 4799
2: Gene2 5 10000476 5:10000476 5:10005000 2 4523
3: Gene3 1 70000201 1:70000201 1:70005000 1 4799
4: Gene4 5 10000475 5:10000475 5:10005000 2 4523
Now you can filter away those that are > 500000
If I understand correctly, the OP wants
dt1
if they are within +/- 500000 distance of any position in dt2
and are on the same chromosome.dt2
positions, the closest one is to be selected.So, a rolling join to nearest with subsequent filtering might be the answer.
library(data.table)
dt2[, Group := .I][
dt1, on = .(chromosome, position), roll = "nearest",
.(Gene, chromosome, position, CP = i.CP,
Group = fifelse(abs(x.position - i.position) <= 500000L, Group, NA_integer_))][
order(Group, CP)]
Gene chromosome position CP Group 1: Gene1 1 70000200 1:70000200 1 2: Gene3 1 70000201 1:70000201 1 3: Gene12 1 70005199 1:70005199 1 4: Gene13 1 70005900 1:70005900 2 5: Gene4 5 10000475 5:10000475 3 6: Gene2 5 10000476 5:10000476 3 7: Gene11 1 80000200 1:80000200 NA
Please, note that expanded datasets are used here in order to check if the approach is working for different use cases.
There is one gene, Gene11
, in the expanded dataset which has been assigned to no group, i.e., Group == NA
, because it has no matching position in dt2
within the distance threshold of +/- 500000.
The approach is similar to StupidWolf's GenomicRanges
answer but uses only data.table
functionality.
Both datasets have been expanded in order to cover different uses cases:
library(data.table)
dt1 <- fread("Gene chromosome position CP
Gene1 1 70000200 1:70000200
Gene2 5 10000476 5:10000476
Gene3 1 70000201 1:70000201
Gene4 5 10000475 5:10000475
Gene11 1 80000200 1:80000200
Gene12 1 70005199 1:70005199
Gene13 1 70005900 1:70005900")
dt2 <- fread("chromosome position CP
1 70005000 1:70005000
1 70006000 1:70006000
5 10005000 5:10005000")
I think you are looking for the idiomatic way of performing a non-equi join and then update by reference, i.e.
dt2[, c("rn", "low", "high") := .(.I, position - 500000L, position + 500000L)]
#note that you perform the non-equi join first, and then
#extract the result column before `:=`, which updates by reference
dt1[, Group := dt2[.SD, on=.(chromosome, low<position, high>position), rn]]
dt1
edit regarding multiple matches. In this case you will require a left join:
dt2[, group := .I][dt1, on=.(chromosome, low<position, high>position)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With