Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to order rows by conditions in other columns in r?

I have a genetic dataset where I am looking to order samples/genes, grouping by those which are in a certain distance to each other in the genome.

So for example my dataset looks like:

#dt1
Gene    chromosome position      CP 
Gene1      1       70000200   1:70000200
Gene2      5       10000476   5:10000476
Gene3      1       70000201   1:70000201
Gene4      5       10000475   5:10000475

I also have an origin position dataset:

#dt2
chromosome   position  CP
    1        70005000  1:70005000
    5        10005000  5:10005000

I am trying to group genes in my 1st dataset if they are within +/- 500000 distance of any position in my second dt2 dataset and are on the same chromosome. I have an issue in my actual data where this might be true for a gene against multiple origin dt2 positions, so I'm also trying to sort to the one it is closest to.

Output aims to give ordered groups:

Gene   chromosome  position   Group 
Gene1      1       70000200    1
Gene3      1       70000201    1
Gene4      5       10000475    2
Gene2      5       10000476    2

Gene1 and Gene3 are within the 500000 of an origin dt2 position and all are on the same chromosome so grouped together and the same for Genes 4 and 2

Currently I am trying to do this with:

dt2[, c("low", "high") := .(position - 500000, position  + 500000)]

#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.CP,
     on = .(chromosome, position > low, position < high ) ]

#outputs:
    Gene chromosome position    CP        match
1   Gene1   1   70000200    1:70000200  1:70005000
2   Gene2   5   10000476    5:10000476  5:10005000
3   Gene3   1   70000201    1:70000201  1:70005000
4   Gene4   5   10000475    5:10000475  5:10005000

I am having problems with this on 2 levels with seemingly not getting this output for the match column on my actual data, so I am wondering if there are other ways to code this that I can try. I am also struggling to convert the match column into grouping matches and identifying the groups as I want in my expected output, I have a biology background so I'm unsure how to change this - any help would be appreciated.

Input data:

#dt1:
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"), 
    chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L, 
    70000201L, 10000475L), CP = c("1:70000200", "5:10000476", 
    "1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000", 
    "1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

#dt2:
structure(list(chromosome = c(1L, 5L), position = c(70005000L, 
10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000, 
9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"))
like image 970
DN1 Avatar asked Jul 01 '20 18:07

DN1


3 Answers

There are packages geared for doing this in Bioconductor. So one thing you can use is distanceToNearest() from GenomicRanges . First we convert them to GRanges objects:

library(GenomicRanges)
gr1=makeGRangesFromDataFrame(dt1,seqnames.field="chromosome",start.field="position",end.field="position")
values(gr1) = dt1[,c("Gene","CP")]

Give a group for dt2:

dt2$Group = 1:nrow(dt2)
gr2=makeGRangesFromDataFrame(dt2,seqnames.field="chromosome",start.field="position",end.field="position")

This step will match every row in gr1 (dt1 GRanges) to its nearest Range in gr2 (dt2):

matches = distanceToNearest(gr1,gr2)
Hits object with 4 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |      4799
  [2]         2           2 |      4523
  [3]         3           1 |      4798
  [4]         4           2 |      4524
  -------
  queryLength: 4 / subjectLength: 2

We assign this result back:

dt1$group = NA
dt1$group[queryHits(matches)] = dt2$Group[subjectHits(matches)]
dt1$distance = NA
dt1$distance[queryHits(matches)] = values(matches)$distance[subjectHits(matches)]
dt1

    Gene chromosome position         CP      match group distance
1: Gene1          1 70000200 1:70000200 1:70005000     1     4799
2: Gene2          5 10000476 5:10000476 5:10005000     2     4523
3: Gene3          1 70000201 1:70000201 1:70005000     1     4799
4: Gene4          5 10000475 5:10000475 5:10005000     2     4523

Now you can filter away those that are > 500000

like image 183
StupidWolf Avatar answered Oct 25 '22 22:10

StupidWolf


If I understand correctly, the OP wants

  • to group genes in dt1 if they are within +/- 500000 distance of any position in dt2 and are on the same chromosome.
  • In case, a gene matches multiple origin dt2 positions, the closest one is to be selected.

So, a rolling join to nearest with subsequent filtering might be the answer.

library(data.table)
dt2[, Group := .I][
  dt1, on = .(chromosome, position), roll = "nearest",
  .(Gene, chromosome, position, CP = i.CP, 
    Group = fifelse(abs(x.position - i.position) <= 500000L, Group, NA_integer_))][
      order(Group, CP)]
     Gene chromosome position         CP Group
1:  Gene1          1 70000200 1:70000200     1
2:  Gene3          1 70000201 1:70000201     1
3: Gene12          1 70005199 1:70005199     1
4: Gene13          1 70005900 1:70005900     2
5:  Gene4          5 10000475 5:10000475     3
6:  Gene2          5 10000476 5:10000476     3
7: Gene11          1 80000200 1:80000200    NA

Please, note that expanded datasets are used here in order to check if the approach is working for different use cases.

There is one gene, Gene11, in the expanded dataset which has been assigned to no group, i.e., Group == NA, because it has no matching position in dt2 within the distance threshold of +/- 500000.

The approach is similar to StupidWolf's GenomicRanges answer but uses only data.table functionality.

Data

Both datasets have been expanded in order to cover different uses cases:

library(data.table)
dt1 <- fread("Gene    chromosome position      CP 
Gene1      1       70000200   1:70000200
Gene2      5       10000476   5:10000476
Gene3      1       70000201   1:70000201
Gene4      5       10000475   5:10000475
Gene11     1       80000200   1:80000200
Gene12     1       70005199   1:70005199
Gene13     1       70005900   1:70005900")
dt2 <- fread("chromosome   position  CP
    1        70005000  1:70005000
    1        70006000  1:70006000
    5        10005000  5:10005000")
like image 26
Uwe Avatar answered Oct 25 '22 23:10

Uwe


I think you are looking for the idiomatic way of performing a non-equi join and then update by reference, i.e.

dt2[, c("rn", "low", "high") := .(.I, position - 500000L, position  + 500000L)]
#note that you perform the non-equi join first, and then 
#extract the result column before `:=`, which updates by reference
dt1[, Group := dt2[.SD, on=.(chromosome, low<position, high>position), rn]]
dt1

edit regarding multiple matches. In this case you will require a left join:

dt2[, group := .I][dt1, on=.(chromosome, low<position, high>position)]
like image 28
chinsoon12 Avatar answered Oct 26 '22 00:10

chinsoon12