Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find point-to-range overlaps

I have a dataframe df1:

df1 <- read.table(text=" Chr06  79641   
Chr06   82862   
Chr06   387314  
Chr06   656098  
Chr06   678491  
Chr06   1018696", header=FALSE, stringsAsFactors=FALSE)

I would like to check if each row in df1 is contained in a range in df2. the column2 in df2 is the start of a range, and column3 is the end of a range. no overlapping between ranges (between rows). The data in df2 are sorted by Column1 and column2. I wrote a loop for this but I am not happy to it because It runs so long time if I have a few thousands rows in df1. I would like to find a more efficient way to do this job (better no looping). Thanks. The df2 data frame:

df2 <- read.table(text=" Chr05  799 870
Chr06   77914   77942
Chr06   78233   78269
Chr06   78719   78836
Chr06   79720   87043
Chr06   87223   87305
Chr06   380020  380060
Chr06   387314  387371
Chr06   654907  654988
Chr06   657929  658057
Chr06   677198  677229
Chr06   679555  680170
Chr06   1015425 1015475
Chr06   1018676 1018736
Chr06   1020564 1020592", header=FALSE, stringsAsFactors=FALSE)

My script:

df1$V3 <- FALSE
for (i in 1:dim(df1)[1]) {
  for (j in 1:dim(df2)[1]) {
    if ((df1[i,1] == df2[j,1]) && (df1[i,2]>=df2[j,2]) 
        && (df1[i,2]<=df2[j,3])) {
      df1[i,3]<-TRUE
      break;
    }
  }
}
df1

The expected result is shown as df1.

like image 349
user3354212 Avatar asked Jan 17 '26 22:01

user3354212


1 Answers

Using GenomicRanges:

#Convert to Granges objects
gr1 <- GRanges(seqnames = df1$V1,
               ranges = IRanges(df1$V2, df1$V2))

gr2 <- GRanges(seqnames = df2$V1,
               ranges = IRanges(df2$V2, df2$V3))
#Subset gr1
subsetByOverlaps(gr1, gr2)

# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames             ranges strand
#          <Rle>          <IRanges>  <Rle>
#  [1]    Chr06 [  82862,   82862]      *
#  [2]    Chr06 [ 387314,  387314]      *
#  [3]    Chr06 [1018696, 1018696]      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

#Or we can use merge
mergeByOverlaps(gr1, gr2)

# DataFrame with 3 rows and 2 columns
#                          gr1                        gr2
#                    <GRanges>                  <GRanges>
# 1 Chr06:*:[  82862,   82862] Chr06:*:[  79720,   87043]
# 2 Chr06:*:[ 387314,  387314] Chr06:*:[ 387314,  387371]
# 3 Chr06:*:[1018696, 1018696] Chr06:*:[1018676, 1018736]

Also, look into bedtools:

Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.

like image 146
zx8754 Avatar answered Jan 20 '26 15:01

zx8754



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!