I have two files
encode
X.pattern.name chr start stop strand score p.value q.value matched.sequence
1 V_CETS1P54_01 chr1 98769545 98769554 + 11.42280 8.89e-05 NA TCAGGATGTA
2 V_CETS1P54_01 chr1 152013037 152013046 + 11.98020 4.74e-05 NA ACAGGAAGTT
3 V_CETS1P54_01 chr1 168932563 168932572 + 11.60860 7.59e-05 NA ACCGGATGCT
encode.total
chr start stop
1 chr1 58708485 58708713
2 chr1 58709084 58710538
3 chr1 98766295 98766639
4 chr1 98766902 98770338
5 chr1 107885506 107889414
6 chr1 138589531 138590856
7 chr1 138591180 138593378
8 chr1 152011746 152013185
9 chr1 152014263 152014695
10 chr1 168930561 168933076
11 chr1 181808064 181808906
12 chr1 184609002 184611519
13 chr1 193288453 193289567
14 chr1 193290105 193290490
15 chr1 193290744 193291092
16 chr1 196801920 196804954
I want to compare the two files, each entry by chr, start and stop. If the start and stop values in first file falls between the start and stop of the second file for the same chromosome, then that start & stop values in the first file should be replaced by the second file's start & stop values. I have written a for loop for that purpose but it is taking too long. What are the alternatives?
Code:
for(i in 1:nrow(encode))
{
for(j in 1:nrow(encode.total))
{
if(encode[i,2]==encode.total[j,1])
{
if((encode[i,3]>=encode.total[j,2]) & (encode[i,4]<=encode.total[j,3]))
{
encode[i,3]=encode.total[j,2]
encode[i,4]=encode.total[j,3]
}
}
}
}
For the same purpose I also tried the GenomicRanges package like below. The size of my dataframes is huge and the merge function on them creates a VERY LARGE dataframe (>2 billion rows which is not allowed), although I eventually subset the dataframe in a smaller one. But merge() is taking a LOT of memory and terminating R.
gr1<-GRanges(seqnames=encode$chr,IRanges(start=encode$start,end=encode$end))
gr2<-GRanges(seqnames=encode.total$chr, IRanges(start=encode.total$start,end=encode.total$end))
ranges <- merge(as.data.frame(gr1),as.data.frame(gr2),by="seqnames",suffixes=c("A","B"))
ranges <- ranges[with(ranges, startB <= startA & endB >= endA),]
Use the Bioconductor GenomicRanges package.
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
Suppose there are two data frames x0
and x1
, like encode
and encode.total
in the original example. We'd like to make these into a GRanges instance. I did this with
library(GenomicRanges)
gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
It will often be possible to simply say makeGRangesFromDataFrame(x0)
, or use standard R commands to create a GRanges instance 'by hand'. Here we use with()
, so that we can write GRanges(chr, IRanges(start=start, end=stop))
instead of GRanges(x0$chr, IRanges(start=x0$start, end=x0$stop))
.
The next step is to find overlaps between query (gr0) and subject (gr1)
hits = findOverlaps(gr0, gr1)
which leads to
> hits
Hits of length 3
queryLength: 3
subjectLength: 16
queryHits subjectHits
<integer> <integer>
1 1 4
2 2 8
3 3 10
Then updated the relevant start / end coordinates
ranges(gr0)[queryHits(hits)] = ranges(gr1)[subjectHits(hits)]
giving
> gr0
GRanges with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 98766902, 98770338] *
[2] chr1 [152011746, 152013185] *
[3] chr1 [168930561, 168933076] *
---
seqlengths:
chr1
NA
This will be fast up into the millions of ranges.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With