I have two files that I'd like to combine using R.
head(bed)
chr8 41513235 41513282 ANK1.Exon1
chr8 41518973 41519092 ANK1.Exon2
The first one gives intervals and their names (chromosome, from, to, name).
head(coverage)
chr1 41513235 20
chr1 41513236 19
chr1 41513237 19
The second one gives the coverage for single bases (chromosome, position, coverage).
I now want to write the name of the matching exon next to each position. Some positions will not fall into any exon; I want to delete those afterwards.
I figured out a way to do this. However, it needs three nested for loops and about 15 hours of computing time. Since for loops are not best practice in R, I'd like to know if anyone knows a better way than:
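# Add an empty fourth column that will hold the exon name
coverage <- cbind(coverage, "Exon")
coverage[,4] <- NA
for(i in 1:nrow(bed)){                 # every interval
  for(n in bed[i,2]:bed[i,3]){         # every base inside the interval
    for(m in 1:nrow(coverage)){        # every coverage row
      if(coverage[m,2]==n){
        coverage[m,4] <- bed[i,4]
      }
    }
  }
}
# Drop positions that did not fall into any exon
coverage <- na.omit(coverage)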
Since all three positions lie in the interval "ANK1.Exon1", the output should look like this:
head(coverage)
chr1 41513235 20 ANK1.Exon1
chr1 41513236 19 ANK1.Exon1
chr1 41513237 19 ANK1.Exon1
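As a possible intermediate step, the two inner loops can be replaced by one vectorized comparison per interval. The following is only a sketch on the sample rows shown above; the column names (Chromosome, from, to, name, position, coverage) are assumptions, since the files are shown without headers. Like the original loop, it ignores the chromosome column.

```r
# Sample rows from the question; the column names are assumed.
bed <- data.frame(
  Chromosome = c("chr8", "chr8"),
  from = c(41513235, 41518973),
  to   = c(41513282, 41519092),
  name = c("ANK1.Exon1", "ANK1.Exon2"),
  stringsAsFactors = FALSE
)
coverage <- data.frame(
  Chromosome = c("chr1", "chr1", "chr1"),
  position = c(41513235, 41513236, 41513237),
  coverage = c(20, 19, 19),
  stringsAsFactors = FALSE
)

coverage$Exon <- NA_character_
# One pass per interval; each comparison runs over all positions at once.
for (i in seq_len(nrow(bed))) {
  inside <- coverage$position >= bed$from[i] & coverage$position <= bed$to[i]
  coverage$Exon[inside] <- bed$name[i]
}
# Drop positions that fall into no interval.
coverage <- na.omit(coverage)
```

This still loops over the intervals, so it scales with the number of rows in bed rather than with the number of bases they cover.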
The fastest way to perform what I was looking for was:
library("sqldf")
res <- sqldf("select * from coverage f1 inner join bed f2
on(f1.position >=f2.'from' and f1.position <=f2.'to')")
The computation time went down to seconds. To get exactly the result shown above, I then reduced the data frame to the relevant columns:
res <- cbind(res[1:4],res[8])
Thank you all for your help.
Edit: For large datasets where the same position may appear on more than one chromosome, it is helpful to also match on the chromosome:
res <- sqldf("select * from coverage f1 inner join bed f2
on(f1.position >=f2.'from' and f1.position <=f2.'to' and f1.Chromosome = f2.Chromosome)")
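For readers without sqldf, the same chromosome-aware range join can be sketched in base R with merge() followed by a filter. This is an illustration under assumptions, not the method used above: the column names name and coverage are assumed, and the sample rows are made up so that one position sits on a different chromosome.

```r
# Hypothetical sample data; Chromosome, position, from and to follow the
# names used in the SQL above, the remaining names are assumed.
bed <- data.frame(
  Chromosome = c("chr8", "chr8"),
  from = c(41513235, 41518973),
  to   = c(41513282, 41519092),
  name = c("ANK1.Exon1", "ANK1.Exon2"),
  stringsAsFactors = FALSE
)
coverage <- data.frame(
  Chromosome = c("chr8", "chr8", "chr1"),
  position = c(41513235, 41518980, 41513235),
  coverage = c(20, 19, 19),
  stringsAsFactors = FALSE
)

# Pair every position with every interval on the same chromosome ...
joined <- merge(coverage, bed, by = "Chromosome")
# ... then keep only the pairs where the position lies inside the interval.
hit <- joined$position >= joined$from & joined$position <= joined$to
res <- joined[hit, c("Chromosome", "position", "coverage", "name")]
```

Note that merge() first builds all position/interval pairs per chromosome, so this may need a lot of memory on large inputs.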