I have 2 dataframes:
> access
V1 V2 V3
1 chr10 136122 136533
2 chr10 179432 179769
3 chr10 182988 183371
4 chr10 224234 224489
5 chr10 237693 237958
and
> peaks
V1 V2 V3
1 chr10 126122 126533
2 chr10 179450 179730
3 chr10 182788 183350
4 chr10 224244 224500
5 chr10 237695 237950
The coloumn V2 and V3 are start and end of regions (range) in both dataframes. I want to keep those rows in peaks dataframe for which access$V1 == peaks$V1 AND which fall in the range (or regions) of access dataframe. For example the new dataframe will be like: peaks dataframe's
1st row region doesn't exist in access dataframe so it will be assigned category U.
2nd row of peaks falls in the given range in access dataframe (2nd row) and will be assigned category B.
3rd row of peaks doesn't completely fall in that region but it somehow overlaps with region in 3rd row of access, for this I will assign category A.
4th row of peaks also doesn't overlap completely at it ends 11 number after the end of region in row 4 of access, this will also be in category A.
5th row falls in the region hence will be in category B.
Expected output:
> newdf
V1 V2 V3 V4
1 chr10 126122 126533 U
2 chr10 179450 179730 B
3 chr10 182788 183350 A
4 chr10 224244 224500 A
5 chr10 237695 237950 B
Here are the dput of input dataframes:
> dput(peaks)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"),
V2 = c(126122L, 179450L, 182788L, 224244L, 237695L), V3 = c(126533L,
179730L, 183350L, 224500L, 237950L)), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -5L))
> dput(access)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"),
V2 = c(136122L, 179432L, 182988L, 224234L, 237693L), V3 = c(136533L,
179769L, 183371L, 224489L, 237958L)), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -5L))
Edit:
My new access df looks like this and now I also want to append the last column in my final output df:
> access
V1 V2 V3 V4
1 chr10 136122 136533 found
2 chr10 179432 179769 notFound
3 chr10 182988 183371 found
4 chr10 224234 224489 found
5 chr10 237693 237958 notFound
So now there is one extra condition which is if row in access falls in peaks range then also append the value in V4 in a new column in final df, if some region is not found then by default will be notFound. Therefore, final output will be:
> newdf
V1 V2 V3 V4 V5
1 chr10 126122 126533 U notFound
2 chr10 179450 179730 B notFound
3 chr10 182788 183350 A found
4 chr10 224244 224500 A found
5 chr10 237695 237950 B notFound
Here in row1$V5 the value is notFound because this region was not found and in remaining cases we got the values in V5 from modified access df.
If speed is a concern, the linked data.table solution will probably be superior, but it can also be achieved in dplyr, but likely much slower:
library(dplyr)
names(access)[2:3] <- c('start', 'end')
bind_cols(peaks[-1], access) %>%
rowwise() %>%
mutate(V4 = if_else(all(V2:V3 %in% start:end), 'B',
if_else(any(V2:V3 %in% start:end), 'A',
'U')))
Result:
Source: local data frame [5 x 6]
Groups: <by row>
# A tibble: 5 x 6
V2 V3 V1 start end V4
<int> <int> <fctr> <int> <int> <chr>
1 126122 126533 chr10 136122 136533 U
2 179450 179730 chr10 179432 179769 B
3 182788 183350 chr10 182988 183371 A
4 224244 224500 chr10 224234 224489 A
5 237695 237950 chr10 237693 237958 B
Though its a long way of doing it. But it gives the desired results.
library(dplyr)
df<-cbind(peaks,access) #merging both df
colnames(df)<-c("pV1","pV2","pV3","aV1","aV2","aV3")
df<-df[c(which(df$pV1==df$aV1)),] # selecting rows with pV1=aV1
# creating U, A, B
U1<-df%>%
filter(pV2<aV2 & pV3<aV2)%>%
mutate(V4="U")
U2<-df%>%
filter(pV2>aV3 & pV3>aV3)%>%
mutate(V4="U")
B<-df%>%
filter(pV2>aV2 & pV3<aV3)%>%
mutate(V4="B")
A1<-df%>%
filter(pV2>aV2 & pV3>aV3)%>%
mutate(V4="A")
A2<-df%>%
filter(pV2<aV2 & pV3<aV3 & pV3>aV2)%>%
mutate(V4="A")
#merging U, A and B into newdf
newdf<-arrange(rbind(U1,U2,B,A1,A2),pV2)
newdf<-newdf[,-c(4:6)]
newdf
pV1 pV2 pV3 V4
1 chr10 126122 126533 U
2 chr10 179450 179730 B
3 chr10 182788 183350 A
4 chr10 224244 224500 A
5 chr10 237695 237950 B
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With