I have a data frame with about 100,000 rows. I am interested in one column, POS, and I want to check whether each POS value falls between the Start and End values in any row of a second data frame, and count how many such values there are.
E.g., in my first data frame, I have something like:
ID POS
A 20
B 533
C 600
And in my other data frame, I have something like:
START END
123 150
489 552
590 600
I want to know how many items in POS are in any of the START-END ranges. So in this case, there would be 2 items. Also, if possible, can I get the IDs of the ones with POS between Start and End, too?
How can I go about doing that without having to use a nested for loop?
This is a fairly common problem, and one that often comes up in the context of a database. Here is a solution using sqldf:
library(sqldf)
query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)
If the ranges in your second data frame might overlap, then the above query could return more than one result for a given POS value. In this case, replace SELECT POS with SELECT DISTINCT POS.
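To see the effect of overlapping ranges, here is a small sketch on the question's sample data. The fourth range (500 to 560) is a made-up addition, included only so that two ranges overlap; the variable names all_hits and unique_hits are also just illustrative.

```r
library(sqldf)

# Sample data from the question
df1 <- data.frame(ID = c("A", "B", "C"), POS = c(20, 533, 600))

# Ranges from the question, plus one hypothetical overlapping range (500-560)
df2 <- data.frame(START = c(123, 489, 590, 500),
                  END   = c(150, 552, 600, 560))

# Without DISTINCT, B (POS 533) is returned once per range that contains it
all_hits <- sqldf("SELECT ID, POS FROM df1 INNER JOIN df2
                   ON df1.POS BETWEEN df2.START AND df2.END")

# With DISTINCT, each matching row of df1 appears only once
unique_hits <- sqldf("SELECT DISTINCT ID, POS FROM df1 INNER JOIN df2
                      ON df1.POS BETWEEN df2.START AND df2.END")

nrow(all_hits)     # 3 rows: B twice, C once
nrow(unique_hits)  # 2 rows: B and C
unique_hits$ID     # the IDs whose POS falls in at least one range
```

Note that BETWEEN is inclusive on both ends, which is why C (POS 600) matches the 590-600 range.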
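We can use a non-equi join with data.table:
library(data.table)
# Join each range in df2 to the rows of df1 whose POS falls inside it, then count the matches.
# Note: POS > START excludes a POS equal to START; use POS >= START if both ends should be inclusive.
setDT(df1)[df2, on = .(POS > START, POS <= END)][, sum(!is.na(ID))]
#[1] 2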
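The question also asks for the IDs. A minimal sketch of how that might be done with the same kind of non-equi join, assuming the sample data from the question and bounds that are inclusive on both ends (an assumption; the join above uses POS > START). The names hits and matched_ids are illustrative.

```r
library(data.table)

# Sample data from the question
df1 <- data.table(ID = c("A", "B", "C"), POS = c(20, 533, 600))
df2 <- data.table(START = c(123, 489, 590), END = c(150, 552, 600))

# For each range in df2, pull the IDs of df1 rows whose POS lies inside it.
# nomatch = NULL drops ranges that contain no POS value.
hits <- df1[df2, .(ID), on = .(POS >= START, POS <= END), nomatch = NULL]

# unique() guards against counting an ID twice when ranges overlap
matched_ids <- unique(hits$ID)

matched_ids          # "B" "C" for the sample data
length(matched_ids)  # 2
```

Taking unique IDs rather than counting joined rows keeps the count correct even if a POS value falls inside more than one range.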