How to bypass a nested for loop?

Tags: loops, r

The situation is this: I have one data frame with about 100,000 rows. I am interested in one particular column, POS, and I want to check whether each value of POS lies between two values in another data frame, START and END, and keep track of how many such instances there are.

E.g., in my first data frame, I have something like

ID POS  
A   20  
B   533  
C   600 

And in my other data frame, I have stuff like

START      END  
123        150  
489        552  
590        600  

I want to know how many items in POS fall in any of the START-END ranges. In this case, there would be 2 items. Also, if possible, can I get the IDs of the ones with POS between START and END, too?

How can I go about doing that without having to use a nested for loop?

Alex Johanssen asked Dec 08 '22 15:12


2 Answers

This is a fairly common problem that often comes up in a database context. Here is a solution using sqldf:

library(sqldf)

query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)

If the ranges in your second data frame might overlap, then the above query could return more than one row for a given POS value. In this case, replace SELECT POS with SELECT DISTINCT POS.
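For comparison, the same inclusive range check can be sketched in base R without any extra package. This is only an illustration built on the sample data from the question, not part of the sqldf approach; the names in_range and hits are made up for the example. It builds a full POS-by-range logical matrix, so it assumes the number of rows times the number of ranges fits comfortably in memory.

```r
# Sample data from the question
df1 <- data.frame(ID = c("A", "B", "C"), POS = c(20, 533, 600),
                  stringsAsFactors = FALSE)
df2 <- data.frame(START = c(123, 489, 590), END = c(150, 552, 600))

# in_range[i, j] is TRUE when df1$POS[i] lies in the j-th range,
# with both bounds inclusive (like SQL BETWEEN)
in_range <- outer(df1$POS, df2$START, ">=") & outer(df1$POS, df2$END, "<=")

# A POS counts once even if it falls in several overlapping ranges
hits <- rowSums(in_range) > 0

sum(hits)      # number of POS values inside at least one range
df1$ID[hits]   # IDs of those rows
```

Collapsing each row of the matrix to a single TRUE/FALSE plays the same role as DISTINCT in the query: a POS that sits in two overlapping ranges is still counted only once.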

Tim Biegeleisen answered Dec 29 '22 16:12


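We can use a non-equi join with data.table. Note that the condition POS > START excludes a POS exactly equal to START; use POS >= START if both bounds should be inclusive.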

library(data.table)
setDT(df1)[df2, on = .(POS > START, POS <= END)][, sum(!is.na(ID))]
#[1] 2
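
The question also asks for the IDs of the matching rows. A minimal sketch of one way to get them with the same kind of non-equi join follows, assuming a data.table version that accepts nomatch = NULL (older versions use nomatch = 0L). The names dt1, dt2 and matched_ids are made up for the example, and the lower bound is written as inclusive here, which is a choice and not part of the answer above.

```r
library(data.table)

# Sample data from the question
dt1 <- data.table(ID = c("A", "B", "C"), POS = c(20, 533, 600))
dt2 <- data.table(START = c(123, 489, 590), END = c(150, 552, 600))

# For each range in dt2, find the rows of dt1 whose POS lies inside it.
# nomatch = NULL drops ranges that contain no POS;
# unique() guards against a POS that falls in several overlapping ranges.
matched_ids <- dt1[dt2, unique(ID),
                   on = .(POS >= START, POS <= END),
                   nomatch = NULL]

length(matched_ids)  # how many distinct IDs fall in some range
matched_ids          # the IDs themselves
```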
akrun answered Dec 29 '22 16:12