
Check if each row of a data frame is contained in another data frame

Tags: dataframe, r

I wrote the following function, and it works. However, it is very slow when df1 has 1700 rows and df2 has 70000 rows. Is there any way to improve its efficiency?

rowcheck <- function(df1, df2) {
  apply(df1, 1, function(x) any(apply(df2, 1, function(y) all(y == x))))
}

Here is an example of the kind of data I want to apply this function to:

df1=data.frame(a=c(1:3),b=c("a","b","c"))
df2=data.frame(a=c(1:6),b=rep(c("a","b","c"),2))

For each row of df1, I want to check whether it also appears as a row in df2. The function should return a logical vector of length nrow(df1).

Thank you for your help.

Bruce Chen asked Mar 26 '14 21:03



2 Answers

One way is to paste the columns of each row together into a single string and compare the results with %in%. The result is a logical vector of length nrow(df1), as requested.

do.call(paste0, df1) %in% do.call(paste0, df2)
# [1] TRUE TRUE TRUE
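
One thing to watch with the paste0 approach: with no separator, different rows can collapse to the same string (for example 1 and 12 versus 11 and 2 both give "112"). A minimal sketch of a variant with a separator follows. The helper names row_key and rows_in are hypothetical, not from the answer, and the separator "\r" is assumed never to occur in the data.

```r
# Hypothetical helper: build one string key per row, joining the columns
# with a separator that is assumed not to appear in any value.
row_key <- function(df, sep = "\r") {
  # unname() so a column called "sep" or "collapse" cannot clash with
  # paste()'s own arguments
  do.call(paste, c(unname(as.list(df)), sep = sep))
}

# Hypothetical wrapper: TRUE for each row of df1 that also occurs in df2
rows_in <- function(df1, df2) {
  row_key(df1) %in% row_key(df2)
}

df1 <- data.frame(a = 1:3, b = c("a", "b", "c"))
df2 <- data.frame(a = 1:6, b = rep(c("a", "b", "c"), 2))
rows_in(df1, df2)
# [1] TRUE TRUE TRUE

# Two different rows that paste0 cannot tell apart
x <- data.frame(a = 1, b = 12)
y <- data.frame(a = 11, b = 2)
do.call(paste0, x) %in% do.call(paste0, y)
# [1] TRUE
rows_in(x, y)
# [1] FALSE
```

Both versions compare the character form of the values, so the two data frames should hold their columns in the same order and with the same formatting.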
Rich Scriven answered Oct 13 '22 04:10


Try:

Filter(function(x) x > 0, which(duplicated(rbind(df2, df1))) - nrow(df2))

This returns the row numbers of df1 that occur in df2. If you want a logical vector like the one in Richard Scriven's answer, try:

duplicated(rbind(df2, df1))[-seq_len(nrow(df2))]

It is also faster than the original rowcheck, since duplicated is implemented internally in C. In the benchmark below, my version is called rowcheck2:

> microbenchmark(rowcheck(df1, df2), rowcheck2(df1, df2))
 Unit: milliseconds
                expr      min       lq   median       uq       max neval
  rowcheck(df1, df2) 2.045210 2.169182 2.328296 3.539328 13.971517   100
  rowcheck2(df1, df2) 1.046207 1.112395 1.243390 1.727921  7.442499   100
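
The definition of rowcheck2 is not shown above. A minimal sketch, assuming it simply wraps the duplicated expression, might look like this. It also shows one behaviour worth knowing about: a row repeated inside df1 is flagged from its second occurrence onward even when it is absent from df2. The data frame d is made up for illustration.

```r
# Assumed definition: the answer does not show rowcheck2 itself.
rowcheck2 <- function(df1, df2) {
  # Stack df2 on top of df1, flag every row that already appeared earlier
  # in the stack, then drop the flags that belong to df2's own rows.
  duplicated(rbind(df2, df1))[-seq_len(nrow(df2))]
}

df1 <- data.frame(a = 1:3, b = c("a", "b", "c"))
df2 <- data.frame(a = 1:6, b = rep(c("a", "b", "c"), 2))
rowcheck2(df1, df2)
# [1] TRUE TRUE TRUE

# Illustrative data: the same row twice, and that row is not in df2
d <- data.frame(a = c(9L, 9L), b = c("z", "z"))
rowcheck2(d, df2)
# [1] FALSE  TRUE

# With the repeat removed first, the row is correctly reported as absent
rowcheck2(unique(d), df2)
# [1] FALSE
```

If df1 can contain repeated rows, deduplicating it first (or using the %in% approach) avoids these false positives.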
Robert Krzyzanowski answered Oct 13 '22 04:10