I wrote the following function, and it works. However, it is very slow when df1 has 1700 rows and df2 has 70000 rows. Is there any way to improve its efficiency?
rowcheck <- function(df1, df2){
         apply(df1, 1, function(x) any(apply(df2, 1, function(y) all(y==x))))
}
For example, I want to check whether each row in df1 also appears as a row in df2:
df1=data.frame(a=c(1:3),b=c("a","b","c"))
df2=data.frame(a=c(1:6),b=rep(c("a","b","c"),2))
For each row of df1, I want to check whether it is contained as a row in df2. The function should return a logical vector of length nrow(df1).
Thank you for your help.
One way is to paste each row into a single string and compare the strings with %in%. The result is a logical vector of length nrow(df1), as requested.
do.call(paste0, df1) %in% do.call(paste0, df2)
# [1] TRUE TRUE TRUE
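One caveat with pasting without a separator: two different rows can collapse to the same string (for example, 1 and "1a" versus 11 and "a" both give "11a"). A minimal sketch of a variant that pastes with a delimiter is shown here. The helper name `row_key` and the choice of "\r" as delimiter are assumptions for illustration; the delimiter must not occur in the data, and the sketch assumes no column is named `sep` or `collapse`.

```r
# Hypothetical helper: build one key string per row, with a delimiter
# between columns so that adjacent values cannot run together.
row_key <- function(df, sep = "\r") {
  do.call(paste, c(df, sep = sep))
}

# Data from the question.
df1 <- data.frame(a = 1:3, b = c("a", "b", "c"))
df2 <- data.frame(a = 1:6, b = rep(c("a", "b", "c"), 2))
in_df2 <- row_key(df1) %in% row_key(df2)

# A collision case: the first row of x is not in y, but paste0 cannot tell.
x <- data.frame(a = c(1, 11), b = c("1a", "a"))
y <- data.frame(a = 11, b = "a")
naive <- do.call(paste0, x) %in% do.call(paste0, y)
keyed <- row_key(x) %in% row_key(y)
```

Here `naive` reports a match for both rows of `x`, while `keyed` only reports the second row.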
Try:
Filter(function(x) x > 0, which(duplicated(rbind(df2, df1))) - nrow(df2))
It will tell you which rows of df1 (by row number) occur in df2. If you want an atomic vector of logicals, like in Richard Scriven's answer, try:
duplicated(rbind(df2, df1))[-seq_len(nrow(df2))]
It is also faster than the original, since duplicated is implemented internally in C. In the benchmark, my version is rowcheck2:
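The definition of rowcheck2 is not shown in the answer. A minimal sketch of how the one-liner might be wrapped as a function follows; the function body is an assumption, and it uses tail() rather than negative indexing so that an empty df2 does not produce an empty result. Note that duplicated() flags any row that equals an earlier row, so a row repeated within df1 is reported as TRUE on its second occurrence even if it never appears in df2.

```r
# Assumed wrapper around the duplicated(rbind(...)) idea from the answer.
rowcheck2 <- function(df1, df2) {
  combined <- rbind(df2, df1)
  # Keep only the flags that belong to the df1 part (the last nrow(df1) rows).
  tail(duplicated(combined), nrow(df1))
}

# Data from the question.
df1 <- data.frame(a = 1:3, b = c("a", "b", "c"))
df2 <- data.frame(a = 1:6, b = rep(c("a", "b", "c"), 2))
res <- rowcheck2(df1, df2)

# Caveat: a row repeated inside df1 is flagged even though df2 lacks it.
df3 <- data.frame(a = c(9L, 9L), b = c("z", "z"))
res_dup <- rowcheck2(df3, df2)
```

If df1 can contain repeated rows, calling the function on unique(df1) avoids the false positive.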
> microbenchmark(rowcheck(df1, df2), rowcheck2(df1, df2))
 Unit: milliseconds
                expr      min       lq   median       uq       max neval
  rowcheck(df1, df2) 2.045210 2.169182 2.328296 3.539328 13.971517   100
  rowcheck2(df1, df2) 1.046207 1.112395 1.243390 1.727921  7.442499   100
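The timings above are in the low milliseconds, so they say little about the 1700 by 70000 case from the question. A rough sketch for timing the duplicated approach at that size, using only base R, could look like this. The data are randomly generated for illustration (the seed, value ranges, and the names `big1` and `big2` are assumptions), and no timing is quoted here because it depends on the machine.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
n1 <- 1700
n2 <- 70000

# Synthetic data with the same column layout as the question's example.
big2 <- data.frame(a = sample(1000L, n2, replace = TRUE),
                   b = sample(letters, n2, replace = TRUE))
# Draw df1 from df2 so that every row is known to be present.
big1 <- big2[sample(n2, n1), ]

# system.time() reports user, system, and elapsed seconds.
timing <- system.time(
  res_big <- duplicated(rbind(big2, big1))[-seq_len(nrow(big2))]
)
print(timing)
```

Because big1 is sampled from big2, every element of `res_big` should be TRUE, which gives a quick sanity check alongside the timing.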