Imagine you have two data frames:
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))
Here's what they look like, side by side:
> cbind(df1, df2)
  V1 v2 V1 v2
1  1  a  1  b
2  2  b  2  b
3  3  c  2  c
You want to know which rows of df2 also occur in df1, comparing across all variables at once.
This can be done by pasting the columns together and then using %in%:
df1Vec <- apply(df1, 1, paste, collapse= "")
df2Vec <- apply(df2, 1, paste, collapse= "")
df2Vec %in% df1Vec
[1] FALSE TRUE FALSE
The second observation is thus the only row of df2 that also appears in df1.
Is there a faster way of generating this output, something like an %IN% operator that behaves like %in% across multiple variables, or should we just be content with the apply(paste) solution?
I would go with:
interaction(df2) %in% interaction(df1)
# [1] FALSE TRUE FALSE
You can wrap it in a binary operator:
"%IN%" <- function(x, y) interaction(x) %in% interaction(y)
Then:
df2 %IN% df1
# [1] FALSE TRUE FALSE
rbind(df2, df2) %IN% df1
# [1] FALSE TRUE FALSE FALSE TRUE FALSE
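One caveat worth sketching: interaction() joins the values with "." by default, so two different rows can collapse to the same key when the values themselves contain a dot. The data and the helper name rows_in below are made up purely for illustration; the only library feature relied on is the sep argument of interaction(). Passing a separator that is unlikely to occur in the data reduces the risk but does not eliminate it.

```r
# Hypothetical data, chosen so that two different rows share the key "1.1.1"
x <- data.frame(a = c("1.1", "2"), b = c("1", "2"), stringsAsFactors = FALSE)
y <- data.frame(a = "1", b = "1.1", stringsAsFactors = FALSE)

# Default separator ".": row 1 of x ("1.1", "1") and row 1 of y ("1", "1.1")
# both become "1.1.1", so the first element is a false positive.
default_result <- interaction(x) %in% interaction(y)

# Hypothetical helper: same idea, but with an explicit separator
# (a carriage return here, assumed not to appear in the data).
rows_in <- function(x, y, sep = "\r") {
  interaction(x, sep = sep) %in% interaction(y, sep = sep)
}

# With the unusual separator the two rows no longer collide.
safer_result <- rows_in(x, y)
```

A binary operator such as %IN% cannot take a third argument, which is why the separator is exposed through an ordinary function here.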
Disclaimer: I have somewhat modified my answer from a previous one that was using do.call(paste, ...) instead of interaction(...). Consult the history if you like. I think that Arun's claims about "terrible inefficiency" (a bit extreme IMHO) still hold but if you like a concise solution that uses base R only and is fast-ish with small-ish data that's probably it.
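For reference, a do.call(paste, ...) variant could look roughly like the sketch below. This is a reconstruction of the general idea, not the code from the answer's edit history; the names row_key and %IN2% are hypothetical. Unlike apply(df, 1, paste), it pastes whole columns element-wise, so it never loops over rows in R code.

```r
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))

# Hypothetical helper: one character key per row, built by pasting the
# columns together element-wise. unname() keeps a column that happens to be
# called "sep" or "collapse" from being mistaken for an argument of paste().
row_key <- function(df, sep = "\r") {
  do.call(paste, c(unname(as.list(df)), sep = sep))
}

# Hypothetical operator name, to avoid clashing with %IN% defined earlier
`%IN2%` <- function(x, y) row_key(x) %in% row_key(y)

res <- df2 %IN2% df1
# Only the second row of df2 (2, "b") also occurs in df1
```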