
Is there a %in% operator across multiple columns

Tags: r, unique, paste

Imagine you have two data frames:

df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))

Here's what they look like, side by side:

> cbind(df1, df2)
  V1 v2 V1 v2
1  1  a  1  b
2  2  b  2  b
3  3  c  2  c

You want to know which rows of df2 also occur in df1, matching on all variables.

This can be done by pasting the columns together and then using %in%:

df1Vec <- apply(df1, 1, paste, collapse= "")
df2Vec <- apply(df2, 1, paste, collapse= "")
df2Vec %in% df1Vec
[1] FALSE  TRUE FALSE

The second observation is thus the only row of df2 that also appears in df1.
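One caveat with pasting rows together using collapse = "" is that different rows can produce the same string. The following is a minimal sketch of that collision, using two hypothetical one-row data frames (a and b are not part of the question) and assuming that "\r" never occurs in the data:

```r
# Hypothetical one-row data frames whose rows differ but concatenate identically
a <- data.frame(x = "1",  y = "12", stringsAsFactors = FALSE)
b <- data.frame(x = "11", y = "2",  stringsAsFactors = FALSE)

# Without a separator both rows collapse to "112", so the match is spurious
naive <- apply(b, 1, paste, collapse = "") %in% apply(a, 1, paste, collapse = "")

# With a separator that does not occur in the data, the keys stay distinct
safe <- apply(b, 1, paste, collapse = "\r") %in% apply(a, 1, paste, collapse = "\r")

print(c(naive = naive, safe = safe))
```

Any separator works as long as it cannot appear inside a value; "\r" is only one plausible choice.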

Is there a faster way of generating this output, something like %IN% (a version of %in% that works across multiple variables), or should we just be content with the apply(paste) solution?

asked May 31 '14 at 14:05 by RobinLovelace


1 Answer

I would go with

interaction(df2) %in% interaction(df1)
# [1] FALSE  TRUE FALSE
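To see why this works, it may help to look at what interaction() produces. The sketch below rebuilds the question's data frames and prints the per-row labels; the names keys1 and keys2 are introduced here for illustration only:

```r
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))

# interaction() returns a factor with one label per row,
# joining the column values with "." by default
keys1 <- as.character(interaction(df1))
keys2 <- as.character(interaction(df2))

print(keys1)
print(keys2)

# %in% then compares these labels as strings
print(keys2 %in% keys1)
```

Because the labels are joined with ".", values that themselves contain a "." could in principle collide; interaction() accepts a sep argument if a different separator is needed.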

You can wrap it in a binary operator:

"%IN%" <- function(x, y) interaction(x) %in% interaction(y)

Then

df2 %IN% df1
# [1] FALSE  TRUE FALSE

rbind(df2, df2) %IN% df1
# [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Disclaimer: I modified this answer from an earlier version that used do.call(paste, ...) instead of interaction(...); consult the edit history if you like. I think Arun's claims about "terrible inefficiency" (a bit extreme, IMHO) still hold, but if you want a concise solution that uses only base R and is fast-ish with small-ish data, this is probably it.
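For reference, the do.call(paste, ...) variant mentioned in the disclaimer could look roughly like the sketch below. This is a reconstruction under assumptions, not the original earlier answer: row_key and %IN2% are hypothetical names, and the "\r" separator is assumed never to occur in the data.

```r
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))

# Hypothetical helper: one key per row, built column-wise in a single
# vectorised paste() call; unname() keeps column names from being
# mistaken for paste() arguments such as sep or collapse
row_key <- function(df, sep = "\r") {
  do.call(paste, c(unname(as.list(df)), sep = sep))
}

# Hypothetical operator name, chosen to avoid clashing with %IN% above
"%IN2%" <- function(x, y) row_key(x) %in% row_key(y)

res <- df2 %IN2% df1
print(res)
```

Unlike apply(df, 1, paste, ...), this pastes whole columns at once rather than iterating over rows, which is the usual reason to prefer it.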

answered Oct 02 '22 at 01:10 by flodel