
pair-wise duplicate removal from dataframe [duplicate]

This seems like a simple problem but I can't seem to figure it out. I'd like to remove duplicates from a dataframe (df) if two columns have the same values, even if those values are in the reverse order. What I mean is, say you have the following data frame:

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)

  a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B

If I now look for duplicates with duplicated(), I get the following rows:

df[duplicated(df),]

  a b
3 A B
6 B A
8 C B

However, I would also like to remove the row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?

Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.

Thanks!

user3141121 asked Aug 13 '14 23:08


2 Answers

The other answers use a for loop to assign a value to each and every row. That is not an issue with 100 rows, or even a thousand, but you will be waiting a while with large data on the order of 1M rows.

Stealing from the other linked answer using data.table, you could try something like:

df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]

A comparison benchmark with a larger dataset (df2):

df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]

system.time(
  df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
#   user  system elapsed 
#   0.07    0.00    0.06 

system.time({
  for (i in 1:nrow(df2))
  {
      df2[i, ] = sort(df2[i, ])
  }
  df2[!duplicated(df2),]
}
)
#   user  system elapsed 
#  42.07    0.02   42.09 
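
The question also asks for a way to name the two columns to compare. A minimal sketch of how the pmin/pmax approach above could be wrapped for that purpose follows. The helper name dedup_unordered and its arguments are hypothetical (not from the original answers), and it assumes both key columns can be compared after conversion to character.

```r
# Hypothetical helper: drop rows whose values in two key columns match an
# earlier row, ignoring the order of the two values.
dedup_unordered <- function(df, col1, col2) {
  # Convert to character so factor columns are compared by label
  x <- as.character(df[[col1]])
  y <- as.character(df[[col2]])
  lo <- pmin(x, y)  # element-wise smaller value of each pair
  hi <- pmax(x, y)  # element-wise larger value of each pair
  # Keep the first row for each (lo, hi) pair; other columns are untouched
  df[!duplicated(data.frame(lo, hi)), , drop = FALSE]
}

a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c("A", "B", "B", "C", "A", "A", "B", "B")
df <- data.frame(a, b)

res <- dedup_unordered(df, "a", "b")
print(res)  # keeps rows 1, 2 and 4
```

Because only the two named columns go into the key, any additional columns in df are carried along without affecting which rows count as duplicates.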
thelatemail answered Sep 29 '22 17:09


Extending Ari's answer so that you can specify which columns to check when the data frame also contains other columns:

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)

df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
  a b  c d
1 A A 10 B
2 A B  8 S
3 A B  7 J
4 B C  3 Q
5 B A  2 I
6 B A  6 U
7 C B  4 L
8 C B  5 V

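cols = c(1,2)              # columns to compare, ignoring order
newdf = df[,cols]
for (i in seq_len(nrow(df))){
    # unlist() turns the one-row data frame into a plain vector before sorting
    newdf[i, ] = sort(unlist(df[i,cols]))
}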

df[!duplicated(newdf),]
  a b  c d
1 A A 10 B
2 A B  8 S
4 B C  3 Q
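
The same idea can be written without assigning into the data frame row by row, by building one order-independent key per row. This is only a sketch under stated assumptions: the key columns hold character values, the separator "\r" never occurs in the data, and the column c below is a stand-in for any extra columns.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c("A", "B", "B", "C", "A", "A", "B", "B")
df <- data.frame(a, b, c = 1:8)  # c stands in for any extra column

cols <- c("a", "b")  # columns compared as an unordered set

# as.matrix() on character columns gives a character matrix; apply() then
# sorts the values within each row and pastes them into one key string.
# "\r" is an assumed separator that must not appear in the data.
key <- apply(as.matrix(df[, cols]), 1,
             function(r) paste(sort(r), collapse = "\r"))

res <- df[!duplicated(key), ]
print(res)  # keeps rows 1, 2 and 4
```

Since each row is sorted as a whole, this form also works when cols names more than two columns.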
rnso answered Sep 29 '22 17:09