Preferential removal of partial duplicates in a dataframe

Tags: dataframe, r

While removing rows that are duplicates in two particular columns, is it possible to preferentially retain one of the duplicate rows based upon a third column?

Consider the following example:

# Example dataframe.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))
# Output
col.1 col.2 col.3
    1     1     b
    1     1     c
    1     1     a
    2     2     b
    2     2     a
    2     2     b
    3     2     c

I would like to remove rows that are duplicates in both col.1 and col.2, while always keeping the duplicate row that has col.3 == 'a', otherwise having no preference for the duplicate row that is retained. In the case of this example, the resultant data frame would look like this:

# Output.
col.1 col.2 col.3
    1     1     a
    2     2     a
    3     2     c

All help is appreciated!

Lorcán asked May 20 '19 13:05

2 Answers

We can sort on col.3 first, so that rows with 'a' come before the others, and then remove duplicates, which keeps the first row of each col.1/col.2 pair:

d1 <- df[with(df, order(col.3)),]
d1[!duplicated(d1[c(1, 2)]),]
#  col.1 col.2 col.3
#3     1     1     a
#5     2     2     a
#7     3     2     c
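
The sort above works because 'a' happens to sort first alphabetically. A minimal sketch of a variant that does not rely on alphabetical order, assuming the value to keep is stored in a hypothetical variable named `preferred` (a name introduced here for illustration only), sorts on a logical flag instead:

```r
# Example data frame from the question.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))

# Hypothetical name for the value to keep; an assumption for illustration.
preferred <- 'a'

# FALSE sorts before TRUE, so rows holding the preferred value come first.
d2 <- df[order(df$col.3 != preferred), ]

# duplicated() flags every repeat after the first occurrence of a
# (col.1, col.2) pair, so the preferred row survives whenever one exists.
res <- d2[!duplicated(d2[c("col.1", "col.2")]), ]
res
```

On the example data this should keep the same three rows as the approach above. Changing `preferred` to another value switches the preference without touching the rest of the code.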
Sotos answered Oct 04 '22 18:10


Since you want to retain 'a', one option is to sort the rows and take the first row in each group.

library(dplyr)

df %>%
  arrange_all() %>%
  group_by(col.1, col.2) %>%
  slice(1)

#  col.1 col.2 col.3
#  <dbl> <dbl> <fct>
#1     1     1 a    
#2     2     2 a    
#3     3     2 c    

If the preferred col.3 value does not sort first alphabetically, you can specify the order manually with match():

df %>%
  arrange(col.1, col.2, match(col.3, c("a", "b", "c"))) %>%
  group_by(col.1, col.2) %>%
  slice(1)
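
With match(), values left out of the priority vector return NA, and arrange() sorts NA last, so only the preferred values need to be listed. A minimal sketch, assuming a hypothetical priority vector that prefers 'c', then 'a', and using distinct() with .keep_all = TRUE as a shorter stand-in for group_by() plus slice(1):

```r
library(dplyr)

# Example data frame from the question.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))

# Hypothetical preference order; 'b' is omitted on purpose, so match()
# returns NA for it and arrange() places those rows last in each group.
priority <- c("c", "a")

res <- df %>%
  arrange(col.1, col.2, match(col.3, priority)) %>%
  # Keep the first row for each (col.1, col.2) pair, with all columns.
  distinct(col.1, col.2, .keep_all = TRUE)
res
```

Unlike group_by() followed by slice(1), distinct() returns an ungrouped data frame, so no ungroup() call is needed afterwards.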
Ronak Shah answered Oct 04 '22 17:10