While removing rows that are duplicates in two particular columns, is it possible to preferentially retain one of the duplicate rows based upon a third column?
Consider the following example:
# Example data frame.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))
# df printed:
  col.1 col.2 col.3
1     1     1     b
2     1     1     c
3     1     1     a
4     2     2     b
5     2     2     a
6     2     2     b
7     3     2     c
I would like to remove rows that are duplicates in both col.1 and col.2, while always keeping the duplicate row that has col.3 == 'a'; otherwise I have no preference for which duplicate row is retained. For this example, the resulting data frame would look like this:
# Output.
  col.1 col.2 col.3
1     1     1     a
2     2     2     a
3     3     2     c
All help is appreciated!
We can order on col.3 first (so that 'a' sorts to the top of each group) and then remove duplicates, i.e.
d1 <- df[with(df, order(col.3)), ]
d1[!duplicated(d1[c(1, 2)]), ]
#   col.1 col.2 col.3
# 3     1     1     a
# 5     2     2     a
# 7     3     2     c
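This works here because 'a' happens to sort first alphabetically. If the preferred value did not (say it were 'z'), a variant of the same base R idea, ordering on a logical flag rather than on col.3 itself, is sketched below; the names d2 and the column ordering are my own choices, not from the answer above.

```r
# Same df as in the question.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))

# col.3 != 'a' is FALSE for the preferred rows, and FALSE sorts
# before TRUE, so rows with 'a' come first within each group
# regardless of where 'a' falls alphabetically.
d2 <- df[order(df$col.1, df$col.2, df$col.3 != 'a'), ]
d2[!duplicated(d2[c(1, 2)]), ]
#   col.1 col.2 col.3
# 3     1     1     a
# 5     2     2     a
# 7     3     2     c
```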
Since you want to retain 'a', which happens to sort first alphabetically, one option is to arrange all the columns and take the 1st row in each group.
library(dplyr)
df %>%
  arrange_all() %>%
  group_by(col.1, col.2) %>%
  slice(1)
#   col.1 col.2 col.3
#   <dbl> <dbl> <fct>
# 1     1     1 a
# 2     2     2 a
# 3     3     2 c
If the preferred col.3 value does not sort first alphabetically, you can impose the order manually with match():
df %>%
  arrange(col.1, col.2, match(col.3, c("a", "b", "c"))) %>%
  group_by(col.1, col.2) %>%
  slice(1)