While removing rows that are duplicates in two particular columns, is it possible to preferentially retain one of the duplicate rows based upon a third column?
Consider the following example:
# Example data frame.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))
# df printed:
  col.1 col.2 col.3
1     1     1     b
2     1     1     c
3     1     1     a
4     2     2     b
5     2     2     a
6     2     2     b
7     3     2     c
I would like to remove rows that are duplicates in both col.1 and col.2, while always keeping the duplicate row that has col.3 == 'a'; otherwise I have no preference for which duplicate row is retained. For this example, the resulting data frame would look like this:
# Output.
  col.1 col.2 col.3
1     1     1     a
2     2     2     a
3     3     2     c
All help is appreciated!
We can order on col.3 first (so that 'a' sorts to the top of each group) and then remove duplicates, i.e.
d1 <- df[with(df, order(col.3)), ]
d1[!duplicated(d1[c(1, 2)]), ]
#   col.1 col.2 col.3
# 3     1     1     a
# 5     2     2     a
# 7     3     2     c
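This works here because 'a' happens to sort first alphabetically. If the preferred value did not (say it were 'z'), a variant of the same base R idea, ordering on a logical flag rather than on col.3 itself, is sketched below; the names d2 and the column ordering are my own choices, not from the answer above.

```r
# Same df as in the question.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))

# col.3 != 'a' is FALSE for the preferred rows, and FALSE sorts
# before TRUE, so rows with 'a' come first within each group
# regardless of where 'a' falls alphabetically.
d2 <- df[order(df$col.1, df$col.2, df$col.3 != 'a'), ]
d2[!duplicated(d2[c(1, 2)]), ]
#   col.1 col.2 col.3
# 3     1     1     a
# 5     2     2     a
# 7     3     2     c
```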
Since you want to retain 'a', which happens to sort first alphabetically, one option is to arrange all the columns and take the 1st row in each group.
library(dplyr)
df %>%
  arrange_all() %>%
  group_by(col.1, col.2) %>%
  slice(1)
#   col.1 col.2 col.3
#   <dbl> <dbl> <fct>
# 1     1     1 a
# 2     2     2 a
# 3     3     2 c
If the preferred col.3 value does not sort first alphabetically, you can impose the order manually with match():
df %>%
  arrange(col.1, col.2, match(col.3, c("a", "b", "c"))) %>%
  group_by(col.1, col.2) %>%
  slice(1)