R remove duplicate rows keeping those with values

Tags:

r

I have a large dataset with four columns: question, id, country and response. The id column contains several duplicates. The duplicated rows refer to the same question, but their responses differ: one of the duplicates has a value and the other is NA. I would like to remove the duplicates and keep the rows where the response column has a value. Note that the values in my dataset are either numeric or character.

I have tried to use distinct() from the dplyr package. However, the problem is that it always keeps the first row of each set of duplicates, regardless of what the response column contains.

Here is my code:

df1 %>% distinct(id, country, .keep_all = TRUE)

The output I expect has unique id rows (no duplicates left) and loses no information from the response column. See the example below:

    #Initial data frame
    df1  <- read.table(text="question id  country response
                              X1    10  Belgium    40
                              X2    12  Austria    NA
                              X2_1  12  Austria    NEW
                              X4    17  USA        NA
                              X5    17  USA        5
                              X6    NA  Italy      61
                              X7    15  Spain      NA
                              X8    15  Spain      100", header=TRUE, stringsAsFactors=FALSE)


    #Expected Output
    expected  <- read.table(text="question id  country response
                              X1    10  Belgium    40
                              X2_1  12  Austria    NEW
                              X5    17  USA        5
                              X6    NA  Italy      61
                              X8    15  Spain      100", header=TRUE, stringsAsFactors=FALSE)


asked May 23 '19 15:05

user9660581

2 Answers

We can use arrange to sort the rows so that those with an NA response come last within each id and country group, and then call distinct, which keeps the first row for each combination of the specified columns.

library(dplyr)
df1 %>%
   arrange(id, country, is.na(response)) %>% 
   distinct(id, country, .keep_all = TRUE)
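
To see why the ordering matters, here is a small sketch on the question's example data. It assumes the blank response for X7 is written as NA so that read.table can parse the row. distinct() keeps the first row it meets for each id/country pair, so sorting NA responses to the end of each group decides which row survives.

```r
library(dplyr)

# Example data from the question; X7's blank response is written as NA
# (an assumption made here so that the text parses).
df1 <- read.table(text = "question id country response
X1 10 Belgium 40
X2 12 Austria NA
X2_1 12 Austria NEW
X4 17 USA NA
X5 17 USA 5
X6 NA Italy 61
X7 15 Spain NA
X8 15 Spain 100", header = TRUE, stringsAsFactors = FALSE)

# Without sorting, the first row of each id/country pair is kept,
# even when its response is NA.
unsorted <- df1 %>%
  distinct(id, country, .keep_all = TRUE)
unsorted$question

# is.na(response) is FALSE for real values and TRUE for NA, and FALSE
# sorts before TRUE, so rows with a value come first within each group.
sorted <- df1 %>%
  arrange(id, country, is.na(response)) %>%
  distinct(id, country, .keep_all = TRUE)
sorted$question
```

Note that the result is ordered by id rather than by the original row order, because arrange reorders the whole data frame.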

If all rows whose id is NA should be kept, rather than reduced to one row per group:

df1 %>% 
   arrange(id, country, is.na(response)) %>% 
   group_by(id, country) %>%
   filter(row_number() == 1 | is.na(id))

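For this example, simply dropping the rows with an NA response would also work:

df1[complete.cases(df1$response),]

or, in tidyverse syntax:

df1 %>% 
    filter(complete.cases(response))

However, this may not work on the actual dataset, because it also removes any id whose only response is NA.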
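
As a hedged sketch of one way around that limitation, the filter below keeps every row that has a value and, for groups with no value at all, falls back to the first row. The `toy` data frame is hypothetical and only serves to show a group (id 20) that a plain NA filter would lose.

```r
library(dplyr)

# Hypothetical data: id 20 has no response at all, id 30 is a duplicate pair.
toy <- data.frame(
  question = c("A1", "A2", "B1", "C1", "C2"),
  id       = c(20, 20, 25, 30, 30),
  country  = c("Chile", "Chile", "Peru", "Kenya", "Kenya"),
  response = c(NA, NA, "7", NA, "yes"),
  stringsAsFactors = FALSE
)

# Dropping every NA response loses id 20 entirely.
dropped <- toy %>%
  filter(!is.na(response))
dropped$id

# Keep rows with a value; for groups with no value, keep only the first row.
kept <- toy %>%
  group_by(id, country) %>%
  filter(!is.na(response) | (all(is.na(response)) & row_number() == 1)) %>%
  ungroup()
kept$question
```

If a group holds several different non-NA responses, this sketch keeps all of them, so the id would still appear more than once. Whether that is desirable depends on the actual data.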


answered Nov 05 '22 22:11

akrun

A base R solution could be the following.

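# TRUE for every row whose id occurs more than once
dup <- duplicated(df1$id) | duplicated(df1$id, fromLast = TRUE)
# TRUE for rows that carry a response
has_value <- !is.na(df1$response)
# keep rows with a unique id, plus duplicated rows that have a value
df1[!dup | has_value, ]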
#  question id country response
#1       X1 10 Belgium       40
#3     X2_1 12 Austria      NEW
#5       X5 17     USA        5
#6       X6 NA   Italy       61
#8       X8 15   Spain      100
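
A base R counterpart of the arrange-then-distinct idea can be sketched with order() and duplicated(). This is only an illustration, again assuming X7's blank response is read as NA. Because order() leaves ties in their original order, rows with a value move ahead of NA rows without being shuffled otherwise.

```r
# Example data from the question; X7's blank response is written as NA
# (an assumption made here so that the text parses).
df1 <- read.table(text = "question id country response
X1 10 Belgium 40
X2 12 Austria NA
X2_1 12 Austria NEW
X4 17 USA NA
X5 17 USA 5
X6 NA Italy 61
X7 15 Spain NA
X8 15 Spain 100", header = TRUE, stringsAsFactors = FALSE)

# Put rows with a response first, NA responses last.
ord <- df1[order(is.na(df1$response)), ]

# duplicated() on the key columns flags every repeat after the first
# occurrence, so the first (non-NA, where available) row per key is kept.
res <- ord[!duplicated(ord[c("id", "country")]), ]
res$question
```

Unlike filtering on the response alone, this keeps one row for an id even when all of its responses are NA.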

answered Nov 06 '22 00:11

Rui Barradas
