
R remove duplicate rows keeping those with values

Tags:

r

I have a large dataset with four columns: question, id, country and response. The id column contains several duplicates: the duplicated rows refer to the same question, but one of them has a value in the response column and the other is NA. I would like to remove the duplicates and keep the rows where the response column has a value. Note that the values in my dataset are either numeric or character.

I have tried to use distinct() from the dplyr package. However, the problem is that it always keeps the first row of each set of duplicates, regardless of what is in the response column.

Here is my code:

df1 %>% distinct(id, country, .keep_all = TRUE)

I expect to end up with one row per id (no duplicates left) and with no information lost from the response column. See the example below:

    #Initial data frame
    df1  <- read.table(text="question id  country response
                              X1    10  Belgium    40
                              X2    12  Austria    NA
                              X2_1  12  Austria    NEW
                              X4    17  USA        NA
                              X5    17  USA        5
                              X6    NA  Italy      61
                              X7    15  Spain      NA
                              X8    15  Spain      100", header=TRUE, stringsAsFactors=FALSE)


    #Expected Output
    df1  <- read.table(text="question id  country response
                              X1    10  Belgium    40
                              X2_1  12  Austria    NEW
                              X5    17  USA        5
                              X6    NA  Italy      61
                              X8    15  Spain      100", header=TRUE, stringsAsFactors=FALSE)
user9660581 asked May 23 '19 15:05



2 Answers

We can use arrange() to sort the rows so that, within each id/country combination, rows with an NA response come last. distinct() then keeps the first row of each combination, which is the one with a value whenever one exists.

library(dplyr)
df1 %>%
   arrange(id, country, is.na(response)) %>% 
   distinct(id, country, .keep_all = TRUE)
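
The same sort-then-deduplicate idea can be sketched in base R, without dplyr. This is only an illustration: the names `ord`, `sorted` and `result` are made up here, and the empty response for X7 in the question's data is assumed to be NA.

```r
# Example data from the question (X7's missing response is assumed to be NA).
df1 <- read.table(text = "question id country response
X1   10 Belgium 40
X2   12 Austria NA
X2_1 12 Austria NEW
X4   17 USA     NA
X5   17 USA     5
X6   NA Italy   61
X7   15 Spain   NA
X8   15 Spain   100", header = TRUE, stringsAsFactors = FALSE)

# Sort so that, within each id/country pair, rows with a response come first.
ord <- order(df1$id, df1$country, is.na(df1$response))
sorted <- df1[ord, ]

# duplicated() flags every row after the first of each id/country pair,
# so negating it keeps exactly one row per pair.
result <- sorted[!duplicated(sorted[c("id", "country")]), ]
print(result)
```

The rows come back sorted by id rather than in their original order; if the original order matters, `result[order(as.integer(rownames(result))), ]` should restore it for a data frame with default row names.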

If we need to keep every row whose id is NA, without de-duplicating those rows:

df1 %>% 
   arrange(id, country, is.na(response)) %>% 
   group_by(id, country) %>%
   filter(row_number() == 1 | is.na(id))
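
To see what the extra `is.na(id)` condition changes, here is a rough base R sketch on a small hypothetical data frame (`df2` and its rows are invented for illustration) in which two rows share a missing id:

```r
# Hypothetical data: two rows with a missing id in the same country.
df2 <- data.frame(
  question = c("X2", "X2_1", "X6", "X9"),
  id       = c(12, 12, NA, NA),
  country  = c("Austria", "Austria", "Italy", "Italy"),
  response = c(NA, "NEW", "61", "7"),
  stringsAsFactors = FALSE
)

# Within each id/country pair, put rows with a response first.
sorted <- df2[order(df2$id, df2$country, is.na(df2$response)), ]

# Keep the first row of each pair, plus every row whose id is missing.
keep <- !duplicated(sorted[c("id", "country")]) | is.na(sorted$id)
result <- sorted[keep, ]
print(result)
```

Without the `| is.na(sorted$id)` part, X9 would be treated as a duplicate of X6 and dropped, because both have id NA and country Italy.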

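For this example, even simply dropping the rows with an NA response

df1[complete.cases(df1$response),]

or, in tidyverse syntax,

df1 %>% 
    filter(complete.cases(response))

would work, but it may not work in the actual dataset: an id whose responses are all NA would disappear entirely, and an id with several non-NA responses would remain duplicated.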
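
A small sketch of one way the shortcut can go wrong, using a hypothetical data frame `df3` (not from the question) in which id 12 has no response at all:

```r
# Hypothetical data: both rows for id 12 have a missing response.
df3 <- data.frame(
  question = c("X1", "X2", "X2_1"),
  id       = c(10, 12, 12),
  country  = c("Belgium", "Austria", "Austria"),
  response = c("40", NA, NA),
  stringsAsFactors = FALSE
)

# Dropping every row with a missing response...
filtered <- df3[complete.cases(df3$response), ]

# ...removes id 12 from the result altogether.
lost <- setdiff(df3$id, filtered$id)
print(lost)
```

Whether losing such ids is acceptable depends on the real data, which is presumably why the sort-and-deduplicate approach is the safer default.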

akrun answered Nov 05 '22 22:11



A base R solution could be the following.

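# i is FALSE only for rows whose id occurs both earlier and later in the data
i <- !(duplicated(df1$id) & duplicated(df1$id, fromLast = TRUE))
# j is TRUE for rows with a non-missing response
j <- !is.na(df1$response)
# keep the rows that pass both filters
df1[i & j, ]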
#  question id country response
#1       X1 10 Belgium       40
#3     X2_1 12 Austria      NEW
#5       X5 17     USA        5
#6       X6 NA   Italy       61
#8       X8 15   Spain      100
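
A possible variant, offered only as a sketch and not as part of the answer above: if rows with a unique id should survive even when their response is NA, the NA filter can be applied to duplicated ids only. The extra row X9 and the names `dup`, `keep` and `result` are assumptions made for illustration.

```r
# Question's data plus one hypothetical row (X9) with a unique id and no response.
df1 <- read.table(text = "question id country response
X1   10 Belgium 40
X2   12 Austria NA
X2_1 12 Austria NEW
X4   17 USA     NA
X5   17 USA     5
X6   NA Italy   61
X7   15 Spain   NA
X8   15 Spain   100
X9   20 Norway  NA", header = TRUE, stringsAsFactors = FALSE)

# TRUE for every row whose id appears more than once (first copy included).
dup <- duplicated(df1$id) | duplicated(df1$id, fromLast = TRUE)

# Keep rows with a unique id, and rows of duplicated ids that have a response.
keep <- !dup | !is.na(df1$response)
result <- df1[keep, ]
print(result)
```

This still has limits: a duplicated id whose responses are all NA is dropped, and a duplicated id with several non-NA responses keeps all of them.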
Rui Barradas answered Nov 06 '22 00:11
