I have a large dataset with four columns: question, id, country and response. The id column contains several duplicates. They refer to the same question, but their responses differ: one of the duplicates has a value and the other is NA. I would like to remove the duplicates and keep the rows where the response column has a value. Note that the values in my data are either numeric or character.
I have tried to use distinct() from the dplyr package. However, it always keeps the first row of each set of duplicates and drops the rest, regardless of what is in the response column.
Here is my code:
df1 %>% distinct(id, country, .keep_all = TRUE)
The output I expect contains only unique id rows (no duplicates left), with no information lost from the response column. See the example below:
#Initial data frame
df1 <- read.table(text="question id country response
X1 10 Belgium 40
X2 12 Austria NA
X2_1 12 Austria NEW
X4 17 USA NA
X5 17 USA 5
X6 NA Italy 61
X7 15 Spain NA
X8 15 Spain 100", header=TRUE, stringsAsFactors=FALSE)
#Expected Output
df1 <- read.table(text="question id country response
X1 10 Belgium 40
X2_1 12 Austria NEW
X5 17 USA 5
X6 NA Italy 61
X8 15 Spain 100", header=TRUE, stringsAsFactors=FALSE)
The distinct() function from the dplyr package keeps only unique/distinct rows of a data frame. If there are duplicate rows, only the first row is preserved. It is an efficient version of the base R function unique().
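To see why this behaviour causes the problem described in the question, here is a minimal sketch on a small made-up data frame (the name toy and its values are hypothetical, not taken from the real dataset). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: id 12 appears twice and the NA response comes first.
toy <- data.frame(
  id       = c(12, 12, 17),
  response = c(NA, "NEW", "5"),
  stringsAsFactors = FALSE
)

# distinct() keeps the first row it meets for each id,
# so for id 12 the NA row survives and the "NEW" row is dropped.
first_kept <- toy %>% distinct(id, .keep_all = TRUE)
print(first_kept)
```

Which row survives therefore depends only on row order, which is why reordering the rows before calling distinct() is one way to control the result.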
We can use arrange() to make sure the NA elements are sorted last, and then apply distinct(), so that distinct() keeps the first row for each combination of the specified columns:
library(dplyr)
df1 %>%
arrange(id, country, is.na(response)) %>%
distinct(id, country, .keep_all = TRUE)
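The reason this works is that is.na(response) is a logical sort key, and FALSE sorts before TRUE, so rows with a value come first within each id/country pair. The sketch below runs the two steps separately on the sample data; it assumes the X7 row has an NA response, and the names sorted and deduped are only illustrative.

```r
library(dplyr)

# Sample data from the question (assumption: the X7 row has an NA response).
df1 <- read.table(text = "question id country response
X1 10 Belgium 40
X2 12 Austria NA
X2_1 12 Austria NEW
X4 17 USA NA
X5 17 USA 5
X6 NA Italy 61
X7 15 Spain NA
X8 15 Spain 100", header = TRUE, stringsAsFactors = FALSE)

# Step 1: within each id/country pair, rows with a response come first,
# because is.na(response) is FALSE for them and FALSE sorts before TRUE.
sorted <- df1 %>% arrange(id, country, is.na(response))

# Step 2: distinct() keeps the first row of each pair, i.e. the filled one.
deduped <- sorted %>% distinct(id, country, .keep_all = TRUE)
print(deduped)
```

Note that the result is ordered by id (with the NA id last) rather than in the original row order.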
If we need to keep every row whose id is NA, without de-duplicating those rows:
df1 %>%
arrange(id, country, is.na(response)) %>%
group_by(id, country) %>%
filter(row_number() == 1 | is.na(id))
For this example, even
df1[complete.cases(df1$response),]
or, in tidyverse syntax,
df1 %>%
filter(complete.cases(response))
would work, but it may not work on the actual dataset: any id whose responses are all NA would be dropped entirely.
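One case where simply dropping NA responses could go wrong is an id that has no filled response at all. The following sketch uses made-up data (toy, with a hypothetical id 20 for France) to compare the two approaches; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: id 20 has only an NA response.
toy <- data.frame(
  id       = c(12, 12, 20),
  country  = c("Austria", "Austria", "France"),
  response = c(NA, "NEW", NA),
  stringsAsFactors = FALSE
)

# Dropping NA responses removes id 20 altogether.
dropped <- toy %>% filter(complete.cases(response))

# Sorting NA last and then de-duplicating keeps one row per id,
# even when that row's response is NA.
kept <- toy %>%
  arrange(id, country, is.na(response)) %>%
  distinct(id, country, .keep_all = TRUE)

print(dropped$id)
print(kept$id)
```

Whether losing such ids matters depends on the real data, so it is worth checking for them before choosing the shorter filter.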
A base R solution could be the following.
i <- !(duplicated(df1$id) & duplicated(df1$id, fromLast = TRUE))
j <- !is.na(df1$response)
df1[i & j, ]
# question id country response
#1 X1 10 Belgium 40
#3 X2_1 12 Austria NEW
#5 X5 17 USA 5
#6 X6 NA Italy 61
#8 X8 15 Spain 100
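Note that in the snippet above a row is flagged by both duplicated() calls only if the same id occurs both before and after it, so when no id appears more than twice i is TRUE everywhere and the result is decided by j alone. A possible variation, sketched below under the assumption that rows with a non-duplicated id should be kept even when their response is NA, flags every member of a duplicate group instead. The names dup and res are illustrative, and the X7 row is assumed to have an NA response.

```r
# Sample data from the question (assumption: the X7 row has an NA response).
df1 <- read.table(text = "question id country response
X1 10 Belgium 40
X2 12 Austria NA
X2_1 12 Austria NEW
X4 17 USA NA
X5 17 USA 5
X6 NA Italy 61
X7 15 Spain NA
X8 15 Spain 100", header = TRUE, stringsAsFactors = FALSE)

# TRUE for every row whose id occurs more than once (first and later copies).
dup <- duplicated(df1$id) | duplicated(df1$id, fromLast = TRUE)

# Keep rows with a unique id, plus duplicated rows that carry a response.
res <- df1[!dup | !is.na(df1$response), ]
print(res)
```

This still drops a duplicated id whose responses are all NA, and it keeps several rows for an id if more than one of them has a value, so it is not a drop-in replacement for the arrange() and distinct() approach.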