df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))
> df
  id gender variant
1  1 Female       a
2  1 Female       b
3  1   Male       c
4  2 Female       d
5  2   Male       e
I want to remove duplicate rows in my data.frame according to the gender column in my data set. I know a similar question has been asked (here), but the difference is that I would like to remove duplicate rows within each subset of the data set, where each subset is defined by a unique id.
My desired result is this:
  id gender variant
1  1 Female       a
3  1   Male       c
4  2 Female       d
5  2   Male       e
I've tried the following and it works, but I'm wondering if there's a cleaner, more efficient way of doing this?
out = list()
for(i in 1:2){
  # within this id, keep only the first row for each gender
  df2 <- subset(df, id == i)
  out[[i]] <- df2[!duplicated(df2$gender), ]
}
do.call(rbind.data.frame, out)
Removing duplicate rows based on a single column: dplyr's distinct() function can be used to filter out duplicate rows. We just pass our data frame and the column name(s) as arguments to distinct(). In base R, duplicated() on the id and gender columns does the same job:
df[!duplicated(df[ , c("id","gender")]),]
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
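As a rough sketch of the distinct() route mentioned above, assuming dplyr is installed: distinct() keeps only the columns you list unless .keep_all = TRUE is set, so that argument is needed here to retain variant. The df below is just the example data from the question, rebuilt so the snippet runs on its own; res_distinct is a name made up for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# Keep the first row for each id/gender combination;
# .keep_all = TRUE retains the remaining columns (here, variant).
res_distinct <- distinct(df, id, gender, .keep_all = TRUE)
res_distinct
```

Like the duplicated() version, this keeps the first row it meets for each id/gender pair, so the result depends on the current row order.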
Another way of doing this is with subset, as below:
subset(df, !duplicated(subset(df, select=c(id, gender))))
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Here's a dplyr-based solution in case you are interested (edited to include Gregor's suggestions):
library(dplyr)
group_by(df, id, gender) %>% slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, gender [4]
#> id gender variant
#> <dbl> <fctr> <fctr>
#> 1 1 Female a
#> 2 1 Male c
#> 3 2 Female d
#> 4 2 Male e
It might also be worth using the arrange function first, depending on which values of variant should be removed, since slice(1) keeps whichever row comes first in each group.
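To illustrate the point about arrange, here is a minimal sketch that sorts before slicing. It assumes, purely for the sake of example, that you want to keep the alphabetically last variant within each id/gender group; res_last is a hypothetical name and the df is the question's example data.

```r
library(dplyr)

# Example data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# Sort so the row to keep comes first within each group,
# then take the first row per id/gender combination.
res_last <- df %>%
  arrange(id, gender, desc(variant)) %>%
  group_by(id, gender) %>%
  slice(1) %>%
  ungroup()
res_last
```

With this ordering, the id 1 / Female group keeps variant b rather than a; swapping desc(variant) for variant gives the same rows as the unsorted version on this data.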