How to remove duplicates based on missing data in another column?

Question

I have a dataset that looks like this:

   Study_ID Recurrent_Status
1       100                1
2       100               NA
3       100               NA
4       200                1
5       300               NA
6       400                3
7       400               NA
8       500                3
9       500               NA
10      600               NA
11      700                1

I would like to remove any Study IDs that are duplicates, but keep the entry where there is data for 'recurrent status'. In other words, I want to remove every duplicate study ID where there is NA for 'recurrent status'. Recurrent status is either a value of 1 or 3 (or NA for some unduplicated patients).

My desired output would look something like this:

  Study_ID Recurrent_Status
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1

I've tried to use this code, but it of course removes individuals with a recurrent status of 1 or 3, instead of retaining them.

full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="1")
full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="3")

When I try to remove the exclamation mark, I get this error:

full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="1")

Error: unexpected '=' in "full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="

How can I go about doing this?

Reproducible data:

data<-data.frame(Study_ID=c("100","100","100","200","300","400","400","500","500","600","700"),Recurrent_Status=c("1","NA","NA","1","NA","3","NA","3","NA","NA","1"))

akrun · Accepted Answer

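We could arrange by 'Study_ID' and then by is.na(Recurrent_Status), which puts the non-NA rows first within each ID (FALSE sorts before TRUE), and then use distinct with .keep_all = TRUE to keep only the first row per 'Study_ID'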

library(dplyr)
data %>% 
  arrange(Study_ID, is.na(Recurrent_Status)) %>%
  distinct(Study_ID, .keep_all = TRUE)

-output

  Study_ID Recurrent_Status
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1
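
One caveat, as a hedged side note: the reproducible data in the question stores the missing values as the string "NA", not as a real NA, so is.na() is FALSE for every row. The result still looks right on that sample, apparently only because the non-missing row already comes first within each Study_ID. A minimal sketch, assuming the strings should be treated as missing, converts them with dplyr's na_if() first (the name `cleaned` is just illustrative):

```r
library(dplyr)

# Reproducible data from the question: note the "NA" values are strings
data <- data.frame(
  Study_ID = c("100", "100", "100", "200", "300", "400",
               "400", "500", "500", "600", "700"),
  Recurrent_Status = c("1", "NA", "NA", "1", "NA", "3",
                       "NA", "3", "NA", "NA", "1")
)

# No value is actually missing yet, so this count is 0
n_missing_before <- sum(is.na(data$Recurrent_Status))

cleaned <- data %>%
  # Turn the string "NA" into a real missing value
  mutate(Recurrent_Status = na_if(Recurrent_Status, "NA")) %>%
  # Non-missing rows sort first within each Study_ID
  arrange(Study_ID, is.na(Recurrent_Status)) %>%
  # Keep the first row per Study_ID
  distinct(Study_ID, .keep_all = TRUE)

print(cleaned)
```

With real NA values in place, the ordering no longer depends on how the rows happen to be arranged in the input.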

Quinten · Answer

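Another dplyr option: group by Study_ID and slice the first row that has a non-NA Recurrent_Status. which.max() returns the position of the first TRUE and falls back to 1 when every value in the group is NA, so an ID with only NA rows still keeps one row: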

df <- read.table(text = "   Study_ID Recurrent_Status
1       100                1
2       100               NA
3       100               NA
4       200                1
5       300               NA
6       400                3
7       400               NA
8       500                3
9       500               NA
10      600               NA
11      700                1", header = TRUE)

library(dplyr)
df %>%
  group_by(Study_ID) %>%
  slice(which.max(!is.na(Recurrent_Status)))
#> # A tibble: 7 × 2
#> # Groups:   Study_ID [7]
#>   Study_ID Recurrent_Status
#>      <int>            <int>
#> 1      100                1
#> 2      200                1
#> 3      300               NA
#> 4      400                3
#> 5      500                3
#> 6      600               NA
#> 7      700                1

Created on 2022-07-18 by the reprex package (v2.0.1)
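
To make the which.max() behaviour in this answer concrete, here is a small sketch on a made-up two-ID data frame (the names `toy`, `res`, `first_non_na` and `all_na` are illustrative only). It also adds ungroup(), on the assumption that you do not want the result to stay grouped for later steps:

```r
library(dplyr)

# Position of the first non-NA value
first_non_na <- which.max(!is.na(c(NA, 3, NA)))

# With no non-NA value at all, which.max() returns the first position
all_na <- which.max(!is.na(c(NA, NA)))

# Toy data: ID 400 has its value on the second row, ID 600 has only NA
toy <- data.frame(
  Study_ID = c(400, 400, 600, 600),
  Recurrent_Status = c(NA, 3, NA, NA)
)

res <- toy %>%
  group_by(Study_ID) %>%
  slice(which.max(!is.na(Recurrent_Status))) %>%
  ungroup()  # drop the grouping so later verbs work on the whole table

print(res)
```

Unlike the arrange/distinct approach, this does not reorder the rows, so the IDs come out in group order.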

Tags: r, duplicates, filter, dplyr

sabc04
