
How to remove duplicates based on missing data in another column?

I have a dataset that looks like this:

   Study_ID Recurrent_Status
1       100                1
2       100               NA
3       100               NA
4       200                1
5       300               NA
6       400                3
7       400               NA
8       500                3
9       500               NA
10      600               NA
11      700                1

I would like to remove any Study IDs that are duplicates, but keep the entry where there is data for 'recurrent status'. In other words, I want to remove every duplicate study ID where there is NA for 'recurrent status'. Recurrent status is either a value of 1 or 3 (or NA for some unduplicated patients).

My desired output would look something like this:

  Study_ID Recurrent_Status
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1

I've tried to use this code, but it of course removes individuals with a recurrent status of 1 or 3, instead of retaining them.

full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="1")
full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="3")

When I try to remove the exclamation mark, I get this error:

full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="1")

Error: unexpected '=' in "full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="

How can I go about doing this?

Reproducible data:

data<-data.frame(Study_ID=c("100","100","100","200","300","400","400","500","500","600","700"),Recurrent_Status=c("1","NA","NA","1","NA","3","NA","3","NA","NA","1"))
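One caveat about this snippet: it stores the missing values as the string "NA", which R does not treat as missing, so is.na() returns FALSE for every row. A minimal sketch, assuming the same column names as above, that converts those strings to real NA values before any deduplication:

```r
# Same data as above; stringsAsFactors = FALSE keeps both columns as character
data <- data.frame(
  Study_ID = c("100","100","100","200","300","400","400","500","500","600","700"),
  Recurrent_Status = c("1","NA","NA","1","NA","3","NA","3","NA","NA","1"),
  stringsAsFactors = FALSE
)

# The literal string "NA" is not a missing value
sum(is.na(data$Recurrent_Status))   # 0

# Replace the "NA" strings with real missing values
data$Recurrent_Status[data$Recurrent_Status == "NA"] <- NA
sum(is.na(data$Recurrent_Status))   # 6
```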
sabc04 asked Nov 08 '25 03:11


2 Answers

We could arrange by 'Study_ID' and then by whether 'Recurrent_Status' is NA, and then use distinct to keep the first row for each 'Study_ID':

library(dplyr)
data %>% 
  arrange(Study_ID, is.na(Recurrent_Status)) %>%
  distinct(Study_ID, .keep_all = TRUE)

-output

  Study_ID Recurrent_Status
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1
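The ordering step works because is.na() returns FALSE for rows that have a value and TRUE for missing ones, and FALSE sorts before TRUE. The same idea can be sketched in base R; the data frame name `dat` is illustrative, and the sketch assumes the missing values are real NA rather than the string "NA":

```r
dat <- data.frame(
  Study_ID = c(100, 100, 100, 200, 300, 400, 400, 500, 500, 600, 700),
  Recurrent_Status = c(1, NA, NA, 1, NA, 3, NA, 3, NA, NA, 1)
)

# Sort by ID, then put rows with a value (is.na == FALSE) ahead of NA rows
sorted <- dat[order(dat$Study_ID, is.na(dat$Recurrent_Status)), ]

# Keep only the first row for each ID
result <- sorted[!duplicated(sorted$Study_ID), ]
rownames(result) <- NULL
result
```

An ID whose rows are all NA still keeps one row, since the first row of each ID survives regardless of its status.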
akrun answered Nov 10 '25 18:11


Another dplyr option:

df <- read.table(text = "   Study_ID Recurrent_Status
1       100                1
2       100               NA
3       100               NA
4       200                1
5       300               NA
6       400                3
7       400               NA
8       500                3
9       500               NA
10      600               NA
11      700                1", header = TRUE)

library(dplyr)
df %>%
  group_by(Study_ID) %>%
  slice(which.max(!is.na(Recurrent_Status)))
#> # A tibble: 7 × 2
#> # Groups:   Study_ID [7]
#>   Study_ID Recurrent_Status
#>      <int>            <int>
#> 1      100                1
#> 2      200                1
#> 3      300               NA
#> 4      400                3
#> 5      500                3
#> 6      600               NA
#> 7      700                1

Created on 2022-07-18 by the reprex package (v2.0.1)
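The slice() call relies on how which.max() treats a logical vector: TRUE counts as 1, FALSE as 0, and the index of the first maximum is returned. It therefore points at the first non-NA row in a group and falls back to row 1 when every value is NA. A small base R illustration (the vectors and variable names are made up for demonstration):

```r
# Value sits in the second position: the first TRUE is at index 2
first_with_value <- which.max(!is.na(c(NA, 3, NA)))
first_with_value   # 2

# All values missing: every element is FALSE, so the first index is returned
all_missing <- which.max(!is.na(c(NA, NA)))
all_missing        # 1
```

Because exactly one index is returned per group, an ID with several non-NA rows keeps only the first of them.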

Quinten answered Nov 10 '25 17:11


