Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to delete duplicates from groups in R where group is formed by three columns

Suppose we have a dataframe like this:

PatientID   BookingID   Level1   Level2  Value 
a1           101-A1        1     HBA1C   9.4
a2           102-A2        1     LDL     116
a1           101-A1        1     VLDL    11
a1           101-A1        2     POL     10
a1           102-A1        1     HBA1c   9.4
a2           102-A2        1     VLDL    10
a1           102-A1        1     VLDL    11
a2           103-A2        1     LDL     116
a2           103-A2        1     VLDL    11
a1           102-A1        2     POL     10

The idea is, a patient(Unique Patient ID) can come to lab and get tested for a set of conditions - broader category of test is level1, and their subcategories is level2(level1 has one to many relationship with level2), now the BookingId column for a patient ID can be different attributing to different dates of visits. value column represents values for each test(Patient ID has one to many relationship with BookingID).

So a group is formed by one combination of 'level1', 'patientID' and its 'Value', if the patientID a1 came twice for getting himself tested - 101-A1 and 102-A1 and if all the test values are same, then only one of them remains, if there is any variation like in case of patientID a2 then all the values remains, hence we want an o/p like this:

PatientID   BookingID   Level1   Level2  Value 
a1           101-A1        1     HBA1C    9.4
a2           102-A2        1      LDL     116
a1           101-A1        1      VLDL    11
a1           101-A1        2      POL     10
a2           102-A2        1      VLDL    10
a2           103-A2        1      LDL     116
a2           103-A2        1      VLDL    11

I tried using the following code from Dplyr package, but thats removing stuffs we want:

abcTest1 <- FullData %>% group_by(level1, patientId, value) %>% slice(1)

I cannot seem to do it correctly, please provide inputs

like image 443
Rohan Avatar asked Nov 07 '25 17:11

Rohan


1 Answers

You could also do this:

library(dplyr)
> df
# A tibble: 10 × 5
   PatientID BookingID Level1 Level2 Value
       <chr>     <chr>  <dbl>  <chr> <dbl>
1         a1    101-A1      1  HBA1C   9.4
2         a2    102-A2      1    LDL 116.0
3         a1    101-A1      1   VLDL  11.0
4         a1    101-A1      2    POL  10.0
5         a1    102-A1      1  HBA1c   9.4
6         a2    102-A2      1   VLDL  10.0
7         a1    102-A1      1   VLDL  11.0
8         a2    103-A2      1    LDL 116.0
9         a2    103-A2      1   VLDL  11.0
10        a1    102-A1      2    POL  10.0
> df %>% distinct(PatientID, Level1, Value, .keep_all=TRUE)
# A tibble: 6 × 5
  PatientID BookingID Level1 Level2 Value
      <chr>     <chr>  <dbl>  <chr> <dbl>
1        a1    101-A1      1  HBA1C   9.4
2        a2    102-A2      1    LDL 116.0
3        a1    101-A1      1   VLDL  11.0
4        a1    101-A1      2    POL  10.0
5        a2    102-A2      1   VLDL  10.0
6        a2    103-A2      1   VLDL  11.0

As with @Barker's answer, this doesn't include that last 116.0 row, but that's because your logic doesn't hold. You state that "if the patientID a1 came twice for getting himself tested - 101-A1 and 102-A1 and if all the test values are same, then only one of them remains", which we see in rows 1 and 5, and in your expected out put only row 1 remains. However, the exact same logic holds for rows 2 and 8, yet you want to keep both rows.

New Answer Based on Comments

Okay, I finally understand what you're saying and have been able to decipher the rules needed to properly deduplicate. It's a convoluted process, but would look like this:

library(dplyr)
library(stringr)
library(tidyr)
df %>% 
  group_by(PatientID, BookingID) %>% 
  mutate(Key = paste(Level1, Level2, Value, collapse=";")) %>% 
  ungroup() %>% 
  select(-Level1, -Level2, -Value) %>% 
  distinct(PatientID, Key, .keep_all=TRUE) %>% 
  mutate(Key = str_split(Key, ";")) %>% 
  unnest(Key) %>% 
  separate(Key, into=c("Level1", "Level2", "Value"), 
           sep=" ", remove=TRUE) %>%
  arrange(PatientID, BookingID, Level1, Level2)

# A tibble: 7 × 5
  PatientID BookingID Level1 Level2 Value
*     <chr>     <chr>  <chr>  <chr> <chr>
1        a1    101-A1      1  HBA1C   9.4
2        a1    101-A1      1   VLDL    11
3        a1    101-A1      2    POL    10
4        a2    102-A2      1    LDL   116
5        a2    102-A2      1   VLDL    10
6        a2    103-A2      1    LDL   116
7        a2    103-A2      1   VLDL    11

Basically, we start out by taking the three values that are of most interest to us, Level1, Level2, and Value, and putting into a string where we separate each set of values for each BookingID by a semicolon. This will give us two rows of information for both PatientID == a1 and PatientID == a2, for four rows total. We then deduplicate by only PatientID and Key, but tell R to keep BookingID as well. What we find is that the data is identical for a1, but is different for a2 because Value where Level2 == POL is not the same on both visits. So after we deduplicate, we're left with three rows. We then split Key using the semicolon as the separator, and then separate the Key variable to get our original columns back.

like image 138
tblznbits Avatar answered Nov 10 '25 14:11

tblznbits



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!