Suppose we have a dataframe like this:
PatientID  BookingID  Level1  Level2  Value
a1         101-A1     1       HBA1C   9.4
a2         102-A2     1       LDL     116
a1         101-A1     1       VLDL    11
a1         101-A1     2       POL     10
a1         102-A1     1       HBA1C   9.4
a2         102-A2     1       VLDL    10
a1         102-A1     1       VLDL    11
a2         103-A2     1       LDL     116
a2         103-A2     1       VLDL    11
a1         102-A1     2       POL     10
The idea is that a patient (unique PatientID) can come to the lab and get tested for a set of conditions. Level1 is the broader category of a test and Level2 is its subcategory (Level1 has a one-to-many relationship with Level2). The BookingID for a given PatientID can differ across rows because of different visit dates (PatientID has a one-to-many relationship with BookingID), and the Value column holds the result of each test.
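For reference, the example data can be reconstructed like this (a minimal sketch using tribble() from the tibble package; the object name df matches the one used in the answer below):

library(tibble)

# Rebuild the example dataframe shown above
df <- tribble(
  ~PatientID, ~BookingID, ~Level1, ~Level2, ~Value,
  "a1", "101-A1", 1, "HBA1C", 9.4,
  "a2", "102-A2", 1, "LDL",   116,
  "a1", "101-A1", 1, "VLDL",  11,
  "a1", "101-A1", 2, "POL",   10,
  "a1", "102-A1", 1, "HBA1C", 9.4,
  "a2", "102-A2", 1, "VLDL",  10,
  "a1", "102-A1", 1, "VLDL",  11,
  "a2", "103-A2", 1, "LDL",   116,
  "a2", "103-A2", 1, "VLDL",  11,
  "a1", "102-A1", 2, "POL",   10
)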
A group is formed by one combination of Level1, PatientID, and Value. If patient a1 came in twice (bookings 101-A1 and 102-A1) and all the test values are the same, only one of those visits should remain; if there is any variation, as with patient a2, all the rows remain. So we want output like this:
PatientID  BookingID  Level1  Level2  Value
a1         101-A1     1       HBA1C   9.4
a2         102-A2     1       LDL     116
a1         101-A1     1       VLDL    11
a1         101-A1     2       POL     10
a2         102-A2     1       VLDL    10
a2         103-A2     1       LDL     116
a2         103-A2     1       VLDL    11
I tried the following code using the dplyr package, but it removes rows we want to keep:
abcTest1 <- FullData %>% group_by(Level1, PatientID, Value) %>% slice(1)
I cannot seem to get this right; any input would be appreciated.
You could also do this:
library(dplyr)
> df
# A tibble: 10 × 5
   PatientID BookingID Level1 Level2 Value
   <chr>     <chr>      <dbl> <chr>  <dbl>
 1 a1        101-A1         1 HBA1C    9.4
 2 a2        102-A2         1 LDL    116.0
 3 a1        101-A1         1 VLDL    11.0
 4 a1        101-A1         2 POL     10.0
 5 a1        102-A1         1 HBA1C    9.4
 6 a2        102-A2         1 VLDL    10.0
 7 a1        102-A1         1 VLDL    11.0
 8 a2        103-A2         1 LDL    116.0
 9 a2        103-A2         1 VLDL    11.0
10 a1        102-A1         2 POL     10.0
> df %>% distinct(PatientID, Level1, Value, .keep_all=TRUE)
# A tibble: 6 × 5
  PatientID BookingID Level1 Level2 Value
  <chr>     <chr>      <dbl> <chr>  <dbl>
1 a1        101-A1         1 HBA1C    9.4
2 a2        102-A2         1 LDL    116.0
3 a1        101-A1         1 VLDL    11.0
4 a1        101-A1         2 POL     10.0
5 a2        102-A2         1 VLDL    10.0
6 a2        103-A2         1 VLDL    11.0
As with @Barker's answer, this doesn't include that last 116.0 row, but that's because your stated logic doesn't hold. You say that "if the patientID a1 came twice for getting himself tested - 101-A1 and 102-A1 and if all the test values are same, then only one of them remains", which we see in rows 1 and 5, and in your expected output only row 1 remains. However, exactly the same logic applies to rows 2 and 8, yet you want to keep both of those rows.
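You can see the contradiction directly with a quick check on the df above (rows 2 and 8 are the two LDL results for a2):

df %>% filter(PatientID == "a2", Level2 == "LDL")
# Both bookings (102-A2 and 103-A2) have LDL = 116, so the
# "same value on both visits, keep only one" rule would drop
# one of these rows too.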
Okay, I finally understand what you're saying and have been able to work out the rules needed to deduplicate properly. It's a convoluted process, but it would look like this:
library(dplyr)
library(stringr)
library(tidyr)

df %>%
  # Collapse each booking's tests into a single key string,
  # e.g. "1 HBA1C 9.4;1 VLDL 11;2 POL 10"
  group_by(PatientID, BookingID) %>%
  mutate(Key = paste(Level1, Level2, Value, collapse = ";")) %>%
  ungroup() %>%
  select(-Level1, -Level2, -Value) %>%
  # Drop bookings whose full set of results duplicates an
  # earlier booking for the same patient
  distinct(PatientID, Key, .keep_all = TRUE) %>%
  # Split the key back out into one row per test...
  mutate(Key = str_split(Key, ";")) %>%
  unnest(Key) %>%
  # ...and recover the original three columns
  separate(Key, into = c("Level1", "Level2", "Value"),
           sep = " ", remove = TRUE) %>%
  arrange(PatientID, BookingID, Level1, Level2)
# A tibble: 7 × 5
  PatientID BookingID Level1 Level2 Value
* <chr>     <chr>     <chr>  <chr>  <chr>
1 a1        101-A1    1      HBA1C  9.4
2 a1        101-A1    1      VLDL   11
3 a1        101-A1    2      POL    10
4 a2        102-A2    1      LDL    116
5 a2        102-A2    1      VLDL   10
6 a2        103-A2    1      LDL    116
7 a2        103-A2    1      VLDL   11
Basically, we start by taking the three values of most interest to us, Level1, Level2, and Value, and pasting them into a single string per BookingID, with each set of values separated by a semicolon. This gives us one key string per booking: two for PatientID == a1 and two for PatientID == a2, four in total. We then deduplicate on PatientID and Key only, but tell R to keep BookingID as well. The key is identical for a1's two bookings, but differs between a2's bookings because the Value where Level2 == VLDL is not the same on both visits (10 vs 11). So after deduplicating we're left with three rows. Finally we split Key on the semicolon, unnest it so each test is back on its own row, and separate it to recover the original columns.
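If you want to see the intermediate Key before it gets split back apart, you can stop the pipeline after the distinct() step (the key strings shown in the comments are derived from the example data above):

df %>%
  group_by(PatientID, BookingID) %>%
  mutate(Key = paste(Level1, Level2, Value, collapse = ";")) %>%
  ungroup() %>%
  distinct(PatientID, Key, .keep_all = TRUE) %>%
  select(PatientID, BookingID, Key)
# PatientID BookingID Key
# a1        101-A1    1 HBA1C 9.4;1 VLDL 11;2 POL 10
# a2        102-A2    1 LDL 116;1 VLDL 10
# a2        103-A2    1 LDL 116;1 VLDL 11

One thing to keep in mind: paste(..., collapse = ";") is order-sensitive, so this relies on the tests appearing in the same row order on every visit. If they might not, you would want to arrange(Level1, Level2) before building the key.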