The real-life problem: I have subjects with MRI scan data. Some subjects have been scanned multiple times (separate rows), and some of those were scanned under a different protocol each time. I want to keep one row per subject ID, and if a subject was scanned under two different protocols, I want to prefer one protocol over the other.
The toy example:
library(dplyr)
df <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)
I want to return a data frame with one row per unique id. When a subject has duplicate rows, instead of automatically keeping the first entry, it should keep the entry with protocol "Y" when one exists, but it should not drop the "X" rows of subjects that have no "Y" row.
In the example, it would keep rows 2, 3, 4, and 6.
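For concreteness, the desired result, built by hand from rows 2, 3, 4, and 6 of df (expected is just an illustrative name):

expected <- tibble(
  id = c("A", "B", "C", "D"),
  protocol = c("Y", "X", "X", "Y"),
  date = as.Date(c("2018-01-02", "2018-01-03", "2018-01-04", "2018-01-06")),
  var = c(2L, 3L, 4L, 6L)
)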
I prefer dplyr, but am open to other suggestions.
Nothing that I've tried even begins to work:
df %>% distinct(id, .keep_all = TRUE) # Nope! Keeps the first row per id, ignoring protocol
df %>% distinct(id, protocol == "Y", .keep_all = TRUE) # Nope! Adds a logical column, so an id with both protocols still appears twice
df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE) # Nope! distinct() doesn't reorder rows, so the factor levels change nothing
df %>% group_by(id) %>% filter(protocol == "Y") # Nope! Drops subjects that have no "Y" row at all
Two good answers. @RobJensen suggests:
df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)
If I have multiple protocols and want to assign an order in which they should be chosen, I can create a new variable that gives each protocol an integer rank in order of preference, then use the suggestion from @joran (see the sketch after the code):
df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)
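A minimal sketch of that integer-preference idea (the pref lookup vector and the protocol_rank column are illustrative names, not from the answers):

pref <- c(Y = 1, Z = 2, X = 3)  # lower rank = more preferred

df %>%
  mutate(protocol_rank = pref[protocol]) %>%  # look up each protocol's rank
  group_by(id) %>%
  arrange(protocol_rank, var) %>%             # preferred protocol sorts first within each id
  slice(1) %>%                                # keep that preferred row
  select(-protocol_rank)                      # drop the helper column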
Thanks!
Arranging alphabetically works in the stated simple case, but if you want, you can add a protocol_preference variable to give an ordering of what you'd prefer to be selected when "Y" isn't available, and to select "Y" even if it doesn't happen to be the last protocol value when sorted alphabetically.
Building off @davechilders' answer and @Nathan Werth's idea of creating a factor from an "order of importance" vector (df2 is defined in @Nathan Werth's answer below):
order_of_importance <- c("Y", "Z", "X")
df2 %>%
mutate(protocol = factor(protocol, order_of_importance)) %>%
arrange(id, protocol) %>%
distinct(id, .keep_all = TRUE)
Or, if you just want to select 'Y' and have no preference about what's selected when 'Y' isn't available, you can do:
df %>%
arrange(id, desc(protocol == 'Y')) %>%
distinct(id, .keep_all = TRUE)
You can achieve this without using group_by(), if you want the output to be a plain tibble rather than a grouped_df.
df %>% arrange(id, desc(protocol)) %>% distinct(id, .keep_all = TRUE)
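Run on the example data (with protocol still a character column, as originally created), this keeps rows 2, 3, 4, and 6; arrange() uses a stable sort, so C's tie between its two "X" rows resolves to the earlier one:

# # A tibble: 4 x 4
#   id    protocol date         var
#   <chr> <chr>    <date>     <int>
# 1 A     Y        2018-01-02     2
# 2 B     X        2018-01-03     3
# 3 C     X        2018-01-04     4
# 4 D     Y        2018-01-06     6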
There's probably a faster way (almost certainly with data.table), but I think this would be the naive, direct approach in dplyr:
df %>% group_by(id) %>% arrange(desc(protocol),var) %>% do(head(.,1))
As @Gregor noted in a comment below (now deleted), slice(1) is probably a better idiom than do(head(., 1)).
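For reference, a minimal data.table sketch of the same idea (my own translation, not from the original answer): order so "Y" comes first within each id, then keep the first row per id.

library(data.table)

dt <- as.data.table(df)
setorder(dt, id, -protocol)  # descending, so "Y" sorts before "X" within each id
unique(dt, by = "id")        # keeps the first row for each id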
You could break the process into two steps: grab the must-haves, grab whatever is available for the remaining IDs, and combine.
# Step 1: one "Y" row per subject that has one
distinct_y <- df %>%
  filter(protocol == "Y") %>%
  distinct(id, .keep_all = TRUE)

# Step 2: one row each for the subjects with no "Y" row
distinct_other <- df %>%
  anti_join(distinct_y, "id") %>%
  distinct(id, .keep_all = TRUE)

# Combine the two sets
distinct_combined <- rbind(distinct_y, distinct_other)
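If you prefer to stay within dplyr for the combine step, bind_rows() is a drop-in substitute here (a minor variation, not part of the original answer):

distinct_combined <- bind_rows(distinct_y, distinct_other)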
If you'd like to generalize from "one above all" to a full ordering of values, I suggest making protocol a factor.
For example, suppose there are three protocols: X, Y, and Z. Y is the best, Z is better than X, and you only want X if there's nothing better.
# Only difference: the best available protocol for C is now Z.
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)
order_of_importance <- c("Y", "Z", "X")
df2 %>%
mutate(protocol = factor(protocol, order_of_importance)) %>%
group_by(id) %>%
arrange(protocol) %>%
slice(1)
# # A tibble: 4 x 4
# # Groups: id [4]
# id protocol date var
# <chr> <fctr> <date> <int>
# 1 A Y 2018-01-02 2
# 2 B X 2018-01-03 3
# 3 C Z 2018-01-05 5
# 4 D Y 2018-01-06 6