R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

Question

The real life problem: I have subjects with MRI scan data. Some of which have been scanned multiple times (separate rows). Some of those were scanned under different protocols each time. I want to keep all unique rows by subject ID, and if a subject was scanned under two different protocols, I want it to prefer one over the other.

The toy example:

library(dplyr)  
df <- tibble(
        id = c("A", "A", "B", "C", "C", "D"), 
        protocol = c("X", "Y", "X", "X", "X", "Y"),
        date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"), 
                 by="days")),
        var = 1:6)

I want to return a data frame with all unique subjects by id. When it comes to a duplicate value, instead of automatically keeping the first entry, I want it to keep the entry with "Y" as the protocol if it has that choice, but not to get rid of rows with "X" otherwise.

In the example, it would keep rows 2, 3, 4, and 6.

I prefer dplyr, but am open to other suggestions.

Nothing that I've tried even begins to work:

df %>% distinct(id, .keep_all = TRUE) #Nope! 

df %>% distinct(id, protocol == "Y", .keep_all = TRUE) #Nope!  

df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE) #Nope!  

df %>% group_by(id) %>% filter(protocol == "Y") #Nope!

Two good answers: @RobJensen suggests

df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)

If I have multiple protocols and wish to assign an order to which they will be chosen, I can create a new variable where I assign the protocols an integer in order of preference, then use the suggestion from @joran

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)

Thanks!

IceCreamToucan · Accepted Answer

Arranging alphabetically works in the stated simple case, but if you want you can add a protocol_preference variable to give an ordering of what you'd prefer to be selected if Y isn't available, and to select "Y" even if it doesn't happen to be the last protocol value when sorted alphabetically.

Building off @davechilders answer and @Nathan Werth 's idea of creating a factor based on an "order of importance" vector

order_of_importance <- c("Y", "Z", "X")

    df2 %>%
      mutate(protocol = factor(protocol, order_of_importance)) %>%
      arrange(id, protocol) %>%
      distinct(id, .keep_all = TRUE)

Or if you just want to select 'Y' and don't have a preference for what's selected if 'Y' isn't avaialable you can do

df %>% 
    arrange(id, desc(protocol == 'Y')) %>% 
    distinct(id, .keep_all = TRUE)

davechilders · Answer

You can achieve this without using group_by() if you want the output to be a tibble that is not a grouped_df.

df %>% arrange(id, desc(protocol)) %>% distinct(id, .keep_all = TRUE)

joran · Answer

There's probably a faster way (almost certainly with data.table) but this would be the naive direct approach in dplyr I think:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% do(head(.,1))

As @Gregor noted below (now deleted), slice(1) is probably a better idiom for do(head(.,1)).

Nathan Werth · Answer

You could break the process into two steps: grab the must-haves, grab whatever for the other IDs, and combine.

distinct_y <- df %>%
  filter(protocol == "Y") %>%
  distinct(id, .keep_all = TRUE)

distinct_other <- df %>%
  anti_join(distinct_y, "id") %>%
  distinct(id, .keep_all = TRUE)

distinct_combined <- rbind(distinct_y, distinct_other)

If you'd like to generalize it from a "one above all" to an ordering of value, I suggest making protocol a factor.

For example, suppose there are three protocols: X, Y, and Z. Y is the best, Z is better than X, and you only want X if there's nothing better.

# Only difference is the best protocol for C will now be Z.
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"),
               by="days")),
  var = 1:6
)

order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  group_by(id) %>%
  arrange(protocol) %>%
  slice(1)
# # A tibble: 4 x 4
# # Groups: id [4]
#   id    protocol date         var
#   <chr> <fctr>   <date>     <int>
# 1 A     Y        2018-01-02     2
# 2 B     X        2018-01-03     3
# 3 C     Z        2018-01-05     5
# 4 D     Y        2018-01-06     6

R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

Tags:

r

unique

dplyr

distinct-values

Bart

4 Answers

IceCreamToucan

davechilders

joran

Nathan Werth

Recent Activity

Donate For Us

R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

Tags:

r

unique

dplyr

distinct-values

Bart

4 Answers

IceCreamToucan

davechilders

joran

Nathan Werth

Related questions

Recent Activity

Donate For Us