Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

The real life problem: I have subjects with MRI scan data. Some of which have been scanned multiple times (separate rows). Some of those were scanned under different protocols each time. I want to keep all unique rows by subject ID, and if a subject was scanned under two different protocols, I want it to prefer one over the other.

The toy example:

library(dplyr)  
df <- tibble(
        id = c("A", "A", "B", "C", "C", "D"), 
        protocol = c("X", "Y", "X", "X", "X", "Y"),
        date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"), 
                 by="days")),
        var = 1:6)

I want to return a data frame with all unique subjects by id. When it comes to a duplicate value, instead of automatically keeping the first entry, I want it to keep the entry with "Y" as the protocol if it has that choice, but not to get rid of rows with "X" otherwise.

In the example, it would keep rows 2, 3, 4, and 6.

I prefer dplyr, but am open to other suggestions.

Nothing that I've tried even begins to work:

df %>% distinct(id, .keep_all = TRUE) #Nope! 

df %>% distinct(id, protocol == "Y", .keep_all = TRUE) #Nope!  

df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE) #Nope!  

df %>% group_by(id) %>% filter(protocol == "Y") #Nope!

Two good answers: @RobJensen suggests

df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)  

If I have multiple protocols and wish to assign an order to which they will be chosen, I can create a new variable where I assign the protocols an integer in order of preference, then use the suggestion from @joran

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)  

Thanks!

like image 967
Bart Avatar asked Jan 22 '18 20:01

Bart


4 Answers

Arranging alphabetically works in the stated simple case, but if you want you can add a protocol_preference variable to give an ordering of what you'd prefer to be selected if Y isn't available, and to select "Y" even if it doesn't happen to be the last protocol value when sorted alphabetically.

Building off @davechilders answer and @Nathan Werth 's idea of creating a factor based on an "order of importance" vector

order_of_importance <- c("Y", "Z", "X")

    df2 %>%
      mutate(protocol = factor(protocol, order_of_importance)) %>%
      arrange(id, protocol) %>%
      distinct(id, .keep_all = TRUE)

Or if you just want to select 'Y' and don't have a preference for what's selected if 'Y' isn't avaialable you can do

df %>% 
    arrange(id, desc(protocol == 'Y')) %>% 
    distinct(id, .keep_all = TRUE)
like image 84
IceCreamToucan Avatar answered Oct 11 '22 07:10

IceCreamToucan


You can achieve this without using group_by() if you want the output to be a tibble that is not a grouped_df.

df %>% arrange(id, desc(protocol)) %>% distinct(id, .keep_all = TRUE)
like image 44
davechilders Avatar answered Oct 11 '22 09:10

davechilders


There's probably a faster way (almost certainly with data.table) but this would be the naive direct approach in dplyr I think:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% do(head(.,1))

As @Gregor noted below (now deleted), slice(1) is probably a better idiom for do(head(.,1)).

like image 33
joran Avatar answered Oct 11 '22 07:10

joran


You could break the process into two steps: grab the must-haves, grab whatever for the other IDs, and combine.

distinct_y <- df %>%
  filter(protocol == "Y") %>%
  distinct(id, .keep_all = TRUE)

distinct_other <- df %>%
  anti_join(distinct_y, "id") %>%
  distinct(id, .keep_all = TRUE)

distinct_combined <- rbind(distinct_y, distinct_other)

If you'd like to generalize it from a "one above all" to an ordering of value, I suggest making protocol a factor.

For example, suppose there are three protocols: X, Y, and Z. Y is the best, Z is better than X, and you only want X if there's nothing better.

# Only difference is the best protocol for C will now be Z.
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"),
               by="days")),
  var = 1:6
)

order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  group_by(id) %>%
  arrange(protocol) %>%
  slice(1)
# # A tibble: 4 x 4
# # Groups: id [4]
#   id    protocol date         var
#   <chr> <fctr>   <date>     <int>
# 1 A     Y        2018-01-02     2
# 2 B     X        2018-01-03     3
# 3 C     Z        2018-01-05     5
# 4 D     Y        2018-01-06     6
like image 37
Nathan Werth Avatar answered Oct 11 '22 08:10

Nathan Werth