I have a data frame with patient encounters, and want to extract only the oldest encounter for each patient (which can be done using the sequential encounter ID). The code I came up with works, but I'm sure there are more efficient ways to perform this task using dplyr. What approach would you recommend?
Example with 10 encounters for 4 patients:
encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)
Result (desired and obtained):
encounter_ID patient_ID gender
1003 855 0
1022 721 0
1013 821 1
1002 423 1
My approach
1) Extract a list of the unique patients
list.patients <- unique(df$patient_ID)
2) Create an empty data frame to receive our output of the first encounter per patient
one.encounter <- data.frame()
3) Go through each patient on the list to extract their first encounter and populate our data frame
for (i in 1:length(list.patients)) {
one.patient <- df %>% filter(patient_ID==list.patients[i])
one.patient.ordered <- one.patient[order(one.patient$encounter_ID),]
first.encounter <- head(one.patient.ordered, n=1)
one.encounter <- rbind(one.encounter, first.encounter)
}
Here is a base R solution, it is possible to do this efficiently without dplyr
duplicated
will code the first row it encounters with a certain patient id as FALSE
, and all subsequent rows with that same patient id as TRUE
(Here, I've reversed that by adding !
before duplicated
), so you can use it to select only the first encounter if you've ordered your dataframe by encounter_ID
df <- df[order(df$encounter_ID),] #order dataframe by encounter id
#subset to rows that are not duplicates of a previous encounter for that patient
first <- df[!duplicated(df$patient_ID),]
Since OP asked for an efficient method in terms of execution time, here is a benchmark of the answers in addition to a data.table
way.
#Unit: milliseconds
# expr min lq mean median uq max neval
# OP(df) 1354.49200 1398.15245 1481.16068 1467.31151 1531.93056 2124.05586 100
# Mike(df) 587.33074 606.33194 649.87766 621.65719 658.96548 1076.12302 100
# Fernandes(df) 177.80735 182.97910 206.64074 185.91444 198.83281 430.96393 100
# `5th`(df) 60.55170 64.98082 77.55248 67.73171 71.54677 208.47656 100
# SmitM(df) 52.70000 53.93696 59.05506 54.84035 58.92260 175.24284 100
# Jan_Boyer(df) 30.70666 33.44665 43.04396 34.46983 35.69736 223.02998 100
# data_table(df) 11.51547 12.38410 14.60907 13.08038 15.25540 43.71229 100
# Moody_dplyr(df) 234.08792 241.02003 260.19283 245.20301 259.82435 517.03117 100
# Moody_baseR(df) 67.05192 72.00578 89.50914 74.64688 77.58169 299.56125 100
code and data
library(microbenchmark)
library(tidyverse)
library(data.table)
n <- 1e6
set.seed(1)
df <- data.frame(encounter_ID = sample(1000:1999, size = n, replace = TRUE),
patient_ID = sample(700:900, n, TRUE),
gender = sample(0:1, n, TRUE))
benchmark <- microbenchmark(
OP(df),
Mike(df),
Fernandes(df),
`5th`(df),
SmitM(df),
Jan_Boyer(df),
data_table(df),
Moody_dplyr(df),
Moody_baseR(df)
)
autoplot(benchmark)
The solutions so far.
Mike <- function(df) {
df %>%
arrange(patient_ID, encounter_ID) %>%
group_by(patient_ID) %>%
filter(row_number()==1)
}
SmitM <- function(df) {
df %>%
group_by(patient_ID, gender) %>%
summarise(encounter_ID = min(encounter_ID))
}
Fernandes <- function(df) {
x <- dplyr::arrange(df, encounter_ID)
x[!duplicated(x$patient_ID),]
}
`5th` <- function(df) {
df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}
Jan_Boyer <- function(df) {
df <- df[order(df$encounter_ID),]
df[!duplicated(df$patient_ID),]
}
data_table <- function(df) {
setDT(df, key = 'encounter_ID')
df[df[, .I[1], by = patient_ID]$V1]
}
OP <- function(df) {
list.patients <- unique(df$patient_ID)
one.encounter <- data.frame()
for (i in 1:length(list.patients)) {
one.patient <- df %>% filter(patient_ID == list.patients[i])
one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
first.encounter <- head(one.patient.ordered, n = 1)
one.encounter <- rbind(one.encounter, first.encounter)
}
}
Moody_dplyr <- function(df) {
df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
}
Moody_baseR <- function(df) {
subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With