Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's an efficient method to extract only the rows with the first occurrence from a data set?

Tags:

r

dplyr

I have a data frame with patient encounters, and want to extract only the oldest encounter for each patient (which can be done using the sequential encounter ID). The code I came up with works, but I'm sure there are more efficient ways to perform this task using dplyr. What approach would you recommend?

Example with 10 encounters for 4 patients:

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

Result (desired and obtained):

    encounter_ID    patient_ID  gender
    1003            855         0
    1022            721         0
    1013            821         1
    1002            423         1

My approach

1) Extract a list of the unique patients

list.patients <- unique(df$patient_ID)

2) Create an empty data frame to receive our output of the first encounter per patient

one.encounter <- data.frame()

3) Go through each patient on the list to extract their first encounter and populate our data frame

for (i in 1:length(list.patients)) {
one.patient <- df %>% filter(patient_ID==list.patients[i])
one.patient.ordered <- one.patient[order(one.patient$encounter_ID),]
first.encounter <- head(one.patient.ordered, n=1)
one.encounter <- rbind(one.encounter, first.encounter)
} 
like image 644
A. Beal Avatar asked Dec 02 '22 11:12

A. Beal


2 Answers

Here is a base R solution, it is possible to do this efficiently without dplyr

duplicated will code the first row it encounters with a certain patient id as FALSE, and all subsequent rows with that same patient id as TRUE (Here, I've reversed that by adding ! before duplicated), so you can use it to select only the first encounter if you've ordered your dataframe by encounter_ID

df <- df[order(df$encounter_ID),] #order dataframe by encounter id
#subset to rows that are not duplicates of a previous encounter for that patient
first <- df[!duplicated(df$patient_ID),]
like image 178
Jan Boyer Avatar answered Jan 11 '23 23:01

Jan Boyer


Since OP asked for an efficient method in terms of execution time, here is a benchmark of the answers in addition to a data.table way.

enter image description here

#Unit: milliseconds
#            expr        min         lq       mean     median         uq        max neval
#          OP(df) 1354.49200 1398.15245 1481.16068 1467.31151 1531.93056 2124.05586   100
#        Mike(df)  587.33074  606.33194  649.87766  621.65719  658.96548 1076.12302   100
#   Fernandes(df)  177.80735  182.97910  206.64074  185.91444  198.83281  430.96393   100
#       `5th`(df)   60.55170   64.98082   77.55248   67.73171   71.54677  208.47656   100
#       SmitM(df)   52.70000   53.93696   59.05506   54.84035   58.92260  175.24284   100
#   Jan_Boyer(df)   30.70666   33.44665   43.04396   34.46983   35.69736  223.02998   100
#  data_table(df)   11.51547   12.38410   14.60907   13.08038   15.25540   43.71229   100
# Moody_dplyr(df)  234.08792  241.02003  260.19283  245.20301  259.82435  517.03117   100
# Moody_baseR(df)   67.05192   72.00578   89.50914   74.64688   77.58169  299.56125   100

code and data

library(microbenchmark)
library(tidyverse)
library(data.table)

n <- 1e6
set.seed(1)
df <- data.frame(encounter_ID = sample(1000:1999, size = n, replace = TRUE), 
                 patient_ID = sample(700:900, n, TRUE), 
                 gender = sample(0:1, n, TRUE))

benchmark <- microbenchmark(
  OP(df),
  Mike(df),
  Fernandes(df),
  `5th`(df),
  SmitM(df),
  Jan_Boyer(df),
  data_table(df),
  Moody_dplyr(df),
  Moody_baseR(df)
)

autoplot(benchmark)

The solutions so far.

Mike <- function(df) {
  df %>%  
    arrange(patient_ID, encounter_ID) %>% 
    group_by(patient_ID) %>% 
    filter(row_number()==1)
}

SmitM <- function(df) {
  df %>% 
    group_by(patient_ID, gender) %>% 
    summarise(encounter_ID = min(encounter_ID))
}

Fernandes <- function(df) {
  x <- dplyr::arrange(df, encounter_ID)
  x[!duplicated(x$patient_ID),]
}

`5th` <- function(df) {
  df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
  df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}

Jan_Boyer <- function(df) {
  df <- df[order(df$encounter_ID),] 
  df[!duplicated(df$patient_ID),]
}

data_table <- function(df) {
  setDT(df, key = 'encounter_ID')
  df[df[, .I[1], by = patient_ID]$V1]
}

OP <- function(df) {
  list.patients <- unique(df$patient_ID)
  one.encounter <- data.frame()

  for (i in 1:length(list.patients)) {
    one.patient <- df %>% filter(patient_ID == list.patients[i])
    one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
    first.encounter <- head(one.patient.ordered, n = 1)
    one.encounter <- rbind(one.encounter, first.encounter)
  } 
}

Moody_dplyr <- function(df) {
  df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
}

Moody_baseR <- function(df) {
  subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
}
like image 45
markus Avatar answered Jan 11 '23 22:01

markus