Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

calculate days per year from data frame

Tags:

r

I have a data frame similar to this:

df<-read.csv(text="id;census;startDate;endDate
ZF001;died;16.10.2012;16.05.2015
ZF002;alive;20.10.2013
ZF003;alive;04.11.2013;
ZF004;died;11.11.2013;20.12.2014
ZF005;died;25.11.2013;16.06.2015
ZF006;alive;25.11.2014;
ZF007;survived;02.12.2014;19.01.2015
ZF008;alive;11.12.2014;
ZF009;survived;28.01.2015;12.03.2015", sep=";")

df$startDate<-as.Date(df$startDate, "%d.%m.%Y")
df$endDate<-as.Date(df$endDate, "%d.%m.%Y")

What I need is the following: A new data frame containing how many days per year the proband was part of the study. It should look similar to this:

id     year days
ZF001  2012   77
ZF001  2013  365
ZF001  2014  365
ZF001  2015  135
etc.
like image 304
Johann Horvat Avatar asked Dec 25 '22 16:12

Johann Horvat


1 Answers

I'm assuming you want this only for the dead probands (as the live ones don't have an end date), here's a possible data.table solution which is pretty much self explanatory

library(data.table)
setDT(df)[census == "died", 
          as.data.table(table(year(seq.Date(startDate, endDate, by = "day")))), 
          by = id]
#       id   V1   N
# 1: ZF001 2012  77
# 2: ZF001 2013 365
# 3: ZF001 2014 365
# 4: ZF001 2015 136
# 5: ZF004 2013  51
# 6: ZF004 2014 354
# 7: ZF005 2013  37
# 8: ZF005 2014 365
# 9: ZF005 2015 167

Basically we calculate all the days from the start to the end date per id, then, we are using the year function in order to extract years, and then just calculating frequencies


Or an equivalent dplyr solution

library(dplyr)
df %>% 
  group_by(id) %>% 
  filter(census=='died') %>% 
  do(as.data.frame(table(year(seq.Date(.$startDate, .$endDate, by ='day')))))

Edit per comments: If you want this for all the patients (dead or alive), while for the alive ones you want to use Sys.Date, we could define a simple helper function in that case

dateFunc <- function(x, y){
  if(is.na(y)) {
    as.data.table(table(year(seq.Date(x, Sys.Date(), by = "day"))))                              
  } else as.data.table(table(year(seq.Date(x, y, by = "day"))))
}

setDT(df)[, setNames(dateFunc(startDate, endDate), c("Year", "Days")), by = id]
#        id Year Days
#  1: ZF001 2012   77
#  2: ZF001 2013  365
#  3: ZF001 2014  365
#  4: ZF001 2015  136
#  5: ZF002 2013   73
#  6: ZF002 2014  365
#  7: ZF002 2015  222
#  8: ZF003 2013   58
#  9: ZF003 2014  365
# 10: ZF003 2015  222
# 11: ZF004 2013   51
# 12: ZF004 2014  354
# 13: ZF005 2013   37
# 14: ZF005 2014  365
# 15: ZF005 2015  167
# 16: ZF006 2014   37
# 17: ZF006 2015  222
# 18: ZF007 2014   30
# 19: ZF007 2015   19
# 20: ZF008 2014   21
# 21: ZF008 2015  222
# 22: ZF009 2015   44

Data

df <- structure(list(id = structure(1:9, .Label = c("ZF001", "ZF002", 
"ZF003", "ZF004", "ZF005", "ZF006", "ZF007", "ZF008", "ZF009"
), class = "factor"), census = structure(c(2L, 1L, 1L, 2L, 2L, 
1L, 3L, 1L, 3L), .Label = c("alive", "died", "survived"), class = "factor"), 
    startDate = structure(c(15629, 15998, 16013, 16020, 16034, 
    16399, 16406, 16415, 16463), class = "Date"), endDate = structure(c(16571, 
    NA, NA, 16424, 16602, NA, 16454, NA, 16506), class = "Date")), .Names = c("id", 
"census", "startDate", "endDate"), row.names = c(NA, -9L), class = "data.frame")
like image 152
David Arenburg Avatar answered Jan 15 '23 06:01

David Arenburg