How to subset the most recent 12 months of data for each ID in a data frame?

Tags: date, r, subset

I have a data frame representing 15 years of follow-up data from several hundred patients. I want to create a subset of the data frame including the most recent 12 months of data for each patient.

Here is a representative example of my data (including one missing value, because missing data abound in my actual dataset):

# Create example dataset.
example.dat <- data.frame(
  ID = c(1,1,1,1,2,2,2,3,3,3), # patient ID numbers
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06", # follow-up dates
                   "2005-06-14", "2002-11-24", "2009-03-05",
                   "2009-07-20", "2005-09-02", "2006-01-15",
                   "2006-05-18")),
  Cat = c("Yes", "Yes", "No", "Yes", "No", # responses to a categorical variable
          "Yes", "Yes", NA,   "No", "No")
  )

example.dat

This yields the following output:

   ID       Date  Cat
1   1 2000-02-01  Yes
2   1 2004-10-21  Yes
3   1 2005-02-06   No
4   1 2005-06-14  Yes
5   2 2002-11-24   No
6   2 2009-03-05  Yes
7   2 2009-07-20  Yes
8   3 2005-09-02 <NA>
9   3 2006-01-15   No
10  3 2006-05-18   No

I need to subset, for each ID number, the most recent record and all records from the preceding 12 months. The desired output is:

   ID       Date  Cat
2   1 2004-10-21  Yes
3   1 2005-02-06   No
4   1 2005-06-14  Yes
6   2 2009-03-05  Yes
7   2 2009-07-20  Yes
8   3 2005-09-02 <NA>
9   3 2006-01-15   No
10  3 2006-05-18   No

Several questions have already been asked about subsetting by date in R, but they are generally concerned with subsetting data from a specific date or range of dates, not subsetting by ((variable end date) - (time interval)).

Asked by Andrew T on Jul 25 '14 at 15:07


3 Answers

For the sake of completeness, here are two data.table approaches using either subsetting by groups or a non-equi join. In addition, lubridate is used to ensure a period of 12 months is picked even in the case of leap years.

Subsetting by groups

This is essentially the data.table version of docendo discimus' dplyr answer. However, lubridate functions are used for the date arithmetic, because simply subtracting 365 days does not cover the full 12 months requested by the OP when the past year contains a leap day:

library(data.table)
library(lubridate)
setDT(example.dat)[, .SD[Date >= max(Date) %m-% years(1)], by = ID]
   ID       Date Cat
1:  1 2004-10-21 Yes
2:  1 2005-02-06  No
3:  1 2005-06-14 Yes
4:  2 2009-03-05 Yes
5:  2 2009-07-20 Yes
6:  3 2005-09-02  NA
7:  3 2006-01-15  No
8:  3 2006-05-18  No

Non-equi join

With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins:

library(data.table)
library(lubridate)
mDT <- setDT(example.dat)[, max(Date) %m-% years(1), by = ID]
example.dat[example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]]
   ID       Date Cat
1:  1 2004-10-21 Yes
2:  1 2005-02-06  No
3:  1 2005-06-14 Yes
4:  2 2009-03-05 Yes
5:  2 2009-07-20 Yes
6:  3 2005-09-02  NA
7:  3 2006-01-15  No
8:  3 2006-05-18  No

mDT contains the start date of the 12-month period for each ID:

   ID         V1
1:  1 2004-06-14
2:  2 2008-07-20
3:  3 2005-05-18

The non-equi join returns the indices of the rows that fulfill the conditions:

example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]
[1]  2  3  4  6  7  8  9 10

which are then used to finally subset example.dat.
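
For readers without data.table, the same two-step logic (compute a cutoff date per ID, then keep the rows on or after it) can be sketched in base R. This is only an illustration under stated assumptions, not part of the approach above: cutoff_num and keep are names made up here, and the cutoff is computed with seq() rather than %m-%, so a 29 February reference date would roll forward to 1 March instead of back to 28 February.

```r
# Rebuild the example data from the question.
example.dat <- data.frame(
  ID = c(1,1,1,1,2,2,2,3,3,3),
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06",
                   "2005-06-14", "2002-11-24", "2009-03-05",
                   "2009-07-20", "2005-09-02", "2006-01-15",
                   "2006-05-18")),
  Cat = c("Yes", "Yes", "No", "Yes", "No",
          "Yes", "Yes", NA,   "No", "No")
)

# Step 1: one cutoff per ID (latest date minus one calendar year), stored as
# day counts because tapply() would not keep the "Date" class.
cutoff_num <- tapply(as.numeric(example.dat$Date), example.dat$ID, function(d) {
  latest <- as.Date(max(d), origin = "1970-01-01")
  as.numeric(seq(latest, length.out = 2, by = "-1 year")[2])
})

# Step 2: look up each row's cutoff by ID and keep rows on or after it.
keep <- as.numeric(example.dat$Date) >= cutoff_num[as.character(example.dat$ID)]
example.dat[keep, ]
```

On the example data, which(keep) should give the same row indices as the non-equi join: 2 3 4 6 7 8 9 10.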

Comparison of date arithmetic methods

The answers posted so far employ three different methods to find a date 12 months earlier:

  • docendo discimus (the dplyr answer) subtracts 365 days,
  • G. Grothendieck uses seq.Date(),
  • this answer uses years() and %m-%.

The three methods differ when a leap day falls within the period:

library(data.table)
library(lubridate)
mseq <- Vectorize(function(x) seq(x, length = 2L, by = "-1 year")[2L])
data.table(Date = as.Date("2016-02-28") + 0:2)[
  , minus_365d := Date -365][
    , minus_1yr := Date - years()][
      , minus_1yr_m := Date %m-% years()][
        , seq.Date := as_date(mseq(Date))][]
         Date minus_365d  minus_1yr minus_1yr_m   seq.Date
1: 2016-02-28 2015-02-28 2015-02-28  2015-02-28 2015-02-28
2: 2016-02-29 2015-03-01       <NA>  2015-02-28 2015-03-01
3: 2016-03-01 2015-03-02 2015-03-01  2015-03-01 2015-03-01
  • If there is no leap day in the past period, all three methods return the same result (row 1).
  • If a leap day is included in the past period, subtracting 365 days does not fully cover 12 months (row 3), because that period spans 366 days.
  • If the reference date is a leap day, the seq.Date() approach picks the next day, 1 March 2015, as there is no 29 February in 2015. Using lubridate's %m-% rolls the date back to the last day of February, 28 Feb 2015, instead. Plain subtraction, Date - years(), returns NA in this case (column minus_1yr).
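
If lubridate is not available, the roll-back behaviour can be approximated in base R. The following is a rough sketch, not lubridate's implementation: minus_one_year is a hypothetical helper that takes a single Date, and it assumes the only way the day of month can change when shifting by a whole year is the 29 February case.

```r
# Hypothetical helper: subtract one calendar year from a single Date and roll
# back by one day if the target date does not exist (29 Feb in a non-leap year).
minus_one_year <- function(x) {
  y <- seq(x, length.out = 2, by = "-1 year")[2]
  # seq() turns 29 Feb into 1 Mar in a non-leap year; detect the changed
  # day of month and step back one day.
  if (format(y, "%d") != format(x, "%d")) y <- y - 1
  y
}

# Same three reference dates as in the comparison table.
dates <- as.Date("2016-02-28") + 0:2
rolled <- as.Date(
  vapply(seq_along(dates),
         function(i) as.numeric(minus_one_year(dates[i])),
         numeric(1)),
  origin = "1970-01-01"
)
format(rolled)
```

For the three reference dates this should reproduce the minus_1yr_m column: 2015-02-28, 2015-02-28 and 2015-03-01.
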
Answered by Uwe on Nov 14 '22 at 01:11


Here is a base R solution. We have ave operate on the dates as numbers, because given raw "Date" values ave would try to return "Date" values. With numeric input, ave returns 0/1 values, which !! converts to FALSE/TRUE.

 in_last_yr <- function(x) {
    # x holds the dates as day counts since 1970-01-01
    max_date <- as.Date(max(x), origin = "1970-01-01")
    # TRUE for dates later than one calendar year before the latest date
    x > seq(max_date, length.out = 2, by = "-1 year")[2]
 }
 subset(example.dat, !!ave(as.numeric(Date), ID, FUN = in_last_yr))

Update: Improved the method of determining which dates fall within the last year.
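
One detail worth checking is the strict comparison (>) in in_last_yr: a record dated exactly one year before the latest one falls outside the window, whereas the >= used in the other answers keeps it. A small sketch of the difference, using made-up dates and a hypothetical variant in_last_yr_incl:

```r
# The function from the answer (strict comparison).
in_last_yr <- function(x) {
  max_date <- as.Date(max(x), origin = "1970-01-01")
  x > seq(max_date, length.out = 2, by = "-1 year")[2]
}

# Hypothetical variant that also keeps a record exactly one year old.
in_last_yr_incl <- function(x) {
  max_date <- as.Date(max(x), origin = "1970-01-01")
  x >= seq(max_date, length.out = 2, by = "-1 year")[2]
}

# Two made-up dates exactly one calendar year apart, as day counts.
x <- as.numeric(as.Date(c("2004-06-14", "2005-06-14")))

in_last_yr(x)       # the first date sits on the boundary and is dropped
in_last_yr_incl(x)  # the boundary date is kept
```

Which behaviour is wanted depends on whether "the previous 12 months" should include the anniversary date itself.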

Answered by G. Grothendieck on Nov 14 '22 at 03:11


A possible approach using dplyr:

library(dplyr)

example.dat %>% group_by(ID) %>% filter(Date >= max(Date)-365)

#Source: local data frame [8 x 3]
#Groups: ID
#
#  ID       Date Cat
#1  1 2004-10-21 Yes
#2  1 2005-02-06  No
#3  1 2005-06-14 Yes
#4  2 2009-03-05 Yes
#5  2 2009-07-20 Yes
#6  3 2005-09-02  NA
#7  3 2006-01-15  No
#8  3 2006-05-18  No
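
Note that a 365-day window and a 12-month window can disagree when the window spans a leap day. A minimal base R sketch with made-up dates (not taken from the example data) shows a record that the 365-day cutoff drops but a calendar-year cutoff keeps:

```r
# Two made-up visits exactly one calendar year apart, with 29 Feb 2016 in between.
dates <- as.Date(c("2015-03-01", "2016-03-01"))

# Cutoff used by the filter above: latest date minus 365 days.
cutoff_365 <- max(dates) - 365

# Calendar-year cutoff, computed with seq() as in the base R answer.
cutoff_year <- seq(max(dates), length.out = 2, by = "-1 year")[2]

keep_365  <- dates >= cutoff_365   # earlier visit is one day too old
keep_year <- dates >= cutoff_year  # earlier visit is kept
```

For the example data in the question the two cutoffs select the same rows, so the difference only matters for records close to the one-year boundary.
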
Answered by talat on Nov 14 '22 at 01:11