I have a data frame representing 15 years of follow-up data from several hundred patients. I want to create a subset of the data frame including the most recent 12 months of data for each patient.
Here is a representative example of my data (including one missing value, because missing data abound in my actual dataset):
# Create example dataset.
example.dat <- data.frame(
  ID = c(1,1,1,1,2,2,2,3,3,3),                                # patient ID numbers
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06",  # follow-up dates
                   "2005-06-14", "2002-11-24", "2009-03-05",
                   "2009-07-20", "2005-09-02", "2006-01-15",
                   "2006-05-18")),
  Cat = c("Yes", "Yes", "No", "Yes", "No",                    # responses to a categorical variable
          "Yes", "Yes", NA, "No", "No")
)
example.dat
Which yields the following output:
ID Date Cat
1 1 2000-02-01 Yes
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
5 2 2002-11-24 No
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
I need to figure out how to subset, for each ID number, the most recent record and all records from the previous 12 months. The desired output is:
ID Date Cat
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
Several questions have already been asked about subsetting by date in R, but they are generally concerned with subsetting data from a specific date or range of dates, not subsetting by ((variable end date) - (time interval)).
For the sake of completeness, here are two data.table approaches using either subsetting by groups or a non-equi join. In addition, lubridate is used to ensure a period of 12 months is picked even in the case of leap years.
This is essentially the data.table version of docendo discimus' dplyr answer. However, lubridate functions are used for date arithmetic, because simply subtracting 365 days does not cover the full 12 months requested by the OP when the past year contains a leap day:
library(data.table)
library(lubridate)
setDT(example.dat)[, .SD[Date >= max(Date) %m-% years(1)], by = ID]
   ID       Date Cat
1:  1 2004-10-21 Yes
2:  1 2005-02-06  No
3:  1 2005-06-14 Yes
4:  2 2009-03-05 Yes
5:  2 2009-07-20 Yes
6:  3 2005-09-02  NA
7:  3 2006-01-15  No
8:  3 2006-05-18  No
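As a variation on the grouped subset above, the matching row numbers can be collected per group with data.table's `.I` and then used to subset the table once. This is only a sketch: it rebuilds the example data so that it runs on its own, and the names `dt` and `idx` are introduced here for illustration.

```r
library(data.table)
library(lubridate)

# Same example data as in the question, built directly as a data.table.
dt <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06",
                   "2005-06-14", "2002-11-24", "2009-03-05",
                   "2009-07-20", "2005-09-02", "2006-01-15",
                   "2006-05-18")),
  Cat  = c("Yes", "Yes", "No", "Yes", "No",
           "Yes", "Yes", NA, "No", "No")
)

# .I holds the row numbers of the current group within the full table;
# keep those whose Date falls within 12 months of the group's latest Date.
idx <- dt[, .I[Date >= max(Date) %m-% years(1)], by = ID]$V1

# Subset once using the collected row numbers.
dt[idx]
```

The filter condition is the same as in the `.SD` version; only the way the rows are gathered differs.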
With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins:
library(data.table)
library(lubridate)
mDT <- setDT(example.dat)[, max(Date) %m-% years(1), by = ID]
example.dat[example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]]
   ID       Date Cat
1:  1 2004-10-21 Yes
2:  1 2005-02-06  No
3:  1 2005-06-14 Yes
4:  2 2009-03-05 Yes
5:  2 2009-07-20 Yes
6:  3 2005-09-02  NA
7:  3 2006-01-15  No
8:  3 2006-05-18  No
mDT contains the start dates of the 12-month period for each ID:
   ID         V1
1:  1 2004-06-14
2:  2 2008-07-20
3:  3 2005-05-18
The non-equi join returns the indices of the rows which fulfill the conditions:
example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]
[1] 2 3 4 6 7 8 9 10
which are then used to finally subset example.dat.
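The two steps of the non-equi join can also be written with an explicitly named cutoff column, which some readers may find easier to follow than the auto-generated `V1`. The following is a sketch under that assumption; `dt`, `cutoffs`, `start`, and `idx` are names chosen here, not part of the original answer.

```r
library(data.table)
library(lubridate)

# Same example data as in the question, built directly as a data.table.
dt <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06",
                   "2005-06-14", "2002-11-24", "2009-03-05",
                   "2009-07-20", "2005-09-02", "2006-01-15",
                   "2006-05-18")),
  Cat  = c("Yes", "Yes", "No", "Yes", "No",
           "Yes", "Yes", NA, "No", "No")
)

# Step 1: one row per ID with the start of its 12-month window.
cutoffs <- dt[, .(start = max(Date) %m-% years(1)), by = ID]

# Step 2: non-equi join on matching ID and Date >= start;
# which = TRUE returns row numbers of dt instead of the joined table.
idx <- dt[cutoffs, on = .(ID, Date >= start), which = TRUE]

# Step 3: subset the original table with those row numbers.
dt[idx]
```

Returning indices and subsetting afterwards keeps the original `Date` values in the result.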
The answers posted so far employed three different methods to find a date 12 months earlier: seq.Date(), years(), and %m-%.
The three methods differ in case a leap day is included in the period:
library(data.table)
library(lubridate)
mseq <- Vectorize(function(x) seq(x, length = 2L, by = "-1 year")[2L])
data.table(Date = as.Date("2016-02-28") + 0:2)[
  , minus_365d := Date - 365][
  , minus_1yr := Date - years()][
  , minus_1yr_m := Date %m-% years()][
  , seq.Date := as_date(mseq(Date))][]
         Date minus_365d  minus_1yr minus_1yr_m   seq.Date
1: 2016-02-28 2015-02-28 2015-02-28  2015-02-28 2015-02-28
2: 2016-02-29 2015-03-01       <NA>  2015-02-28 2015-03-01
3: 2016-03-01 2015-03-02 2015-03-01  2015-03-01 2015-03-01
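To see why the choice of cutoff matters for the subsetting task, the second row of the table can be turned into a small base R check. This is an illustrative sketch: `visit` is a hypothetical follow-up date, and `cutoff_roll` simply hard-codes the value shown in the minus_1yr_m column so that no package is needed.

```r
# Most recent record falls on a leap day.
latest <- as.Date("2016-02-29")

# Cutoff as computed by the seq()-based approach (see the seq.Date column).
cutoff_seq <- seq(latest, length.out = 2, by = "-1 year")[2]

# Cutoff as reported for %m-% in the table (hard-coded, not recomputed here).
cutoff_roll <- as.Date("2015-02-28")

# Hypothetical follow-up record dated on the last day of February 2015.
visit <- as.Date("2015-02-28")

visit >= cutoff_seq   # record is dropped under the seq() cutoff
visit >= cutoff_roll  # record is kept under the rolled-back cutoff
```

A record on the boundary day is therefore kept or dropped depending on which method defines "12 months earlier".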
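If there is no leap day in the past period, all three methods return the same result (row 1).
If the end date is itself a leap day (row 2), subtracting years() returns NA. The seq.Date() approach picks the next day, 1 March 2015, as there is no 29 February in 2015. Using lubridate's %m-% rolls the date to the last day of February, 28 Feb 2015, instead.
If a leap day lies within the past period (row 3), subtracting 365 days lands one day late, on 2 March 2015, while the three methods agree on 1 March 2015.

Here is a base solution. We have ave operate on dates as numbers, since if we were to pass raw "Date" values, ave would try to return "Date" values. Instead, ave returns 0/1 values and !! converts those to FALSE/TRUE.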
in_last_yr <- function(x) {
  max_date <- as.Date(max(x), "1970-01-01")
  x > seq(max_date, length = 2, by = "-1 year")[2]
}
subset(example.dat, !!ave(as.numeric(Date), ID, FUN = in_last_yr))
Update: Improved the method of determining which days are in the last year.
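For comparison, the same base R idea can be sketched as a plain split/apply/combine, which avoids the numeric round trip through ave. This is not the answer's method; `keep_last_year` and `result` are names introduced here. Note that this sketch uses `>=`, so a record exactly on the cutoff date is kept, whereas `in_last_yr` above uses a strict `>`; on the example data both give the same rows.

```r
example.dat <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06",
                   "2005-06-14", "2002-11-24", "2009-03-05",
                   "2009-07-20", "2005-09-02", "2006-01-15",
                   "2006-05-18")),
  Cat  = c("Yes", "Yes", "No", "Yes", "No",
           "Yes", "Yes", NA, "No", "No")
)

# Keep the rows of one patient's data frame that fall within
# one calendar year of that patient's most recent date.
keep_last_year <- function(d) {
  cutoff <- seq(max(d$Date), length.out = 2, by = "-1 year")[2]
  d[d$Date >= cutoff, ]
}

# Split by patient, filter each piece, and bind the pieces back together.
result <- do.call(rbind, lapply(split(example.dat, example.dat$ID), keep_last_year))
rownames(result) <- NULL
result
```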
A possible approach using dplyr:
library(dplyr)
example.dat %>% group_by(ID) %>% filter(Date >= max(Date)-365)
#Source: local data frame [8 x 3]
#Groups: ID
#
# ID Date Cat
#1 1 2004-10-21 Yes
#2 1 2005-02-06 No
#3 1 2005-06-14 Yes
#4 2 2009-03-05 Yes
#5 2 2009-07-20 Yes
#6 3 2005-09-02 NA
#7 3 2006-01-15 No
#8 3 2006-05-18 No
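On the example data, a fixed 365-day window and a calendar-year window select the same rows. The difference only shows up when a leap day lies inside the window. The sketch below contrasts the two filters on a small made-up data frame (`leap.dat` is hypothetical and not part of the question), combining dplyr with lubridate's `%m-%`.

```r
library(dplyr)
library(lubridate)

# Hypothetical patient with two visits exactly one calendar year apart,
# with 29 Feb 2016 lying between them.
leap.dat <- data.frame(
  ID   = c(1, 1),
  Date = as.Date(c("2015-03-01", "2016-03-01"))
)

# Fixed 365-day window: the cutoff is 2015-03-02, so the earlier visit is dropped.
by_days <- leap.dat %>%
  group_by(ID) %>%
  filter(Date >= max(Date) - 365) %>%
  ungroup()

# Calendar-year window: the cutoff is 2015-03-01, so both visits are kept.
by_months <- leap.dat %>%
  group_by(ID) %>%
  filter(Date >= max(Date) %m-% years(1)) %>%
  ungroup()

nrow(by_days)
nrow(by_months)
```

Which behaviour is wanted depends on whether "12 months" is meant as a calendar period or as a fixed number of days.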