I have two dates (date1 and date2) and an id variable in a data.frame:
dat <- data.frame(c('2014-02-11', '2014-05-04', '2014-05-22'), c('2014-04-12', '2014-09-22', '2014-07-04'), c('a', 'a', 'b'))
names(dat) <- c('date1', 'date2', 'id')
# Parse the columns into Date objects so they compare chronologically
dat$date1 <- as.Date(dat$date1, format = '%Y-%m-%d')
dat$date2 <- as.Date(dat$date2, format = '%Y-%m-%d')
> dat
       date1      date2 id
1 2014-02-11 2014-04-12  a
2 2014-05-04 2014-09-22  a
3 2014-05-22 2014-07-04  b
I would like to create a new variable var that indicates whether any date2 value within the same id precedes the date1 value for that row (not simply the date2 value immediately preceding it):
> dat
       date1      date2 id var
1 2014-02-11 2014-04-12  a   0
2 2014-05-04 2014-09-22  a   1
3 2014-05-22 2014-07-04  b   0
I've been able to achieve this with the following loop:
library(dplyr)  # filter(), mutate() and %>% come from dplyr

ids <- as.vector(unique(unlist(dat$id)))
dat$var <- as.numeric(0)
for (i in ids) {
  date2s <- as.vector(unlist(filter(dat, id == i)$date2))
  for (j in date2s) {
    # flag rows of this id whose date1 falls after the current date2
    dat <- dat %>% mutate(var = replace(var, (j < date1) & (id == i), 1))
  }
}
However, my data set is quite large, and I would like to achieve this using data.table if possible, though I'm happy to approach this with dplyr if there's an efficient approach.
A suggestion: use by = .EACHI in a self-join on id, building on the self-join suggested by @thelatemail. This assumes dat is a data.table (e.g. after setDT(dat)):
dat[dat, .(date1=i.date1, date2=i.date2, var=any(date2 < i.date1)), by=.EACHI, on=.(id)]
#    id      date1      date2   var
# 1:  a 2014-02-11 2014-04-12 FALSE
# 2:  a 2014-05-04 2014-09-22  TRUE
# 3:  b 2014-05-22 2014-07-04 FALSE
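For anyone who wants to run this end to end, here is a minimal sketch on the question's three rows. It assumes the data.table package is installed; the name res and the final conversion to the 0/1 coding are my own additions for illustration, not part of the line above.

```r
library(data.table)

# Rebuild the question's example as a data.table with real Date columns.
dat <- data.table(
  date1 = as.Date(c("2014-02-11", "2014-05-04", "2014-05-22")),
  date2 = as.Date(c("2014-04-12", "2014-09-22", "2014-07-04")),
  id    = c("a", "a", "b")
)

# Self-join on id. With by = .EACHI, j is evaluated once per row of the
# inner `dat`: `date2` holds every date2 sharing that row's id, while
# `i.date1` and `i.date2` are the row's own values.
res <- dat[dat,
           .(date1 = i.date1, date2 = i.date2, var = any(date2 < i.date1)),
           by = .EACHI, on = .(id)]

# Convert the logical flag to the 0/1 coding used in the question.
res[, var := as.integer(var)]
print(res)
```

The rows come back in the order of the inner table, so res can be compared directly against the desired output in the question.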
Edit: some timing for reference
library(data.table)

set.seed(2L)
N <- 1e5
dat <- data.table(date1=sample(seq(as.Date("1970-01-01"), Sys.Date(), by="1 day"), N, replace=TRUE),
                  date2=sample(seq(as.Date("1970-01-01"), Sys.Date(), by="1 day"), N, replace=TRUE),
                  id=sample(letters, N, replace=TRUE))

# self-join on id, then flag the matching row numbers
dt1 <- copy(dat)
tlmMtd <- function() {
  dt1[, rownum := .I]
  dt1[dt1[dt1, on="id", rownum[i.date2 < date1], allow.cartesian=TRUE], hit := 1]
}

# self-join on id with by=.EACHI
dt2 <- copy(dat)
csMtd <- function() dt2[dt2, .(date1=i.date1, date2=i.date2, var=any(date2 < i.date1)), by=.EACHI, on=.(id)]

# non-equi self-join, counting matches per row
dt3 <- copy(dat)
frankMtd <- function() dt3[, v := .SD[copy(.SD), on=.(id, date2 < date1), .N, by=.EACHI]$N > 0L]

microbenchmark::microbenchmark(
  tlmMtd(),
  csMtd(),
  frankMtd(),
  times=5L)
# Unit: milliseconds
#        expr        min         lq       mean     median         uq       max neval
#    tlmMtd() 18528.9799 18652.2217 23486.4213 19116.8014 21140.5923 39993.511     5
#     csMtd()  3801.2146  3943.6201  4984.6274  5341.4322  5673.6878  6163.182     5
#  frankMtd()   176.4477   177.5576   191.9636   178.9564   182.0311   244.825     5
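Since the non-equi join is the fastest of the three in these timings, here is a rough sketch of that idea applied to the question's small example. It assumes data.table is installed; n_earlier and var_min are names made up for this illustration. The last line is a shortcut of my own rather than one of the benchmarked methods: some date2 in the group precedes date1 exactly when the group's earliest date2 does, so a grouped min() should give the same flag.

```r
library(data.table)

dat <- data.table(
  date1 = as.Date(c("2014-02-11", "2014-05-04", "2014-05-22")),
  date2 = as.Date(c("2014-04-12", "2014-09-22", "2014-07-04")),
  id    = c("a", "a", "b")
)

# Non-equi self-join: for each row, count the rows with the same id whose
# date2 is strictly earlier than this row's date1 (0 when nothing matches).
n_earlier <- dat[dat, on = .(id, date2 < date1), .N, by = .EACHI]$N
dat[, var := as.integer(n_earlier > 0L)]

# Cross-check without a join (an assumption-free identity, not a benchmarked
# method): compare each date1 with the earliest date2 of its id.
dat[, var_min := as.integer(min(date2) < date1), by = id]
print(dat)
```

Both columns should agree with the var column in the question's desired output. I have not timed the grouped min() version against the benchmark above.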