I have a data frame loaded in RStudio with information about numerous events (millions of rows).
Each row is an entry for a single event and, among other attributes, it includes two date columns: the first contains the date-time when the event began and the second when it ended. The events are not sequential, so they may overlap in time.
                              fecha                   fecha_fin
7510607 2014-02-13 20:09:59.8270000 2014-02-27 09:55:40.9700000
7510608 2014-02-13 20:10:01.1870000 2014-02-27 09:55:42.5630000
7557931 2014-02-16 05:32:08.6230000 2014-02-16 14:03:19.4970000
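In case it helps, a small reproducible slice of the data can be built like this (fractional seconds dropped; the POSIXct class and the "CET" timezone are just assumptions for illustration):
# Illustrative reconstruction of the sample rows above (fractional seconds
# dropped; POSIXct with tz = "CET" is an assumption, not necessarily the real class).
df <- data.frame(
  fecha     = as.POSIXct(c("2014-02-13 20:09:59", "2014-02-13 20:10:01",
                           "2014-02-16 05:32:08"), tz = "CET"),
  fecha_fin = as.POSIXct(c("2014-02-27 09:55:40", "2014-02-27 09:55:42",
                           "2014-02-16 14:03:19"), tz = "CET"),
  row.names = c("7510607", "7510608", "7557931")
)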
What could be the best and most efficient option to find which calendar days had no activity (no event in process)? Please keep in mind that the duration of the events must be taken into account.
I tend to use foverlaps from the data.table package for such cases, e.g.:
library(data.table)

dt <- fread("id,fecha,fecha_fin
7510607,2014-02-01 20:09:59.8270000,2014-02-10 09:55:40.9700000
7510607,2014-02-13 20:09:59.8270000,2014-02-27 09:55:40.9700000
7510608,2014-02-13 20:10:01.1870000,2014-02-27 09:55:42.5630000
7557931,2014-02-16 05:32:08.6230000,2014-02-16 14:03:19.4970000")

set(dt, j = 1L, value = NULL)                                   # drop the id column
dt <- dt[, lapply(.SD, as.POSIXct, tz = "CET"), .SDcols = 1:2]  # parse both date columns
setkey(dt, fecha, fecha_fin)                                    # foverlaps() needs a keyed y

# One candidate row per calendar day, spanning 00:00:00 to 23:59:59
dt2 <- data.table(
  fecha = as.POSIXct(seq(min(as.Date(dt$fecha)), max(as.Date(dt$fecha_fin)), "1 day"))
)[, fecha_fin := fecha + 60*60*24 - 1]

# Days that overlap no event come back with NA in the event columns
as.Date(foverlaps(dt2, dt)[is.na(fecha) & is.na(fecha_fin), i.fecha])
# [1] "2014-02-11" "2014-02-12"
Update, with slightly modified code from lukeA:
I hope there is nothing wrong with my benchmarking here...
library(data.table)
library(lubridate)
library(microbenchmark)
# Create dt ---------------------------------------------------------------
size = 99999
# With this size result is an empty set, check smaller sizes like 999 to confirm
# results are same for both functions
create_dt <- function() {
  set.seed(2016)
  dt <- data.table(
    ID = 1:size,
    fecha = sample(
      seq(ymd('2000/01/01'), ymd('2016/11/16'), by = "day"),
      size, replace = TRUE)
  )
  dt[, fecha_fin := fecha + sample(1:3, size, replace = TRUE)]  # each event lasts 1-3 days
  set(dt, j = 1L, value = NULL)                                 # drop the ID column
  dt <- dt[, lapply(.SD, as.POSIXct, tz = "CET"), .SDcols = 1:2]
  setkey(dt, fecha, fecha_fin)                                  # foverlaps() needs a keyed y
  dt
}
dt <- create_dt()
# Declare functions -------------------------------------------------------
f_mdz <- function() {
  dt_2 <- data.table(
    fecha = seq(min(dt$fecha), max(dt$fecha_fin), by = '1 day')
    # Function simplified here!!!
  )[, fecha_fin := fecha]
  # ---------------------------
  as.Date(
    foverlaps(dt_2, dt)[is.na(fecha) & is.na(fecha_fin), i.fecha])#,
    # origin = '1970-01-01')
}
f_lukeA <- function() {
  dt2 <- data.table(
    fecha = seq(min(dt$fecha), max(dt$fecha_fin), "1 day")
  )[, fecha_fin := fecha + 60*60*24 - 1]
  as.Date(
    foverlaps(dt2, dt)[is.na(fecha) & is.na(fecha_fin), i.fecha])
}
# Benchmark! --------------------------------------------------------------
microbenchmark(
  dt_mdz <- f_mdz(),
  dt_lukeA <- f_lukeA(),
  times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dt_mdz <- f_mdz() 46.96793 55.11631 95.59214 60.33659 191.5536 212.4523 100 a
# dt_lukeA <- f_lukeA() 50.57496 56.42464 105.07356 60.81974 194.0779 211.8037 100 a
identical(dt_mdz, dt_lukeA)
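For completeness, a sketch of the smaller-size check mentioned in the comments above; note that create_dt() reads size from the global environment, so it has to be reassigned before rebuilding dt:
# Re-run the comparison on a smaller data set, where the result is non-empty
# (create_dt() picks up `size` from the global environment).
size <- 999
dt <- create_dt()
identical(f_mdz(), f_lukeA())  # the two functions should return the same days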
Old answer here:
A point of departure for further investigation (far from being efficient, e.g. it does row-wise operations on the data.table) could be:
library(data.table)
library(lubridate)
library(magrittr)
dt <- data.table(
  ID = c(7510607L, 7510608L, 7557931L),
  fecha = ymd(c('2014-02-15', '2014-02-16', '2014-02-11')),
  fecha_fin = ymd(c('2014-02-27', '2014-02-27', '2014-02-12'))
)
# ID fecha fecha_fin
# 1: 7510607 2014-02-15 2014-02-27
# 2: 7510608 2014-02-16 2014-02-27
# 3: 7557931 2014-02-11 2014-02-12
# Make the data "long"
long_dt <- dt[, .(days = seq(fecha, fecha_fin, by = '1 day')), by = ID]
# Get the diff with days sequence from min to max date
setdiff(
  seq(long_dt[, min(days)], long_dt[, max(days)], by = '1 day'),
  long_dt[, sort(unique(days))]
) %>% as.Date(origin = '1970-01-01')
# [1] "2014-02-13" "2014-02-14"
Please note I have changed your data to actually have two days (2014-02-13 and 2014-02-14) without any activity.
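As a rough illustration of why this does not scale to millions of events: long_dt holds one row per event-day, so its row count equals the sum of the event durations in days, which can be estimated before committing to the expansion:
# Rows produced by the "long" expansion: one per event-day (inclusive).
dt[, sum(as.integer(fecha_fin - fecha) + 1L)]
# 27 event-days for the three toy events above; with millions of multi-day
# events this grows far beyond the original number of rows.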