Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

No activity days detection using R

Tags:

date

datetime

r

I have a dataframe loaded in RStudio with information about numerous events (millions).

Each row is an entry of a single event and apart from other information it includes two attributes with date information. The first one contains the date when the event began and the second when it ended. But the events are not sequential so they might overlap in time.

                              fecha                   fecha_fin
7510607 2014-02-13 20:09:59.8270000 2014-02-27 09:55:40.9700000
7510608 2014-02-13 20:10:01.1870000 2014-02-27 09:55:42.5630000
7557931 2014-02-16 05:32:08.6230000 2014-02-16 14:03:19.4970000

What could be the best and most efficient option to find which calendar days had no activity (without any event in process)? Please, keep in my mind that the duration of the events must be taken into consideration.

like image 233
daloman Avatar asked Nov 16 '16 09:11

daloman


2 Answers

I tend to use foverlaps from the data.table package for such cases, e.g.:

library(data.table)
dt <- fread("id,fecha,fecha_fin
7510607,2014-02-01 20:09:59.8270000,2014-02-10 09:55:40.9700000
7510607,2014-02-13 20:09:59.8270000,2014-02-27 09:55:40.9700000
7510608,2014-02-13 20:10:01.1870000,2014-02-27 09:55:42.5630000
7557931,2014-02-16 05:32:08.6230000,2014-02-16 14:03:19.4970000")
setkey(dt, fecha, fecha_fin)
set(dt, j = 1L, value = NULL)
dt <- dt[,lapply(.SD, as.POSIXct, tz = "CET"),.SDcols=1:2]

dt2 <- data.table(fecha=as.POSIXct(seq(min(as.Date(dt$fecha)), max(as.Date(dt$fecha_fin)), "1 day")))[,fecha_fin:=fecha+60*60*24-1]
as.Date(foverlaps(dt2, dt)[is.na(fecha) & is.na(fecha_fin),i.fecha])
# [1] "2014-02-11" "2014-02-12"
like image 161
lukeA Avatar answered Oct 14 '22 06:10

lukeA


Update, with slightly modified code from lukeA:

I hope there is nothing wrong with my benchmarking here...

library(data.table)
library(lubridate)
library(microbenchmark)

# Create dt ---------------------------------------------------------------

size = 99999
# With this size result is an empty set, check smaller sizes like 999 to confirm
# results are same for both functions

create_dt <- function() {
  set.seed(2016)
  dt <- data.table(
    ID = 1:size,
    fecha = sample(
      seq(ymd('2000/01/01'), ymd('2016/11/16'), by="day"),
      size, replace = TRUE)
  )
  dt[, fecha_fin := fecha + sample(1:3, size, replace = TRUE)]
  setkey(dt, fecha, fecha_fin)
  set(dt, j = 1L, value = NULL)
  dt <- dt[,lapply(.SD, as.POSIXct, tz = "CET"),.SDcols=1:2]
}

dt <- create_dt()

# Declare functions -------------------------------------------------------

f_mdz <- function() {
  dt_2 <- data.table(
    fecha = seq(min(dt$fecha), max(dt$fecha_fin), by = '1 day')
  # Function simplified here!!!
  )[, fecha_fin := fecha]
  # ---------------------------
  as.Date(
    foverlaps(dt_2, dt)[is.na(fecha) & is.na(fecha_fin),i.fecha])#,
    # origin = '1970-01-01')
}

f_lukeA <- function() {
  dt2 <- data.table(
    fecha = seq(min(dt$fecha), max(dt$fecha_fin), "1 day")
  )[,fecha_fin:=fecha+60*60*24-1]
  as.Date(
    foverlaps(dt2, dt)[is.na(fecha) & is.na(fecha_fin),i.fecha])
}

# Benchmark! --------------------------------------------------------------

microbenchmark(
  dt_mdz <- f_mdz(),
  dt_lukeA <- f_lukeA(),
  times = 100)

# Unit: milliseconds
#                  expr      min       lq      mean   median       uq      max neval cld
#     dt_mdz <- f_mdz() 46.96793 55.11631  95.59214 60.33659 191.5536 212.4523   100   a
# dt_lukeA <- f_lukeA() 50.57496 56.42464 105.07356 60.81974 194.0779 211.8037   100   a

identical(dt_mdz, dt_lukeA)

Old answer here:

A point of departure (far from being efficient, e.g. row-wise operations on data.table...) for further investigation could be:

library(data.table)
library(lubridate)
library(magrittr)

dt <- data.table(
  ID = c(7510607L, 7510608L, 7557931L),
  fecha = ymd(c('2014-02-15', '2014-02-16', '2014-02-11')),
  fecha_fin = ymd(c('2014-02-27', '2014-02-27', '2014-02-12'))
)
#         ID      fecha  fecha_fin
# 1: 7510607 2014-02-15 2014-02-27
# 2: 7510608 2014-02-16 2014-02-27
# 3: 7557931 2014-02-11 2014-02-12

# Make the data "long"
long_dt <- dt[, .(days = seq(fecha, fecha_fin, by = '1 day')), by = ID]

# Get the diff with days sequence from min to max date
setdiff(
  seq(long_dt[, min(days)], long_dt[, max(days)], by = '1 day'),
  long_dt[, sort(unique(days))]
) %>% as.Date(origin = '1970-01-01')
# [1] "2014-02-13" "2014-02-14"

Please note I have changed your data to actually have two days (2014-02-13 and 2014-02-14) without any activity.

like image 1
m-dz Avatar answered Oct 14 '22 05:10

m-dz