Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In R, how do I split timestamp interval data into regular slots?

Tags:

r

I'm working with data describing events having start and end time. For instance, it could be in a form of:

enter image description here

I'd like to convert this data to a form where I could count occurrences of events in regular intervals in order to be able to answer question of how many events were taking place from 13:15:00 to 13:29:59 and 13:30:00 to 13:45:00 and so on. In the example above the second record could be counted in first slot simply using start time, but wouldn't be counted as 'ongoing' in the second regular interval, even when end time was integrated.

To work with 15 min intervals I came up with a clumsy solution that uses tidyr::uncount to 'expand' dataset to 24*4=96 15 min intervals through the 24h period and then filtering ones that are inside specified intervals.

library(tidyverse)
library(lubridate)
library(magrittr)

df1 <- tibble::tibble(
  id = c(1, 2),
  start_date = c(ymd_hms("2018-12-10 14:45:51", tz = "Australia/Brisbane"), 
                 ymd_hms("2018-12-10 13:29:37", tz = "Australia/Brisbane")),
  end_date = c(ymd_hms("2018-12-10 14:59:04", tz = "Australia/Brisbane"),
               ymd_hms("2018-12-10 14:02:37", tz = "Australia/Brisbane")))

df2 <- df1 %>% 
  mutate(episode = 96) %>% 
  tidyr::uncount(episode, .id = "sequence")

df2$int_start <- rep(
  seq(ymd_hms("2018-12-10 00:00:00", tz = "Australia/Brisbane"), 
      ymd_hms("2018-12-10 23:59:59", tz = "Australia/Brisbane"), 
      by = "15 mins"),
  2)

df2$int_end <- df2$int_start + 899

df2 %<>% 
  filter(int_end > start_date & int_start < end_date )

That gives me data in desired format:

enter image description here

But I have a feeling there must be smarter way to do that. Expansion for instance will not be great solution for very large dataset and/or when the time interval is small (I think). It'll probably be also a pain to extend it to work with multiple days (I think).

In Stata one could also use stsplit command to accomplish something similar. I tried tinkering with survSplit from survival package but ended up with lots of records as well:

df1$status <- 1
df1$start_date <- as.numeric(df1$start_date)
df1$end_date <- as.numeric(df1$end_date)

df3 <- survSplit(Surv(end_date, status) ~., df1,
                 cut=seq(from=as.numeric(as.POSIXct("2018-12-10 00:00:00")), 
                         to=as.numeric(as.POSIXct("2018-12-10 00:00:00")) + 24*60*60, 
                         by=900), 
                 start = "start_int",
                 id="new_id",
                 episode ="episode")

df3$start_int <- as.POSIXct(df3$start_int, origin = "1970-01-01", tz = "Australia/Brisbane")
df3$start_date <- as.POSIXct(df3$start_date, origin = "1970-01-01", tz = "Australia/Brisbane")
df3$end_date <- as.POSIXct(df3$end_date, origin = "1970-01-01", tz = "Australia/Brisbane")

Any pointers to a better way to achieve such task?

like image 538
radek Avatar asked Jan 02 '19 03:01

radek


1 Answers

You might also approach this by thinking of each start_time as adding one active event and each end_time as reducing active events by one. This approach lets you identify the active events at any given instant, and it scales well. (I've used something similar to count millions of events and it's basically instantaneous.)

df2 <- df1 %>%
  gather(type, time, start_date:end_date) %>%
  mutate(event_chg = if_else(type == "start_date", 1, -1)) %>%
  arrange(time) %>%
  mutate(active_events = cumsum(event_chg))

df2
# A tibble: 4 x 5
#     id type       time                event_chg active_events
#  <dbl> <chr>      <dttm>                  <dbl>         <dbl>
#1     2 start_date 2018-12-10 13:29:37         1             1
#2     2 end_date   2018-12-10 14:02:37        -1             0
#3     1 start_date 2018-12-10 14:45:51         1             1
#4     1 end_date   2018-12-10 14:59:04        -1             0

ggplot(df2, aes(time, active_events)) + geom_step()

enter image description here

If you want to also assess the active count at regular intervals, you could integrate those intervals into your output data frame like this:

df2b <- df1 %>%
  gather(type, time, start_date:end_date) %>%
  mutate(event_chg = if_else(type == "start_date", 1, -1)) %>%
  #  NEW SECTION HERE
  bind_rows(data_frame(type = "marker",
               time = seq.POSIXt(ymd_h(2018121013, tz = "Australia/Brisbane"), 
                                 ymd_h(2018121016, tz = "Australia/Brisbane"), 
                                 by  = 15*60), # 15 minutes of seconds = 15*60
               event_chg = 0)) %>% 
  #  END OF NEW SECTION
  arrange(time) %>%
  mutate(active_events = cumsum(event_chg))

Then it's possible to plot those counts directly, or filter the output data frame to see them. In this case, event id 1 occurred entirely between two 15-minute intervals.

ggplot(df2b, aes(time, active_events, label = active_events)) + 
  geom_step() +
  geom_point(data = df2b %>% filter(type == "marker")) +
  geom_text(data = df2b %>% filter(type == "marker"), vjust = -0.5)

enter image description here

like image 199
Jon Spring Avatar answered Oct 01 '22 22:10

Jon Spring