Chunk continuous timeseries data into non-continuous time windows for multiple time periods and multiple groups

I have two datasets: df1 contains windows of time that represent peak activity by id. These are non-continuous time series, and there are multiple windows (events) per id, i.e. each id has multiple peak activity periods. Below is a reproducible example I made up; it is not the real data (NOTE: I updated the data according to the comments below).

df1<-data.frame(start_date=seq(as.POSIXct("2014-09-04 00:00:00"), by = "hour", length.out = 10),
                end_date=seq(as.POSIXct("2014-09-04 05:00:00"), by = "hour", length.out = 10),
                             values=runif(20,10,50),id=rep(seq(from=1,to=5,by=1),2))

df2 is a set of continuous time series of activity by id. I want to subset df2 by date.date for each entry/peak activity window in df1 (by id).

date1<-data.frame(date=seq(as.POSIXct("2012-09-04 02:00:00"), by = "hour", length.out = 20), id=1)
date2<-data.frame(date=seq(as.POSIXct("2014-09-03 07:00:00"), by = "hour", length.out = 20),id=2)
date3<-data.frame(date=seq(as.POSIXct("2014-09-04 01:00:00"), by = "hour", length.out = 20),id=3)
df2<-data.frame(date=rbind(date1,date2,date3),values=runif(60,50,90))

The Goal: subset the continuous time series in df2 to only the rows between start_date and end_date in df1 (by id), and keep the values field from each data frame. There is a somewhat similar question here, but in that case the time period was static and known. I am struggling with how to do this given the multiple events per id.

asked May 04 '20 by LauraR


2 Answers

data.table has the function foverlaps, which does what you need.

foverlaps stands for “fast overlap joins”. The function takes two data frames (data.tables in this case) and returns the join.

Both data.tables need start and end columns to compute the overlaps. Since you only have one date column in df2, I'm creating a dummy_end column with the same value as date.date in df2.

You can use the options by.x and by.y to indicate the start and end columns. However, you can also use keys to do so using the setkey statement. The last two elements of the setkey must be the start and end columns. The advantage of using setkey is that you can add additional keys (before start and end) to filter the join further. In the present case, I’ll also set a key for id.

The [, dummy_end := NULL] is used to delete the dummy_end column.

library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setnames(dt2, "date.id", "id") # rename to "id" for easier comparison
dt2[, dummy_end := date.date]  # create dummy end date column
setkey(dt1, id, start_date, end_date)
setkey(dt2, id, date.date, dummy_end)

foverlaps(dt2, dt1, nomatch = NULL)[, dummy_end := NULL]
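
For reference, a minimal sketch of the by.x route mentioned above, assuming the dt1/dt2 objects created in this answer: only dt1 (the table being joined against) must be keyed, and dt2's interval columns are passed explicitly.

# a sketch of the by.x alternative: dt2 needs no key here, but dt1
# must still be keyed so foverlaps knows its id/start/end columns
setkey(dt1, id, start_date, end_date)
foverlaps(dt2, dt1, by.x = c("id", "date.date", "dummy_end"),
          nomatch = NULL)[, dummy_end := NULL]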

In terms of performance, foverlaps is marginally faster than dplyr for this particular problem (but still slower than base R). Indeed, you can see below that I reran Paul's microbenchmark to add data.table. However, I like the clean and simple data.table syntax.

DATA and Benchmark

library(dplyr)
library(microbenchmark)
library(data.table)

df1 <- data.frame(start_date=seq(as.POSIXct("2014-09-04 00:00:00"), 
       by = "hour", length.out = 10),
       end_date=seq(as.POSIXct("2014-09-04 05:00:00"), 
       by = "hour", length.out = 10),
       values=runif(20,10,50),id=rep(seq(from=1,to=5,by=1),2))

date1 <-data.frame(date = seq(as.POSIXct("2012-09-04 02:00:00"), 
                              by = "hour", 
                              length.out = 20), id = 1)
date2 <-data.frame(date = seq(as.POSIXct("2014-09-03 07:00:00"), 
                              by = "hour", 
                              length.out = 20),id = 2)
date3 <-data.frame(date = seq(as.POSIXct("2014-09-04 01:00:00"), 
                              by = "hour", length.out = 20),id = 3)
df2 <-data.frame(date = rbind(date1,date2,date3), values = runif(60,50,90))

dt1 <- data.table(df1)
dt2 <- data.table(df2)
setnames(dt2, "date.id", "id") # rename to "id" for easier comparison
dt2[, dummy_end := date.date]  # create dummy end date column
setkey(dt1, id, start_date, end_date)
setkey(dt2, id, date.date,  dummy_end)

dplyr2 <- function(df1, df2) {
  df <- left_join(df1, df2, by = c("id" = "date.id")) %>%
    group_by(id) %>%
    filter(date.date >= start_date &
             date.date <= end_date) %>%
    select(start_date,
           end_date,
           x_values = values.x,
           y_values = values.y,
           id,
           date.date) %>%
    ungroup()
}

baseR2 <- function(df1, df2) {
  df_bR <- merge(df1, df2, by.x = "id", by.y = "date.id")
  df_bR <- subset(
    df_bR,
    subset = df_bR$date.date >=  df_bR$start_date &
      df_bR$date.date <=  df_bR$end_date,
    select = c(start_date, end_date, values.x, values.y, id, date.date)
  )
}

data.table2 <- function(dt1, dt2) {
  foverlaps(dt2, dt1, nomatch = NULL)[, dummy_end := NULL]
}


microbenchmark(baseR = baseR2(df1, df2),
               dplyr = dplyr2(df1, df2),
               data.table=data.table2(dt1, dt2),
               times = 50)
Unit: milliseconds
       expr    min     lq     mean median     uq     max neval
      baseR 1.2328 1.3973 1.632302 1.4713 1.5596  7.0549    50
      dplyr 8.2126 8.6865 9.628708 8.8531 9.2621 19.5883    50
 data.table 6.6931 7.3884 7.974340 7.9406 8.3973 11.0060    50
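
As a quick sanity check (a sketch using the functions defined above), all three approaches should select the same rows, since foverlaps with a point interval is equivalent to the range filter:

# each method returns one row per df2 observation falling inside a df1 window,
# so the row counts should agree across implementations
nrow(baseR2(df1, df2)) == nrow(dplyr2(df1, df2))
nrow(baseR2(df1, df2)) == nrow(data.table2(dt1, dt2))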
answered Oct 22 '22 by Pierre Lapointe


Your goal is not entirely clear to me, but this is my reading: if the time (ignoring the date) in date.date falls within start_date and end_date, you would like to subset by id.

Here's how I approached this:

library(dplyr)

df1<-data.frame(start_date=seq(as.POSIXct("2014-09-04 00:00:00"), by = "hour", length.out = 10),
                end_date=seq(as.POSIXct("2014-09-04 05:00:00"), by = "hour", length.out = 10),
                values=runif(20,10,50),id=rep(seq(from=1,to=5,by=1),2))

date1<-data.frame(date=seq(as.POSIXct("2012-10-01 00:00:00"), by = "hour", length.out = 20), id=1)
date2<-data.frame(date=seq(as.POSIXct("2014-10-01 07:00:00"), by = "hour", length.out = 20), id=2)
date3<-data.frame(date=seq(as.POSIXct("2015-10-01 01:00:00"), by = "hour", length.out = 20), id=3)
df2<-data.frame(date=rbind(date1,date2,date3),values=runif(60,50,90))

df <- left_join(df1, df2, by = c("id" = "date.id")) %>%
  mutate(date.date.hms = strftime(date.date, format = "%H:%M:%S"),
         start_date.hms = strftime(start_date, format = "%H:%M:%S"),
         end_date.hms = strftime(end_date, format = "%H:%M:%S")) %>%
  mutate(date.date.hms = as.POSIXct(date.date.hms, format="%H:%M:%S"),
         start_date.hms = as.POSIXct(start_date.hms, format="%H:%M:%S"),
         end_date.hms = as.POSIXct(end_date.hms, format="%H:%M:%S")) %>%
  group_by(id) %>% 
  filter(date.date.hms >= start_date.hms & date.date.hms <= end_date.hms) %>%
  select(start_date, end_date, x_values = values.x, y_values = values.y, id, date.date) %>%
  ungroup()

This results in the following data frame:

> df
# A tibble: 62 x 6
   start_date          end_date            x_values y_values    id date.date          
   <dttm>              <dttm>                 <dbl>    <dbl> <dbl> <dttm>             
 1 2014-09-04 00:00:00 2014-09-04 05:00:00     31.5     77.5     1 2012-10-01 00:00:00
 2 2014-09-04 00:00:00 2014-09-04 05:00:00     31.5     54.5     1 2012-10-01 01:00:00
 3 2014-09-04 00:00:00 2014-09-04 05:00:00     31.5     70.3     1 2012-10-01 02:00:00
 4 2014-09-04 00:00:00 2014-09-04 05:00:00     31.5     85.5     1 2012-10-01 03:00:00
 5 2014-09-04 00:00:00 2014-09-04 05:00:00     31.5     82.2     1 2012-10-01 04:00:00
 6 2014-09-04 00:00:00 2014-09-04 05:00:00     31.5     57.4     1 2012-10-01 05:00:00
 7 2014-09-04 01:00:00 2014-09-04 06:00:00     37.0     78.8     2 2014-10-02 01:00:00
 8 2014-09-04 01:00:00 2014-09-04 06:00:00     37.0     51.9     2 2014-10-02 02:00:00
 9 2014-09-04 02:00:00 2014-09-04 07:00:00     34.1     85.8     3 2015-10-01 02:00:00
10 2014-09-04 02:00:00 2014-09-04 07:00:00     34.1     69.4     3 2015-10-01 03:00:00

My approach was to first join the data frames by id, then split the time information from the date (into the .hms columns) as a string and convert it back to a POSIXct object. This adds today's date to the time, but that's OK if I just want to filter on the time (and not the date). This results in a data frame where the records have a date.date time within start_date and end_date. It is now easy to subset by the id column.
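
As an aside, a minimal sketch of the same idea without the second as.POSIXct() round trip: the zero-padded "HH:MM:SS" strings that strftime produces compare correctly as plain characters, so the filter can operate on the strings directly (column names as above).

# a sketch: fixed-width "HH:MM:SS" strings sort lexicographically,
# so the time-of-day filter can compare them as characters
df_alt <- left_join(df1, df2, by = c("id" = "date.id")) %>%
  filter(strftime(date.date, format = "%H:%M:%S") >= strftime(start_date, format = "%H:%M:%S"),
         strftime(date.date, format = "%H:%M:%S") <= strftime(end_date, format = "%H:%M:%S"))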

Is this what you are after?

UPDATE

LauraR explained that the dates in df1 & df2 overlap, and she updated df1 & df2 in her example. With that update I can rewrite the code without the conversions from POSIXct to character and back. It appears that as.POSIXct is a slow operation.

I can now do the following:

  • remove all the date-time conversions and only check if the date-time in df2 is within the date-time range of df1
  • rewrite the code in dplyr and baseR: we know that the pipe creates substantial overhead.
  • turn the code into functions so I can benchmark them.

With the code:

library(dplyr)
library(microbenchmark)

df1 <- data.frame(start_date=seq(as.POSIXct("2014-09-04 00:00:00"), by = "hour", length.out = 10),
                  end_date=seq(as.POSIXct("2014-09-04 05:00:00"), by = "hour", length.out = 10),
                  values=runif(20,10,50),id=rep(seq(from=1,to=5,by=1),2))

date1 <-data.frame(date = seq(as.POSIXct("2012-09-04 02:00:00"), 
                              by = "hour", 
                              length.out = 20), id = 1)
date2 <-data.frame(date = seq(as.POSIXct("2014-09-03 07:00:00"), 
                              by = "hour", 
                              length.out = 20),id = 2)
date3 <-data.frame(date = seq(as.POSIXct("2014-09-04 01:00:00"), 
                              by = "hour", 
                              length.out = 20),id = 3)
df2 <-data.frame(date = rbind(date1,date2,date3), values = runif(60,50,90))

dplyr2 <- function(df1, df2) {
  df <- left_join(df1, df2, by = c("id" = "date.id")) %>%
    group_by(id) %>%
    filter(date.date >= start_date &
             date.date <= end_date) %>%
    select(start_date,
           end_date,
           x_values = values.x,
           y_values = values.y,
           id,
           date.date) %>%
    ungroup()
}

baseR2 <- function(df1, df2) {
  df_bR <- merge(df1, df2, by.x = "id", by.y = "date.id")
  df_bR <- subset(
    df_bR,
    subset = df_bR$date.date >=  df_bR$start_date &
      df_bR$date.date <=  df_bR$end_date,
    select = c(start_date, end_date, values.x, values.y, id, date.date)
  )
}

data_baseR <- baseR2(df1, df2)
data_dplyr <- dplyr2(df1, df2)

microbenchmark(baseR = baseR2(df1, df2),
               dplyr = dplyr2(df1, df2),
               times = 5)

This code is a lot faster than before, and I am sure it will require less memory. Comparing dplyr and base R:

> data_baseR <- baseR2(df1, df2)
> microbenchmark(baseR = baseR2(df1, df2),
+                dplyr = dplyr2(df1, df2),
+                times = 5)
Unit: microseconds
  expr    min     lq    mean median     uq    max neval
 baseR  897.5  905.3 1868.66  991.2 1041.0 5508.3     5
 dplyr 5755.9 5970.2 6158.88 6277.4 6393.3 6397.6     5

shows that the baseR code runs a lot faster.

answered Oct 22 '22 by Paul van Oppen