Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr: grouping and summarizing/mutating data with rolling time windows

I have irregular timeseries data representing a certain type of transaction for users. Each line of data is timestamped and represents a transaction at that time. By the irregular nature of the data some users might have 100 rows in a day and other users might have 0 or 1 transaction in a day.

The data might look something like this:

data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", 
           "2015-01-01", 
           "2015-01-05", 
           "2015-01-25",
           "2015-02-15",
           "2015-05-05", 
           "2015-01-01", 
           "2015-08-01", 
           "2015-01-01"),
  n_widgets = c(1,2,3,4,4,5,2,4,5)
)

   id       date n_widgets
1  1 2015-01-01         1
2  1 2015-01-01         2
3  1 2015-01-05         3
4  1 2015-01-25         4
5  1 2015-02-15         4
6  2 2015-05-05         5
7  2 2015-01-01         2
8  3 2015-08-01         4
9  4 2015-01-01         5

Often I'd like to know some rolling statistics about users. For example: for this user on a certain day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days etc.

Corresponding to the above example, the data should look like:

   id     date    n_widgets  n_trans_30  total_widgets_30
1  1 2015-01-01         1           1             1
2  1 2015-01-01         2           2             3
3  1 2015-01-05         3           3             6
4  1 2015-01-25         4           4             10
5  1 2015-02-15         4           2             8
6  2 2015-05-05         5           1             5
7  2 2015-01-01         2           1             2
8  3 2015-08-01         4           1             4
9  4 2015-01-01         5           1             5

If the time window is daily then the solution is simple: data %>% group_by(id, date) %>% summarize(...)

Similarly if the time window is monthly this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)

However the challenge I'm having is how to setup a time window for an arbitrary period: 5-days, 10-days etc.

There's also the RcppRoll library but both RcppRoll and the rolling functions in zoo seem more setup for regular time series. As far as I can tell these window functions work based on the number of rows instead of a specified time period -- the key difference is that a certain time period might have a differing number of rows depending on date and user.

For example, it's possible for user 1, that the number of transactions in the 5 days previous of 2015-01-01 is equal to 100 transactions and for the same user the number of transactions in the 5 days previous of 2015-02-01 is equal to 5 transactions. Thus looking back a set number of rows will simply not work.

Additionally, there is another SO thread discussing rolling dates for irregular time series type data (Create new column based on condition that exists within a rolling date) however the accepted solution was using data.table and I'm specifically looking for a dplyr way of achieving this.

I suppose at the heart of this issue, this problem can be solved by answering this question: how can I group_by arbitrary time periods in dplyr. Alternatively, if there's a different dplyr way to achieve above without a complicated group_by, how can I do it?

EDIT: updated example to make nature of the rolling window more clear.

like image 794
divide_by_zero Avatar asked Mar 23 '16 20:03

divide_by_zero


2 Answers

This can be done using SQL:

library(sqldf)

dd <- transform(data, date = as.Date(date))
sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) 'total_widgets30' 
       from dd a 
       left join dd b on b.date between a.date - 30 and a.date 
                         and b.id = a.id
                         and b.rowid <= a.rowid
       group by a.rowid")

giving:

  id       date n_widgets n_trans30 total_widgets30
1  1 2015-01-01         1         1               1
2  1 2015-01-01         2         2               3
3  1 2015-01-05         3         3               6
4  1 2015-01-25         4         4              10
5  2 2015-05-05         5         1               5
6  2 2015-01-01         2         1               2
7  3 2015-08-01         4         1               4
8  4 2015-01-01         5         1               5
like image 68
G. Grothendieck Avatar answered Nov 16 '22 04:11

G. Grothendieck


Another approach is to expand your dataset to contain all possible days (using tidyr::complete), then use a rolling function (RcppRoll::roll_sum)

The fact that you have multiple observations per day is probably creating an issue though...

library(tidyr)
library(RcppRoll)
df2 <- df %>%
   mutate(date=as.Date(date))

## create full dataset with all possible dates (go even 30 days back for first observation)
df_full<- df2 %>%
 mutate(date=as.Date(date))  %>%
   complete(id, 
       date=seq(from=min(.$date)-30,to=max(.$date), by=1), 
       fill=list(n_widgets=0))

## now use rolling function, and keep only original rows (left join)
df_roll <- df_full %>%
  group_by(id) %>%
  mutate(n_trans_30=roll_sum(x=n_widgets!=0, n=30, fill=0, align="right"),
         total_widgets_30=roll_sum(x=n_widgets, n=30, fill=0, align="right")) %>%
  ungroup() %>%
  right_join(df2, by = c("date", "id", "n_widgets"))

The result is the same as yours (by chance)

     id       date n_widgets n_trans_30 total_widgets_30
  <dbl>     <date>     <dbl>      <dbl>            <dbl>
1     1 2015-01-01         1          1                1
2     1 2015-01-01         2          2                3
3     1 2015-01-05         3          3                6
4     1 2015-01-25         4          4               10
5     1 2015-02-15         4          2                8
6     2 2015-05-05         5          1                5
7     2 2015-01-01         2          1                2
8     3 2015-08-01         4          1                4
9     4 2015-01-01         5          1                5

But as said, it will fail for some days as it count last 30 obs, not last 30 days. So you might want first to summarise the information by day, then apply this.

like image 40
Matifou Avatar answered Nov 16 '22 03:11

Matifou