Is there some way to use rollapply (from the zoo package or something similar) or its optimized variants (rollmean, rollmedian, etc.) to compute rolling functions with a time-based window, instead of one based on a number of observations? What I want is simple: for each element in an irregular time series, I want to compute a rolling function over an N-day window. That is, the window should include all the observations up to N days before the current observation. The time series may also contain duplicate dates.
Here is an example. Given the following time series:
date value
1/11/2011 5
1/11/2011 4
1/11/2011 2
8/11/2011 1
13/11/2011 0
14/11/2011 0
15/11/2011 0
18/11/2011 1
21/11/2011 4
5/12/2011 3
A rolling median with a 5-day window, aligned to the right, should result in the following calculation:
> c(
median(c(5)),
median(c(5,4)),
median(c(5,4,2)),
median(c(1)),
median(c(1,0)),
median(c(0,0)),
median(c(0,0,0)),
median(c(0,0,0,1)),
median(c(1,4)),
median(c(3))
)
[1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
I have already found some solutions out there, but they are usually tricky, which usually means slow. I managed to implement my own rolling-function calculation. The problem is that for very long time series the optimized version of median (rollmedian) can make a huge time difference, since it takes the overlap between consecutive windows into account, and I would like to avoid reimplementing it. I suspect there is some trick with rollapply's parameters that will make it work, but I cannot figure it out. Thanks in advance for the help.
As of version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins, which can be used here.
The OP has requested:
for each element in an irregular time series, I want to compute a rolling function with a N-days window. That is, the window should include all the observations up to N days before the current observation. Time series may also contain duplicates.
Note that the OP has requested to include all the observations up to N days before the current observation. This is different from requesting all the observations up to N days before the current day. For the latter, I would expect only one value for 1/11/2011, i.e., median(c(5, 4, 2)) = 4.
Apparently, the OP expects an observation-based rolling window which is limited to N days. Therefore, the join conditions of the non-equi join have to consider the row number as well.
library(data.table)
n_days <- 5L
# non-equi self-join: for each row number rn, pick all rows up to rn
# whose date falls within the n_days window ending at that row's date
setDT(DT)[, rn := .I][
  .(ur = rn, ud = date, ld = date - n_days),
  on = .(rn <= ur, date <= ud, date >= ld),
  median(as.double(value)), by = .EACHI]$V1
[1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
For the sake of completeness, a possible solution for the day-based rolling window could be:
setDT(DT)[.(ud = unique(date), ld = unique(date) - n_days), on = .(date <= ud, date >= ld),
median(as.double(value)), by = .EACHI]
         date       date  V1
1: 2011-11-01 2011-10-27 4.0
2: 2011-11-08 2011-11-03 1.0
3: 2011-11-13 2011-11-08 0.5
4: 2011-11-14 2011-11-09 0.0
5: 2011-11-15 2011-11-10 0.0
6: 2011-11-18 2011-11-13 0.0
7: 2011-11-21 2011-11-16 2.5
8: 2011-12-05 2011-11-30 3.0
library(data.table)
DT <- fread(" date value
1/11/2011 5
1/11/2011 4
1/11/2011 2
8/11/2011 1
13/11/2011 0
14/11/2011 0
15/11/2011 0
18/11/2011 1
21/11/2011 4
5/12/2011 3")[
# coerce date from character string to integer date class
, date := as.IDate(date, "%d/%m/%Y")]
1) rollapply I haven't checked the speed, but if no date has more than max.dup occurrences, then the last 5 * max.dup entries must contain the last 5 days, so the one-line function fn shown below, passed to rollapplyr, will do it:
library(zoo)

k <- 5
dates <- as.numeric(DF$date)
values <- DF$value
max.dup <- max(table(dates))
fn <- function(ix, d = dates[ix], v = values[ix], n = length(ix)) median(v[d >= d[n]-k])
rollapplyr(1:nrow(DF), max.dup * k, fn, partial = TRUE)
## [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
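An alternative that stays within zoo (a sketch not in the original answer): rollapply also accepts a plain numeric vector as width, giving each row its own window size. One can precompute, with findInterval on the sorted date vector, how many observations fall within each 5-day window and pass that vector to rollapplyr. The `d - k - 1` trick assumes the dates are whole days (integer-valued when converted with as.numeric).

```r
library(zoo)

DF <- data.frame(
  date = as.Date(c("2011-11-01", "2011-11-01", "2011-11-01", "2011-11-08",
                   "2011-11-13", "2011-11-14", "2011-11-15", "2011-11-18",
                   "2011-11-21", "2011-12-05")),
  value = c(5, 4, 2, 1, 0, 0, 0, 1, 4, 3)
)

k <- 5
d <- as.numeric(DF$date)

# w[i] = number of rows whose date lies in [d[i] - k, d[i]]:
# findInterval(d - k - 1, d) counts the rows strictly before the window
w <- seq_along(d) - findInterval(d - k - 1, d)

# a numeric width vector makes each right-aligned window exactly w[i] rows wide
rollapplyr(DF$value, width = w, FUN = median)
## [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
```

This avoids the max.dup over-allocation above: each window contains exactly the rows in the 5-day span, so fn's date filtering is no longer needed.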
2) sqldf We can use an SQL self-join to do this. We join to each a row those b rows that lie no more than 5 days back, and then group by the a row, taking the median of the b rows joined to it.
library(sqldf)
k <- 5
res <- fn$sqldf("select a.date, a.value, median(b.value) median
from DF a
left join DF b on b.date between a.date - $k and a.date and b.rowid <= a.rowid
group by a.rowid")
giving:
res$median
## [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
Note: We used this for DF:
Lines <- "
date value
1/11/2011 5
1/11/2011 4
1/11/2011 2
8/11/2011 1
13/11/2011 0
14/11/2011 0
15/11/2011 0
18/11/2011 1
21/11/2011 4
5/12/2011 3
"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date, format = "%d/%m/%Y")
I recommend the runner package, which is optimized to do the operation requested in this topic. See the section Windows depending on date in the documentation for further explanation.
To solve your task, one can use the runner function, which can execute any R function in running windows. A one-liner:
df <- read.table(
text = "date value
2011-11-01 5
2011-11-01 4
2011-11-01 2
2011-11-08 1
2011-11-13 0
2011-11-14 0
2011-11-15 0
2011-11-18 1
2011-11-21 4
2011-12-05 3", header = TRUE, colClasses = c("Date", "integer"))
library(runner)
runner(df$value, k = 5, idx = df$date, f = median)
[1] 5.0 4.5 4.0 1.0 0.0 0.0 0.0 0.0 2.5 3.0
P.S. One should be aware that a 5-day window here means [i-4, i-3, i-2, i-1, i], rather than (i-5):i (which would be a 6-day window). I've made the example with a 5-day window, but if one wants to reproduce the result as the OP requested, one can specify a 6-day window:
identical(
runner(df$value, k = 6, idx = df$date, f = median),
c(5.0, 4.5, 4.0, 1.0, 0.5, 0.0, 0.0, 0.0, 2.5, 3.0)
)
# [1] TRUE