Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sliding time intervals for time series data in R

I am trying to extract interesting statistics for an irregular time series data set, but coming up short on finding the right tools for the job. The tools for manipulating regularly sampled time series or index-based series of any time are pretty easily found, though I'm not having much luck with the problems I'm trying to solve.

First, a reproducible data set:

library(zoo)
set.seed(0)
nSamples    <- 5000
vecDT       <- rexp(nSamples, 3)
vecTimes    <- cumsum(c(0,vecDT))
vecDrift    <- c(0, rnorm(nSamples, mean = 1/nSamples, sd = 0.01))
vecVals     <- cumsum(vecDrift)
vecZ        <- zoo(vecVals, order.by = vecTimes)
rm(vecDT, vecDrift)

Assume the times are in seconds. There are almost 1700 seconds (just shy of 30 minutes) in the vecZ series, and 5001 entries during that time. (NB: I'd try using xts, but xts seems to need date information, and I'd rather not use a particular date when it's not relevant.)

My goals are the following:

  • Identify the indices of the values 3 minutes before and 3 minutes after each point. As the times are continuous, I doubt that any two points are precisely 3 minutes apart. What I'd like to find are the points that are at most 3 minutes prior, and at least 3 minutes after, the given point, i.e. something like the following (in pseudocode):

    backIX(t, vecZ, tDelta) = min{ix in length(vecZ) : t - time(ix) < tDelta} forwardIX(t, vecZ, tDelta) = min{ix in length(vecZ) : time(ix) - t > tDelta}

    So, for 3 minutes, tDelta = 180. If t=2500, then the result for forwardIX() would be 3012 (i.e. time(vecZ)[2500] is 860.1462, and time(vecZ)[3012] is 1040.403, or just over 180 seconds later), and the output of backwardIX() would be 2020 (corresponding to time 680.7162 seconds).

    Ideally, I would like to use a function that does not require t, as that is going to require length(vecZ) calls to the function, which ignores the fact that sliding windows of time can be calculated more efficiently.

  • Apply a function to all values in a rolling window of time. I've seen rollapply, which takes a fixed window size (i.e. fixed number of indices, but not a fixed window of time). I can solve this the naive way, with a loop (or foreach ;-)) that is calculated per index t, but I wondered if there are some simple functions already implemented, e.g. a function to calculate the mean of all values in a given time frame. Since this can be done efficiently via simple summary statistics that slide over a window, it should be computationally cheaper than a function that accesses all of the data multiple times to calculate each statistic. Some fairly natural functions: mean, min, max, and median.

    Even if the window isn't varying by time, the ability to vary the window size would be adequate, and I can find that window size using the result of the question above. However, that still seems to require excess calculations, so being able to specify time-based intervals seems more efficient.

Are there packages in R that facilitate such manipulations of data in time-windows, or am I out of luck and I should write my own functions?


Note 1: This question seeks to do something similar, except over disjoint intervals, rather than rolling windows of time, e.g. I could adapt this to do my analysis on every successive 3 minute block, but I don't see a way to adapt this for rolling 3 minute intervals.

Note 2: I've found that switching from a zoo object to a numeric vector (for the times) has significantly sped up the issue of range-finding / window endpoint identification for the first goal. That's still a naive algorithm, but it's worth mentioning that working with zoo objects may not be optimal for the naive approach.

like image 632
Iterator Avatar asked May 05 '12 21:05

Iterator


2 Answers

Here's what I was suggeting, but I'm not sure it exactly answers your question

#Picking up where your code left off
library(xts)
library(TTR)
x <- .xts(vecZ, vecTimes)
xx <- na.locf(cbind(xts(, seq.POSIXt(from=start(x), to=end(x), by='sec')), x))
x$means <- runMean(xx, n=180)
out <- x[!is.na(x[, 1]), ]
tail(out)

                                  x     means
1969-12-31 18:28:17.376141 0.2053531 0.1325938
1969-12-31 18:28:17.379140 0.2101565 0.1329065
1969-12-31 18:28:17.619840 0.2139770 0.1332403
1969-12-31 18:28:17.762765 0.2072574 0.1335843
1969-12-31 18:28:17.866473 0.2065790 0.1339608
1969-12-31 18:28:17.924270 0.2114755 0.1344264
like image 66
GSee Avatar answered Oct 28 '22 01:10

GSee


As of version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to aggregate in a non-equi join which can be used to apply a rolling function on a sliding time window of an irregular time series.

For demonstration and verification, a smaller dataset is used.

library(data.table)   # development version 1.11.9 used

# create small dataset
set.seed(0)
nSamples    <- 10
vecDT       <- rexp(nSamples, 3)
vecTimes    <- cumsum(c(0,vecDT))
vecVals     <- 0:nSamples
vec         <- data.table(vecTimes, vecVals)
vec
      vecTimes vecVals
 1: 0.00000000       0
 2: 0.06134553       1
 3: 0.10991444       2
 4: 0.15651286       3
 5: 0.30186907       4
 6: 1.26685858       5
 7: 1.67671260       6
 8: 1.85660688       7
 9: 2.17546271       8
10: 2.22447804       9
11: 2.68805641      10
# define window size in seconds 
win_sec = 0.3

# aggregate in sliding window by a non-equi join
vec[.(t = vecTimes, upper = vecTimes + win_sec, lower = vecTimes - win_sec), 
    on = .(vecTimes < upper, vecTimes > lower), 
    .(t, .N, sliding_mean = mean(vecVals)), by = .EACHI]
     vecTimes     vecTimes          t N sliding_mean
 1: 0.3000000 -0.300000000 0.00000000 4          1.5
 2: 0.3613455 -0.238654473 0.06134553 5          2.0
 3: 0.4099144 -0.190085564 0.10991444 5          2.0
 4: 0.4565129 -0.143487143 0.15651286 5          2.0
 5: 0.6018691  0.001869065 0.30186907 4          2.5
 6: 1.5668586  0.966858578 1.26685858 1          5.0
 7: 1.9767126  1.376712596 1.67671260 2          6.5
 8: 2.1566069  1.556606875 1.85660688 2          6.5
 9: 2.4754627  1.875462707 2.17546271 2          8.5
10: 2.5244780  1.924478037 2.22447804 2          8.5
11: 2.9880564  2.388056413 2.68805641 1         10.0

The first two columns show the upper and lower bounds of the time intervall, resp., t is the original vecTimes, and N denotes the number of data points included in the calculation of the sliding mean.

like image 40
Uwe Avatar answered Oct 28 '22 01:10

Uwe