Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - Faster Way to Calculate Rolling Statistics Over a Variable Interval

I'm curious if anyone out there can come up with a (faster) way to calculate rolling statistics (rolling mean, median, percentiles, etc.) over a variable interval of time (windowing).

That is, suppose one is given randomly timed observations (i.e. not daily, or weekly data, observations just have a time stamp, as in ticks data), and suppose you'd like to look at center and dispersion statistics that you are able to widen and tighten the interval of time over which these statistics are calculated.

I made a simple for loop that does this. But it obviously runs very slow (In fact I think my loop is still running over a small sample of data I set up to test its speed). I've been trying to get something like ddply to do this - which seems straitforward to get to run for daily stats - but I can't seem to work my way out of it.

Example:

Sample Set-up:

df <- data.frame(Date = runif(1000,0,30))
df$Price <- I((df$Date)^0.5 * (rnorm(1000,30,4)))
df$Date <- as.Date(df$Date, origin = "1970-01-01")

Example Function (that runs really slow with many observations

SummaryStats <- function(dataframe, interval){
  # Returns daily simple summary stats, 
  # at varying intervals
  # dataframe is the data frame in question, with Date and Price obs
  # interval is the width of time to be treated as a day

  firstDay <- min(dataframe$Date)
  lastDay  <- max(dataframe$Date)
  result <- data.frame(Date = NULL,
                       Average = NULL,  Median = NULL,
                       Count = NULL,
                       Percentile25 = NULL, Percentile75 = NULL)

  for (Day in firstDay:lastDay){

    dataframe.sub = subset(dataframe,
                Date > (Day - (interval/2))
                & Date < (Day + (interval/2)))

    nu = data.frame(Date = Day, 
                    Average = mean(dataframe.sub$Price),
                    Median = median(dataframe.sub$Price),
                    Count = length(dataframe.sub$Price),
                    P25 = quantile(dataframe.sub$Price, 0.25),
                    P75 = quantile(dataframe.sub$Price, 0.75))

    result = rbind(result,nu)

  }

  return(result)

}

Your advice would be welcome!

like image 889
EconomiCurtis Avatar asked Nov 22 '13 00:11

EconomiCurtis


3 Answers

Think of all of the points connected as a chain. Think of this chain as a graph, where each data point is a node. Then, for each node, we want to find all other nodes that are distance w or less away. To do this, I first generate a matrix that gives pairwise distances. The nth row gives the distance for nodes n nodes apart.

# First, some data
x = sort(runif(25000,0,4*pi))
y = sin(x) + rnorm(length(x),0,0.5)

# calculate the rows of the matrix one by one
# until the distance between the two closest nodes is greater than w
# This algorithm is actually faster than `dist` because it usually stops
# much sooner
dl = list()
dl[[1]] = diff(x)
i = 1
while( min(dl[[i]]) <= w ) {
  pdl = dl[[i]]
  dl[[i+1]] = pdl[-length(pdl)] + dl[[1]][-(1:i)]
  i = i+1
}

# turn the list of the rows into matrices
rarray = do.call( rbind, lapply(dl,inf.pad,length(x)) )
larray = do.call( rbind, lapply(dl,inf.pad,length(x),"right") )

# extra function
inf.pad = function(x,size,side="left") {
  if(side=="left") {
    x = c( x, rep(Inf, size-length(x) ) )
  } else {
    x = c( rep(Inf, size-length(x) ), x )
  }
  x
}

I then use the matrices to determine the edge of each window. For this example, I set w=2.

# How many data points to look left or right at each data point
lookr = colSums( rarray <= w )
lookl = colSums( larray <= w )

# convert these "look" variables to indeces of the input vector
ri = 1:length(x) + lookr
li = 1:length(x) - lookl

With the windows defined, it's pretty simple to use the *apply functions to get the final answer.

rolling.mean = vapply( mapply(':',li,ri), function(i) .Internal(mean(y[i])), 1 )

All of the above code took about 50 seconds on my computer. This is a little faster than the rollmean_r function in my other answer. However, the especially nice thing here is that the indeces are provided. You could then use whatever R function you like with the *apply functions. For example,

rolling.mean = vapply( mapply(':',li,ri), 
                                        function(i) .Internal(mean(y[i])), 1 )

takes about 5 seconds. And,

rolling.median = vapply( mapply(':',li,ri), 
                                        function(i) median(y[i]), 1 )

takes about 14 seconds. If you wanted to, you could use the Rcpp function in my other answer to get the indeces.

like image 196
kdauria Avatar answered Nov 18 '22 18:11

kdauria


Let's see... you are doing a loop( very slow in R), making unnecessary copies of data in creating subset, and using rbind to accumulate you data set. If you avoid those, things will speed up considerably. Try this...

Summary_Stats <- function(Day, dataframe, interval){
    c1 <- dataframe$Date > Day - interval/2 & 
        dataframe$Date < Day + interval/2
    c(
        as.numeric(Day),
        mean(dataframe$Price[c1]),
        median(dataframe$Price[c1]),
        sum(c1),
        quantile(dataframe$Price[c1], 0.25),
        quantile(dataframe$Price[c1], 0.75)
      )
}
Summary_Stats(df$Date[2],dataframe=df, interval=20)
firstDay <- min(df$Date)
lastDay  <- max(df$Date)
system.time({
    x <- sapply(firstDay:lastDay, Summary_Stats, dataframe=df, interval=20)
    x <- as.data.frame(t(x))
    names(x) <- c("Date","Average","Median","Count","P25","P75")
    x$Date <- as.Date(x$Date)
})
dim(x)
head(x)
like image 22
ndr Avatar answered Nov 18 '22 17:11

ndr


In reply to my question to "Kevin" above, I think I figured something out below.

This function takes ticks data (time observations come in at random intervals are and indicated by a time stamp) and calculates the the mean over an interval.

library(Rcpp)

cppFunction('
  NumericVector rollmean_c2( NumericVector x, NumericVector y, double width,
                              double Min, double Max) {

double total = 0, redge,center;
unsigned int n = (Max - Min) + 1,
                  i, j=0, k, ledge=0, redgeIndex;
NumericVector out(n);


for (i = 0; i < n; i++){
  center = Min + i + 0.5;
  redge = center - width / 2;
  redgeIndex = 0;
  total = 0;

  while (x[redgeIndex] < redge){
    redgeIndex++;
  }
  j = redgeIndex;

  while (x[j] < redge + width){
    total += y[j++];

  }

  out[i] = total / (j - redgeIndex);
}
return out;

  }')

# Set up example data
x = seq(0,4*pi,length.out=2500)
y = sin(x) + rnorm(length(x),0.5,0.5)
plot(x,y,pch=20,col="black",
     main="Sliding window mean; width=1",
     sub="rollmean_c in red      rollmean_r overlaid in white.")


c.out = rollmean_c2(x,y,width=1,Min = min(x), Max = max(x)) 
lines(0.5:12.5,c.out,col="red",lwd=3)

enter image description here

like image 2
EconomiCurtis Avatar answered Nov 18 '22 16:11

EconomiCurtis