I have a long numerical time series data of approximately 200,000 rows (lets call it Z).
In a loop, I subset x (about 30) consecutive rows from Z at a time and treat them as the query point q.
I want to locate within Z the y (~300) most correlated time series segments of length x (most correlated with q).
What is an efficient way to accomplish this?
The code below finds the 300 segments you are looking for and runs in 8 seconds on my none too powerful Windows laptop, so it should be fast enough for your purposes.
First, it constructs a 30-by-199971 matrix (Zmat
), whose columns contain all of the length-30 "time series segments" you want to examine. A single call to cor()
, operating on the vector q
and the matrix Zmat
, then calculates all of the desired correlation coefficients. Finally, the resultant vector is examined to identify the 300 sequences having the highest correlation coefficients.
# Simulate data
nZ <- 200000
nq <- 30
Z <- rnorm(nZ)
q <- seq_len(nq)
# From Z, construct a 30 by 199971 matrix, in which each column is a
# "time series segment". Column 1 contains observations 1:30, column 2
# contains observations 2:31, and so on through the end of the series.
Zmat <- sapply(seq_len(nZ - nq + 1),
FUN = function(X) Z[seq(from = X, length.out = nq)])
# Calculate the correlation of q with every column/"time series segment.
Cors <- cor(q, Zmat)
# Extract the starting position of the 300 most highly correlated segments
ids <- order(Cors, decreasing=TRUE)[1:300]
# Maybe try something like the following to confirm that you have
# selected the most highly correlated segments.
hist(Cors, breaks=100)
hist(Cors[ids], col="red", add=TRUE)
The naive solution is indeed very slow (at least several minutes -- I am not patient enough):
library(zoo)
n <- 2e5
k <- 30
z <- rnorm(n)
x <- rnorm(k) # We do not use the fact that x is a part of z
rollapply(z, k, function(u) cor(u,x), align="left")
You can compute the correlation by hand, from the first moments and comoments, but it still takes several minutes.
y <- zoo(rnorm(n), 1:n)
x <- rnorm(k)
exy <- exx <- eyy <- ex <- ey <- zoo( rep(0,n), 1:n )
for(i in 1:k) {
cat(i, "\n")
exy <- exy + lag(y,i-1) * x[i]
ey <- ey + lag(y,i-1)
eyy <- eyy + lag(y,i-1)^2
ex <- ex + x[i] # Constant time series
exx <- exx + x[i]^2 # Constant time series
}
exy <- exy/k
ex <- ex/k
ey <- ey/k
exx <- exx/k
eyy <- eyy/k
covxy <- exy - ex * ey
vx <- exx - ex^2
vy <- eyy - ey^2
corxy <- covxy / sqrt( vx * vy )
Once you have the time series of the correlations, it is easy to extract the position of the top 300.
i <- order(corxy, decreasing=TRUE)[1:300]
corxy[i]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With