The task: find the slope of the best linear fit (best meaning minimal error variance) within a moving window. The x values are equidistant, e.g., automated measurements over time.
The problem: performance is an issue, because this needs to be repeated for many datasets.
The naive implementation: loop over the windows of y values.
# some data
x <- 0:(8 * 60)
set.seed(42)
y <- -x^2 * 0.01 + x * 20 + rnorm(8 * 60 + 1, mean = 300, sd = 50)
plot(y ~ x, pch = ".")
optWinLinFit0 <- function(x, y, win_length) {
  xfit <- x[seq_len(win_length)]
  xfit <- xfit - min(xfit)
  # regression on each moving window
  res <- lapply(seq_len(length(x) - win_length), function(i, x, y) {
    y <- y[seq_len(win_length) + i - 1]
    list(y = y, fit = lm.fit(cbind(1, xfit), y))
  }, x = x, y = y)
  # find the fit with the smallest sigma^2
  winner <- which.min(sapply(res, function(x) 1 / (win_length - 2) * sum(x$fit$residuals^2)))
  y <- res[[winner]]$y
  # return fit summary and predicted values
  list(n = winner,
       summary = summary(lm(y ~ xfit)),
       dat = data.frame(x = x[seq_len(win_length) + winner - 1],
                        y = y,
                        ypred = res[[winner]]$fit$fitted.values))
}
res0 <- optWinLinFit0(x, y, 180)
lines(ypred ~ x, data = res0$dat, col = "red", lwd = 2)
The red line shows the fitted values at the position of the moving window where the error variance is minimal.
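The slope itself can then be read from the returned fit summary, e.g.:
res0$summary$coefficients[2, 1] # slope estimate of the winning window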
Any ideas how to do this faster?
You're basically doing a kernel regression. There are lots of functions and packages designed for this: KernSmooth, gam and locfit come to mind. In base R, there is also loess (and lowess, an older version). More broadly, package mgcv does the same thing, but using a different splines-based approach.
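For instance, a quick sketch with base R's loess on the data above (the span value here is an arbitrary choice, roughly playing the role of the window width):
lo <- loess(y ~ x, span = 0.3)
lines(x, fitted(lo), col = "blue", lwd = 2)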
For what you're doing, I'd use either gam::gam or mgcv::gam and use finite differences on predictions on a grid. Only the former is based on an actual local regression, but they both answer the question being asked.
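A minimal sketch of that route, assuming the x and y from the question (grid size and smoother defaults are arbitrary):
library(mgcv)
fit <- gam(y ~ s(x))
grid <- seq(min(x), max(x), length.out = 500)
pred <- predict(fit, newdata = data.frame(x = grid))
slope <- diff(pred) / diff(grid) # finite-difference estimate of the local slope
plot(grid[-1], slope, type = "l")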
I don't see the need to reinvent the wheel. More importantly, using existing packages means that you'll be taking into account issues like bias at the endpoints, and at turning points in the curve (a local linear fit will be biased around a local maximum/minimum); weighting schemes to use; etc. You can also take advantage of the standard tools for model-building and checking, like cross-validation and so on.
The idea is to call lm.fit only once, with a matrix of responses. This is faster by a factor of about 2, but it assumes that the y values are non-zero. If zero values are possible, you could check for that and fall back to optWinLinFit0.
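As a toy illustration of the key fact (made-up data, not part of the solution): lm.fit accepts a matrix of responses and fits one regression per column.
X <- cbind(1, 0:4)
Y <- cbind(2 * (0:4) + 1, -(0:4) + 3) # two response columns
lm.fit(X, Y)$coefficients # one coefficient column per response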
optWinLinFit1 <- function(x, y, win_length) {
  xfit <- x[seq_len(win_length)]
  xfit <- xfit - min(xfit)
  # get all windows of y values as columns of one matrix
  mat <- outer(y, rep(1, length(y)))
  require(Matrix)
  mat <- band(mat, k1 = 0, k2 = win_length - 1)
  mat <- as.matrix(mat)
  mat <- mat[, -seq_len(win_length - 1)]  # drop incomplete leading windows
  nc <- ncol(mat)
  mat <- matrix(mat[mat != 0], ncol = nc) # relies on all y values being non-zero
  # one regression with a matrix of responses
  fit <- lm.fit(cbind(1, xfit), mat)
  # find the fit with the smallest sigma^2
  winner <- which.min(1 / (win_length - 2) * colSums(fit$residuals^2))
  y <- mat[, winner]
  # return fit summary and predicted values
  list(n = winner,
       summary = summary(lm(y ~ xfit)),
       dat = data.frame(x = x[seq_len(win_length) + winner - 1],
                        y = y,
                        ypred = fit$fitted.values[, winner]))
}
res1 <- optWinLinFit1(x, y, 180)
all.equal(res0$dat$ypred, res1$dat$ypred)
#[1] TRUE
library(microbenchmark)
microbenchmark(optWinLinFit0(x, y, 180), optWinLinFit1(x, y, 180), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# optWinLinFit0(x, y, 180) 30.90678 31.73952 31.83930 35.61465 35.90352 10
# optWinLinFit1(x, y, 180) 12.76270 14.70842 15.70562 16.06347 17.41174 10