I'm trying to fit a model using loess, and I'm getting errors such as "pseudoinverse used at 3", "neighborhood radius 1", and "reciprocal condition number 0". Here's a MWE:
x = 1:19
y = c(NA,71.5,53.1,53.9,55.9,54.9,60.5,NA,NA,NA
,NA,NA,178.0,180.9,180.9,NA,NA,192.5,194.7)
fit = loess(formula = y ~ x,
control = loess.control(surface = "direct"),
span = 0.3, degree = 1)
x2 = seq(0,20,.1)
library(ggplot2)
qplot(x=x2
,y=predict(fit, newdata=data.frame(x=x2))
,geom="line")
I realize I can fix these errors by choosing a larger span value. However, I'm trying to automate this fit, as I have about 100,000 time series (each of length about 20) similar to this. Is there a way that I can automatically choose a span value that will prevent these errors while still providing a fairly flexible fit to the data? Or, can anyone explain what these errors mean? I did a bit of poking around in the loess() and simpleLoess() functions, but I gave up at the point when C code was called.
The name 'loess' stands for Locally Weighted Least Squares Regression. So, it uses more local data to estimate our Y variable. But it is also known as a variable bandwidth smoother, in that it uses a 'nearest neighbors' method to smooth.
Loess Regression is the most common method used to smoothen a volatile time series. It is a non-parametric methods where least squares regression is performed in localized subsets, which makes it a suitable candidate for smoothing any numerical vector.
lowess is for adding a smooth curve to a scatterplot, i.e., for univariate smoothing. loess is for fitting a smooth surface to multivariate data. Both algorithms use locally-weighted polynomial regression, usually with robustifying iterations.
Compare fit$fitted
to y
. You'll notice that something is wrong with your regression. Choose adequate bandwidth, otherwise it'll just interpolate the data. With too few data points, linear function behaves like constant on small bandwidth and triggers collinearity. Thus, you see the errors warning pseudoinverses, singularities. You wont see such errors if you use degree=0
or ksmooth
. One intelligible, data-driven choice of span
is to use to cross-validation, about which you can ask at Cross Validated.
> fit$fitted
[1] 71.5 53.1 53.9 55.9 54.9 60.5 178.0 180.9 180.9 192.5 194.7
> y
[1] NA 71.5 53.1 53.9 55.9 54.9 60.5 NA NA NA NA NA 178.0
[14] 180.9 180.9 NA NA 192.5 194.7
You see over-fit( perfect-fit) because in your model number of parameters are as many as effective sample size.
fit
#Call:
#loess(formula = y ~ x, span = 0.3, degree = 1, control = loess.control(surface = "direct"))
#Number of Observations: 11
#Equivalent Number of Parameters: 11
#Residual Standard Error: Inf
Or, you might as well just use automated geom_smooth
. (again setting geom_smooth(span=0.3)
throws warnings)
ggplot(data=data.frame(x, y), aes(x, y)) +
geom_point() + geom_smooth()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With