I was looking for a way to do a linear regression under positivity constraints and came across the nnls approach. However, I was wondering how I could get the same statistics from the nnls fit as those provided by an lm object, more specifically the R-squared, the Akaike information criterion (AIC), the p-values and the confidence intervals.
library(arm)
library(nnls)

# simulate a small data set
data = runif(100*4, min = -1, max = 1)
data = matrix(data, ncol = 4)
colnames(data) = c("y", "x1", "x2", "x3")
data = as.data.frame(data)
data$x1 = -data$y          # x1 is perfectly (negatively) correlated with y

# non-negative least squares fit
A = as.matrix(data[, c("x1", "x2", "x3")])
b = data$y
test = nnls(A, b)
print(test)
Is there a way to re-estimate this in an lm framework? Using offset and fixing the coefficient did not work. Is there another way to obtain these statistics, or another way to create an lm object with positivity constraints on the coefficients?
Thanks, Romain.
What you are proposing to do is a massively bad idea, so much so that I'm reluctant to show you how to do it. The reason is that for OLS, assuming the residuals are normally distributed with constant variance, the parameter estimates follow a multivariate t-distribution and we can calculate confidence limits and p-values in the usual way.
However, if we perform NNLS on the same data, the residuals will not be normally distributed, and the standard techniques for calculating p-values, etc. will produce garbage. There are methods for estimating confidence limits on the parameters of an NNLS fit (see this reference for instance), but they are approximate and usually rely on fairly restrictive assumptions about the dataset.
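If you do need interval estimates for the NNLS coefficients despite this, one option that avoids the normal-theory formulas is a nonparametric bootstrap. The sketch below is my own illustration, not something taken from that reference: it resamples rows, refits nnls(...) each time, and takes percentile intervals (which will be degenerate for coefficients pinned at zero), assuming a design matrix A and response vector b as in your example.
set.seed(42)
n <- length(b)
B <- 999                                   # number of bootstrap resamples (arbitrary)
boot <- replicate(B, {
  idx <- sample(n, replace = TRUE)         # resample rows with replacement
  nnls(A[idx, , drop = FALSE], b[idx])$x   # refit and keep the coefficient vector
})
apply(boot, 1, quantile, probs = c(0.025, 0.975))   # crude percentile intervals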
On the other hand, it would be nice if some of the more basic functions for an lm object, such as predict(...), coef(...), residuals(...), etc. also worked for the result of an NNLS fit. One way to achieve that is to use nls(...): just because a model is linear in the parameters does not mean you cannot use non-linear least squares to find the parameters. nls(...) offers the option to set lower (and upper) limits on the parameters if you use the port algorithm.
set.seed(1)   # for reproducible example
data <- as.data.frame(matrix(runif(1e4, min = -1, max = 1), ncol = 4))
colnames(data) <- c("y", "x1", "x2", "x3")
data$y <- with(data, -10*x1 + x2 + rnorm(2500))   # true model: y = -10*x1 + x2 + N(0,1) noise

A <- as.matrix(data[, c("x1", "x2", "x3")])
b <- data$y
test <- nnls(A, b)
test
# Nonnegative least squares model
# x estimates: 0 1.142601 0
# residual sum-of-squares: 88391
# reason terminated: The solution has been computed sucessfully.
fit <- nls(y~b.1*x1+b.2*x2+b.3*x3,data,algorithm="port",lower=c(0,0,0))
fit
# Nonlinear regression model
# model: y ~ b.1 * x1 + b.2 * x2 + b.3 * x3
# data: data
# b.1 b.2 b.3
# 0.000 1.143 0.000
# residual sum-of-squares: 88391
As you can see, the result of using nnls(...) and the result of using nls(...) with lower = c(0,0,0) are identical. But nls(...) produces an nls object, which supports (most of) the same methods as an lm object. So you can write predict(fit), coef(fit), residuals(fit), AIC(fit), etc. You can also write summary(fit) and confint(fit), but beware: the values you get are not meaningful!!!
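A quick illustration with the objects defined above (test$x holds the NNLS coefficient vector; the extractor calls below are standard methods for nls objects):
# coefficients from the two fits, side by side
cbind(nnls = test$x, nls = coef(fit))

# the usual lm-style extractors work on the nls object
head(predict(fit))
head(residuals(fit))
AIC(fit)
summary(fit)   # runs, but the standard errors and p-values are not meaningful here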
To illustrate the point about the residuals, we compare the residuals for an OLS fit to this data, with the residuals for the NNLS fit.
par(mfrow=c(1,2),mar=c(3,4,1,1))
qqnorm(residuals(lm(y~.,data)),main="OLS"); qqline(residuals(lm(y~.,data)))
qqnorm(residuals(fit),main="NNLS"); qqline(residuals(fit))
In this dataset, the stochastic part of the variability in y is N(0,1) by design, so the residuals from the OLS fit (Q-Q plot on the left) are normal. But the residuals from the same dataset fitted using NNLS are not remotely normal. This is because the true coefficient of x1 is -10, but the NNLS fit forces it to 0. Consequently, the proportion of very large residuals (both positive and negative) is much higher than would be expected from the normal distribution.
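If you prefer a numerical check to a visual one, a formal normality test on the two residual vectors tells the same story (the choice of shapiro.test is my own; it applies here since it is limited to 5000 observations and this data set has 2500 rows):
ols <- lm(y ~ ., data)
shapiro.test(residuals(ols))   # OLS residuals: typically consistent with normality
shapiro.test(residuals(fit))   # NNLS/nls residuals: normality should be strongly rejected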