I'm trying to understand the output from predict(), as well as understand whether this approach is appropriate for the problem I'm trying to solve. The prediction intervals don't make sense to me, but when I plot this on a scatterplot it looks like a good model:
I created a simple linear regression model of deal size ($) with a company's sales volume as a predictor variable. The data is faked, with deal size being a multiple of sales volume plus or minus some noise:
Call:
lm(formula = deal_size ~ sales_volume, data = accounts)
Residuals:
Min 1Q Median 3Q Max
-19123502 -3794671 -3426616 4838578 17328948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.709e+06 1.727e+05 21.48 <2e-16 ***
sales_volume 1.898e-01 2.210e-03 85.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6452000 on 1586 degrees of freedom
Multiple R-squared: 0.823, Adjusted R-squared: 0.8229
F-statistic: 7376 on 1 and 1586 DF, p-value: < 2.2e-16
The predictions were generated thusly:
d = data.frame(accounts, predict(fit, interval="prediction"))
When I plot sales_volume vs. deal_size on a scatterplot, and overlay the regression line with the prediction interval, it looks good, except for a few intervals that span negative values where sales is at or near zero.
I understand fit is the predicted value, but what are lwr and upr? Do they define the intervals in absolute terms (y coordinates)? The intervals seem to be extremely wide, wider than would make sense if my model was a good fit:
sales_volume deal_size fit lwr upr
0 0 3709276.494 -8950776.04 16369329.03
0 8586337.22 3709276.494 -8950776.04 16369329.03
110000 549458.6512 3730150.811 -8929897.298 16390198.92
When you use predict
with an lm
model, you can specify an interval
. You have three choices: none
will not return intervals, confidence
and prediction
. Both of those will return different values. The first column will be as you said the predicted values (column fit
). You then have two other columns : lwr
and upper
which are the lower and upper levels of the confidence intervals.
What is the difference between confidence
and prediction
?
confidence
is a (by default 95%, use level
if you wish to change that) confidence interval of the mean of the predicted value. It is the green interval on your plot. Whereas prediction
is a (also 95%) confidence interval of all your values, meaning that should you repeat your experience/survey/... a huge number of times, you can expect that 95% of your values will fall in the yellow interval, thus making it a lot wider than the green one as the green one only evaluates the mean.
And as you an see on your plot, almost all values are in the yellow interval. R doesn't know that your values can only be positive so it explains why the yellow interval "begins" under 0.
Also, when you say "The intervals seem to be extremely wide, wider than would make sense if my model was a good fit", you can see in your plot that the interval is not that big, considering that you can expect 95% of the values to be in it, and you can clearly see a trend in your data. And your model is clearly a good fit as the adjusted R squared and the global p-value tells you.
Just a slight rephrasing of @etienne above, which is very good and accurate.
Confidence interval is the (1-alpha; eg 95%) interval for the mean prediction (or group response). IE if you have 10 new companies with sales volume of 2e+08
the predict(..., interval= "confidence")
interval will give you the long-run average interval for your group mean.
With Var(\hat y|X= x*) = \sigma^2 (1/n + (x*-\bar x)^2 / SXX)
The prediction interval is the (1-alpha; eg 95%) interval for an individual response -- predict(..., interval= "predict")
. IE for a single new company with sales volume of 2e+08
...
With Var(\hat y|X= x*) = \sigma^2 (1 + 1/n + (x*-\bar x)^2 / SXX)
(Sorry that LaTeX isn't supported)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With