Understanding output of 'predict' in R

Tags: r, prediction

I'm trying to understand the output from predict(), as well as understand whether this approach is appropriate for the problem I'm trying to solve. The prediction intervals don't make sense to me, but when I plot this on a scatterplot it looks like a good model:

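[Figure: scatterplot of deal_size vs. sales_volume with the fitted regression line, a green confidence band and a yellow prediction band]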

I created a simple linear regression model of deal size ($) with a company's sales volume as a predictor variable. The data is faked, with deal size being a multiple of sales volume plus or minus some noise:

Call:
lm(formula = deal_size ~ sales_volume, data = accounts)

Residuals:
      Min        1Q    Median        3Q       Max 
-19123502  -3794671  -3426616   4838578  17328948 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.709e+06  1.727e+05   21.48   <2e-16 ***
sales_volume 1.898e-01  2.210e-03   85.88   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6452000 on 1586 degrees of freedom
Multiple R-squared:  0.823, Adjusted R-squared:  0.8229 
F-statistic:  7376 on 1 and 1586 DF,  p-value: < 2.2e-16

The predictions were generated as follows:

d = data.frame(accounts, predict(fit, interval="prediction"))

When I plot sales_volume vs. deal_size on a scatterplot, and overlay the regression line with the prediction interval, it looks good, except for a few intervals that span negative values where sales is at or near zero.

I understand fit is the predicted value, but what are lwr and upr? Do they define the intervals in absolute terms (y coordinates)? The intervals seem to be extremely wide, wider than would make sense if my model was a good fit:

sales_volume  deal_size     fit           lwr            upr
0             0             3709276.494   -8950776.04    16369329.03
0             8586337.22    3709276.494   -8950776.04    16369329.03
110000        549458.6512   3730150.811   -8929897.298   16390198.92
asked Dec 25 '22 12:12 by Nate Reed


2 Answers

When you call predict() on an lm model, you can set the interval argument to one of three values: "none" (the default) returns no interval, while "confidence" and "prediction" each return an interval, and the two differ. The first column, fit, holds the predicted values, as you said. The other two columns, lwr and upr, are the lower and upper bounds of the requested interval, expressed on the scale of the response (so yes, they are y coordinates).
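A minimal sketch of the three options, using simulated data rather than the asker's accounts data (the data frame, coefficients and noise level below are made up for illustration):

```r
# Simulated data loosely mimicking the question: deal size is a multiple of
# sales volume plus noise. All names and numbers here are assumptions.
set.seed(1)
accounts <- data.frame(sales_volume = runif(200, min = 0, max = 2e8))
accounts$deal_size <- 0.19 * accounts$sales_volume + rnorm(200, sd = 6e6)

fit <- lm(deal_size ~ sales_volume, data = accounts)

# Predictor values at which we want predictions
new_accounts <- data.frame(sales_volume = c(0, 1e8, 2e8))

# Default (interval = "none"): a plain vector of fitted values
p_none <- predict(fit, newdata = new_accounts)

# With an interval: a matrix with columns fit, lwr, upr
p_conf <- predict(fit, newdata = new_accounts, interval = "confidence")
p_pred <- predict(fit, newdata = new_accounts, interval = "prediction")

print(p_conf)
print(p_pred)
```

The fit column is the same in both matrices; only lwr and upr change, and the prediction bounds lie outside the confidence bounds.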

What is the difference between confidence and prediction?

confidence gives a confidence interval for the mean response at a given predictor value (95% by default; use the level argument to change that). It is the green interval on your plot. prediction gives a prediction interval (also 95% by default) for an individual new observation: if you repeated your experiment or survey a large number of times, you could expect about 95% of the new values to fall inside the yellow interval. That makes it much wider than the green one, because the green one only reflects uncertainty about the mean, whereas the yellow one also includes the scatter of individual observations around that mean.

And as you can see on your plot, almost all values are inside the yellow interval. R doesn't know that your values can only be positive, which is why the yellow interval "begins" below 0.

Also, when you say "The intervals seem to be extremely wide, wider than would make sense if my model was a good fit", you can see in your plot that the interval is not that wide, considering that you can expect 95% of the values to fall inside it, and you can clearly see a trend in your data. Your model is clearly a good fit, as the adjusted R-squared and the overall p-value tell you.
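As a rough sanity check on the width, the half-width of a 95% prediction interval is approximately the t quantile times the residual standard error. The sketch below uses only numbers copied from the question's summary() output and its first table row. The approximation ignores the small extra terms for uncertainty in the fitted line, so the two values should be close but not identical:

```r
# Numbers copied from the summary() output in the question
sigma_hat <- 6452000   # residual standard error
df_resid  <- 1586      # residual degrees of freedom

# Approximate half-width of a 95% prediction interval, ignoring the small
# extra terms that account for uncertainty in the fitted line
half_width <- qt(0.975, df_resid) * sigma_hat

# Half-width reported in the question for sales_volume = 0 (upr - fit)
reported_half_width <- 16369329.03 - 3709276.494

print(half_width)
print(reported_half_width)
```

In other words, the interval is about two residual standard errors on either side of the fitted value, so its width follows from the noise in the data and does not point to a poor fit.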

answered Jan 11 '23 18:01 by etienne


Just a slight rephrasing of @etienne's answer above, which is very good and accurate.

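The confidence interval is the (1 - alpha; e.g. 95%) interval for the mean response at a given predictor value (the group response). For example, if you have 10 new companies, each with a sales volume of 2e+08, predict(..., interval = "confidence") gives you an interval for the mean deal size of that group; over many repeated samples, about 95% of such intervals would cover the true mean.

Its width is based on Var(\hat{y} | X = x^*) = \sigma^2 (1/n + (x^* - \bar{x})^2 / SXX), where SXX = \sum_i (x_i - \bar{x})^2.

The prediction interval is the (1 - alpha; e.g. 95%) interval for an individual response: predict(..., interval = "prediction"). That is the interval for a single new company with a sales volume of 2e+08.

Its width is based on Var(y^* - \hat{y} | X = x^*) = \sigma^2 (1 + 1/n + (x^* - \bar{x})^2 / SXX), where y^* is the new response; the extra 1 is the variance of the new observation itself.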

(Sorry that LaTeX isn't supported)
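The two variance formulas can be checked numerically against predict(). The following is a sketch on simulated data (the sample size, coefficients and noise level are assumptions chosen for illustration), with \sigma replaced by its estimate, the residual standard error:

```r
# Simulated simple linear regression; all numbers here are made up
set.seed(42)
n <- 100
x <- runif(n, 0, 2e8)
y <- 3.7e6 + 0.19 * x + rnorm(n, sd = 6e6)
fit <- lm(y ~ x)

x_star <- 2e8    # predictor value of the new company
alpha  <- 0.05   # 1 - alpha = 95% intervals

sigma_hat <- summary(fit)$sigma            # estimate of sigma
sxx       <- sum((x - mean(x))^2)          # SXX
y_hat     <- unname(coef(fit)[1] + coef(fit)[2] * x_star)
t_crit    <- qt(1 - alpha / 2, df = fit$df.residual)

# Standard errors from the two variance formulas
se_mean <- sigma_hat * sqrt(1 / n + (x_star - mean(x))^2 / sxx)      # mean response
se_new  <- sigma_hat * sqrt(1 + 1 / n + (x_star - mean(x))^2 / sxx)  # single new response

manual_conf <- c(fit = y_hat, lwr = y_hat - t_crit * se_mean, upr = y_hat + t_crit * se_mean)
manual_pred <- c(fit = y_hat, lwr = y_hat - t_crit * se_new,  upr = y_hat + t_crit * se_new)

# The same intervals from predict()
new_x <- data.frame(x = x_star)
builtin_conf <- predict(fit, new_x, interval = "confidence", level = 1 - alpha)
builtin_pred <- predict(fit, new_x, interval = "prediction", level = 1 - alpha)

print(rbind(manual_conf, builtin_conf))
print(rbind(manual_pred, builtin_pred))
```

The manual and built-in rows should agree up to floating-point error. The only difference between the two intervals is the extra 1 inside the square root.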

answered Jan 11 '23 16:01 by Alex W