R: Regression with a holdout of certain variables

Question

I'm fitting a multiple linear regression model using lm(). Y is the response variable (e.g. return of interests) and the others are explanatory variables (100+ cases, 30+ variables).

Certain variables are considered key variables (concerning investment). When I ran lm(), R returned a model with an adjusted R-squared of 97%, but some of the key variables are not significant predictors.

Is there a way to run the regression while keeping all of the key variables in the model (as significant predictors)? It doesn't matter if the adjusted R-squared decreases.

If regression doesn't work, is there another methodology?

thank you!

==========================

the data set is uploaded https://www.dropbox.com/s/gh61obgn2jr043y/df.csv

==========================

Additional question: what if some variables carry an effect over from the previous period into the current period? Example: someone takes a pill in the morning at breakfast, and its effect might still last after lunch (when he/she takes the 2nd pill). I suppose I need to transform the data:

* My first choice is to add a carry-over rate: obs.2_trans = obs.2 + c-o rate * obs.1
* Maybe I also need to consider the decay of the pill effect itself, so an s-curve or an exponential transformation is also necessary.

Take variable Main1 for example: I can use trial and error to find an ideal c-o rate and s-curve parameter, starting from 0.5 and testing in steps of 0.05, up to 1 or down to 0, until I get the best model score (say, the lowest AIC or the highest R-squared). This is already a large number of combinations to test. If I need to test more than 3 variables at the same time, how could I manage that in R?

Thank you!

jlhoward · Accepted Answer

First, a note on "significance". For each variable included in a model, the linear modeling packages report a p-value: the probability of getting a coefficient estimate at least this far from zero if the true coefficient were zero. We say that, if p is smaller, then the coefficient is "more significant". So, while it is quite reasonable to talk about one variable being "more significant" than another, there is no absolute standard for asserting "significant" vs. "not significant". In most scientific research, the cutoff is p < 0.05. But this is completely arbitrary, and there are many exceptions. Recall that CERN was unwilling to assert the existence of the Higgs boson until they had collected enough data to demonstrate its effect at 5-sigma. This corresponds roughly to p < 3 × 10^-7. At the other extreme, some social science studies assert significance at p < 0.2 (because of the higher inherent variability and usually small number of samples). So excluding a variable from a model just because it is "not significant" at some fixed cutoff has no absolute meaning. On the other hand, you would be hard pressed to include a variable with high p while excluding another variable with lower p.

Second, if your variables are highly correlated (which they are in your case), then it is quite common that removing one variable from a model changes all the p-values greatly. A retained variable that had a high p-value (less significant) might suddenly have a low p-value (more significant), just because you removed a completely different variable from the model. Consequently, trying to optimize a fit manually is usually a bad idea.
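To make this concrete, here is a minimal sketch on made-up data (x1, x2 and y are synthetic, not columns from your file). x2 is built as a near copy of x1, so the two are almost perfectly correlated, and we compare the p-value of x1 with and without x2 in the model.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1 (synthetic)
y  <- 2 * x1 + rnorm(n)          # only x1 drives y

fit.both <- lm(y ~ x1 + x2)      # both correlated variables in the model
fit.x1   <- lm(y ~ x1)           # x2 removed

# p-value of x1 in each model
p.both <- summary(fit.both)$coefficients["x1", "Pr(>|t|)"]
p.x1   <- summary(fit.x1)$coefficients["x1", "Pr(>|t|)"]
c(with.x2 = p.both, without.x2 = p.x1)
```

The standard error of x1 is heavily inflated while x2 is present, so its p-value should drop sharply once x2 is removed, even though nothing about x1 itself has changed.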

Fortunately, there are many algorithms that do this for you. One popular approach starts with a model that has all the variables. At each step, the least significant variable is removed and the resulting model is compared to the model at the previous step. If removing this variable significantly degrades the model, based on some metric, the process stops. A commonly used metric is the Akaike information criterion (AIC), and in R we can optimize a model based on the AIC using stepAIC(...) in the MASS package.
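If you want the key variables to stay in the model no matter what, one option is the lower component of the scope argument of stepAIC(...), which names terms the search is not allowed to drop. The sketch below uses a synthetic data frame d with hypothetical column names (Main1, Main2, Other1, Other2), not your data. Note that this only keeps the variables in the model; it does not make them significant.

```r
library(MASS)

set.seed(42)
n <- 120
# Synthetic stand-in data; the column names are hypothetical
d <- data.frame(Main1 = rnorm(n), Main2 = rnorm(n),
                Other1 = rnorm(n), Other2 = rnorm(n))
d$Y <- 3 * d$Main1 + 0.5 * d$Other1 + rnorm(n)   # Main2 has no real effect

full.fit <- lm(Y ~ ., data = d)

# Unconstrained search: any variable may be dropped
free.fit <- stepAIC(full.fit, trace = 0)

# Constrained search: Main1 and Main2 can never be dropped
kept.fit <- stepAIC(full.fit,
                    scope = list(lower = ~ Main1 + Main2),
                    trace = 0)

attr(terms(free.fit), "term.labels")   # variables chosen freely
attr(terms(kept.fit), "term.labels")   # always contains Main1 and Main2
```

With your data, the lower formula would list your own key variables instead.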

Third, the validity of regression models depends on certain assumptions, especially these two: the error variance is constant (does not depend on y), and the distribution of the errors is approximately normal. If these assumptions are not met, the p-values are unreliable. Once we have fitted a model we can check these assumptions using a residual plot and a Q-Q plot. It is essential that you do this for any candidate model!

Finally, the presence of outliers frequently distorts the model significantly (almost by definition!). This problem is amplified if your variables are highly correlated. So in your case it is very important to look for outliers, and see what happens when you remove them.
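Besides reading the diagnostic plots, you can flag candidate rows numerically. Here is a minimal sketch using Cook's distance on synthetic data with one planted bad point; the 4/n cutoff is just a common rule of thumb (an assumption on my part, not a hard rule), and flagged rows should be inspected, not deleted automatically.

```r
set.seed(3)
n <- 50
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5)
d$x[50] <- 8      # plant one high-leverage point ...
d$y[50] <- -10    # ... that breaks the trend

fit <- lm(y ~ x, data = d)

cd   <- cooks.distance(fit)   # influence of each row on the fitted coefficients
keep <- cd <= 4 / n           # rule-of-thumb cutoff (assumption)
which(!keep)                  # rows worth a closer look

refit <- lm(y ~ x, data = d[keep, ])   # same model without the flagged rows
rbind(all.rows = coef(fit), flagged.removed = coef(refit))
```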

The code below rolls this all up.

library(MASS)
url <- "https://dl.dropboxusercontent.com/s/gh61obgn2jr043y/df.csv?dl=1&token_hash=AAGy0mFtfBEnXwRctgPHsLIaqk5temyrVx_Kd97cjZjf8w&expiry=1399567161"
df <- read.csv(url)
initial.fit <- lm(Y~.,df[,2:ncol(df)]) # fit with all variables (excluding PeriodID)
final.fit   <- stepAIC(initial.fit)    # best fit based on AIC
par(mfrow=c(2,2))
plot(initial.fit)                      # diagnostic plots for base model
plot(final.fit)                        # same for best model
summary(final.fit)
# ...
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  11.38360   18.25028   0.624  0.53452    
# Main1       911.38514  125.97018   7.235 2.24e-10 ***
# Main3         0.04424    0.02858   1.548  0.12547    
# Main5         4.99797    1.94408   2.571  0.01195 *  
# Main6         0.24500    0.10882   2.251  0.02703 *  
# Sec1        150.21703   34.02206   4.415 3.05e-05 ***
# Third2       -0.11775    0.01700  -6.926 8.92e-10 ***
# Third3       -0.04718    0.01670  -2.826  0.00593 ** 
# ... (many other variables included)
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 22.76 on 82 degrees of freedom
# Multiple R-squared:  0.9824,  Adjusted R-squared:  0.9779 
# F-statistic:   218 on 21 and 82 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(initial.fit)                      # four diagnostic plots, base model
title("Base Model",outer=TRUE,line=-2)
plot(final.fit)                        # four diagnostic plots, best model
title("Best Model (AIC)",outer=TRUE,line=-2)

So you can see from this that the "best model", based on the AIC metric, does in fact include Main1, Main3, Main5, and Main6, but not Main2 and Main4. The residuals plot shows no dependence on y (which is good), and the Q-Q plot demonstrates approximate normality of the residuals (also good). On the other hand, the Leverage plot shows a couple of points (rows 33 and 85) with exceptionally high leverage, and the Q-Q plot shows these same points and row 47 as having residuals not really consistent with a normal distribution. So we can re-run the fits excluding these rows as follows.

initial.fit <- lm(Y~.,df[c(-33,-47,-85),2:ncol(df)]) # refit without rows 33, 47 and 85
final.fit   <- stepAIC(initial.fit,trace=0)          # trace=0 suppresses the step-by-step output
summary(final.fit)
# ...
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   27.11832   20.28556   1.337 0.185320    
# Main1       1028.99836  125.25579   8.215 4.65e-12 ***
# Main2          2.04805    1.11804   1.832 0.070949 .  
# Main3          0.03849    0.02615   1.472 0.145165    
# Main4         -1.87427    0.94597  -1.981 0.051222 .  
# Main5          3.54803    1.99372   1.780 0.079192 .  
# Main6          0.20462    0.10360   1.975 0.051938 .  
# Sec1         129.62384   35.11290   3.692 0.000420 ***
# Third2        -0.11289    0.01716  -6.579 5.66e-09 ***
# Third3        -0.02909    0.01623  -1.793 0.077060 .  
# ... (many other variables included)

So excluding these rows results in a fit that has all the "Main" variables with p < 0.2, and all except Main 3 at p < 0.1 (90%). I'd want to look at these three rows and see if there is a legitimate reason to exclude them.

Finally, just because you have a model that fits your existing data well does not mean that it will perform well as a predictive model. In particular, if you are trying to make predictions outside of the "model space" (equivalent to extrapolation), then your predictive power is likely to be poor.
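One rough way to see the gap between fit and prediction is to hold some rows back, fit on the rest, and compare the errors. The sketch below uses synthetic data (30 predictors, of which only V1 matters) and sizes I picked for illustration; it is not a substitute for proper validation on your own data.

```r
set.seed(11)
n.train <- 70
n.test  <- 200
p       <- 30
n       <- n.train + n.test

# Synthetic predictors V1..V30; only V1 truly affects Y
X   <- as.data.frame(matrix(rnorm(n * p), n, p))
X$Y <- 2 * X$V1 + rnorm(n)

train <- X[seq_len(n.train), ]
test  <- X[-seq_len(n.train), ]

fit <- lm(Y ~ ., data = train)   # fit on the training rows only

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse.train <- rmse(train$Y, fitted(fit))                  # error on data the model has seen
rmse.test  <- rmse(test$Y, predict(fit, newdata = test))  # error on held-out data
c(train = rmse.train, test = rmse.test)
```

With many predictors and few rows, the error on the held-out rows will typically be noticeably larger than the in-sample error.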

imgschatz · Answer

Significance is determined by the relationships in your data, not by "I want them to be significant".

If the data says they are insignificant, then they are insignificant.

You are going to have a hard time getting any significance with 30 variables and only 100 observations. With only 100+ observations, you should only be using a few variables. With 30 variables, you'd need thousands of observations to get any significance.

Maybe start with the variables you think should be significant, and see what happens.

Tags: r, regression

Elliott
