I have an outcome variable, say Y, and a list of 100 dimensions that could affect Y (say X1...X100).
After running my glm and viewing a summary of my model, I can see which variables are statistically significant. I would like to select those variables, run another model with only them, and compare performance. Is there a way to parse the model summary and select only the significant ones?
The overall F-test determines whether the relationship between the model and the response is statistically significant. If the p-value for the overall F-test is less than your significance level, you can conclude that the R-squared value is significantly different from zero.
When building a linear or logistic regression model, you should consider including: variables that are already proven in the literature to be related to the outcome; variables that can be considered the cause of the exposure, the outcome, or both; and interaction terms of variables that have large main effects.
Generally, the variable with the highest correlation with the outcome is a good predictor. You can also compare coefficients to select the best predictor (make sure you have normalized the data before running the regression, and compare the absolute values of the coefficients). You can also look at the change in the R-squared value.
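As a rough illustration of the coefficient comparison described above, here is a minimal sketch. The simulated data, the seed, and the names `a`, `b`, `d.std` and `fit.std` are all assumptions made for the example; only base R functions (`scale`, `lm`, `coef`) are used.

```r
set.seed(7)  # assumption: seed chosen only to make the sketch reproducible
n <- 100

# Two predictors on very different scales (illustrative data)
d <- data.frame(a = rnorm(n, sd = 1), b = rnorm(n, sd = 50))
d$y <- 3 * d$a + 0.01 * d$b + rnorm(n)

# Standardize the predictors so their coefficients are comparable
d.std <- d
d.std[c("a", "b")] <- as.data.frame(scale(d[c("a", "b")]))

fit.std <- lm(y ~ a + b, data = d.std)

# Rank predictors by absolute standardized coefficient (intercept dropped)
ranked <- sort(abs(coef(fit.std)[-1]), decreasing = TRUE)
print(ranked)
```

Without the standardization step, the raw coefficient of `b` would look tiny simply because `b` is measured on a larger scale.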
Although @kith paved the way, there is more that can be done. Actually, the whole process can be automated. First, let's create some data:
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
x4 <- y + 5 # this will make a nice significant variable to test our code
(mydata <- as.data.frame(cbind(x1,x2,x3,x4,y)))
Our model is then:
model <- glm(formula=y~x1+x2+x3+x4,data=mydata)
A Boolean vector marking which coefficients are significant at the 5% level (with the intercept row dropped) can then be extracted by:
toselect.x <- summary(model)$coeff[-1,4] < 0.05 # credit to kith
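The line above picks the p-values by position (column 4). A slightly more defensive variant, sketched below, looks the column up by name instead, since the label differs between families ("Pr(>|t|)" for a gaussian glm, "Pr(>|z|)" for a binomial one). The data, the seed, and the names `d`, `fit`, `p.col` and `significant` are assumptions for illustration only.

```r
set.seed(1)  # assumption: illustrative seed and simulated data
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)

fit <- glm(y ~ x1 + x2, data = d)
coefs <- coef(summary(fit))  # coefficient matrix from the summary

# Find the p-value column by its name rather than by position
p.col <- grep("^Pr\\(", colnames(coefs))

# Named vector of p-values, with the intercept removed
p.values <- coefs[, p.col]
p.values <- p.values[names(p.values) != "(Intercept)"]

significant <- names(p.values)[p.values < 0.05]
print(significant)
```

Removing the intercept by name rather than with `[-1, ]` also keeps the names intact when the model has a single predictor.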
But this is not all! In addition, we can do this:
# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE]
# formula with only sig variables
sig.formula <- as.formula(paste("y ~", paste(relevant.x, collapse = " + ")))
EDIT: as subsequent posters have pointed out, the inner paste(relevant.x, collapse = " + ") is needed to include all significant variables; paste("y ~", relevant.x) on its own does not combine them into a single formula.
And run the regression with only significant variables as OP originally wanted:
sig.model <- glm(formula=sig.formula,data=mydata)
In this case the estimate for x4 will be equal to 1, because we defined x4 as y + 5, which implies a perfect linear relationship.
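Since the original question also asks about comparing performance, the steps can be wrapped in a small function and the two fits compared with `AIC`. This is only a sketch: `refit.significant` is a hypothetical helper, not part of any package, and it assumes numeric predictors (so that coefficient names match variable names) and a plain variable as the response. The simulated data avoid a perfect fit so that the p-values are meaningful.

```r
# Hypothetical helper: refit a glm using only predictors with p < alpha.
# Assumes numeric predictors and a simple response variable.
refit.significant <- function(model, data, alpha = 0.05) {
  coefs <- coef(summary(model))
  p.values <- coefs[, grep("^Pr\\(", colnames(coefs))]
  keep <- setdiff(names(p.values)[p.values < alpha], "(Intercept)")
  if (length(keep) == 0) {
    stop("no predictor is significant at alpha = ", alpha)
  }
  response <- all.vars(formula(model))[1]
  # reformulate() builds e.g. y ~ x1 + x4 from a character vector
  glm(reformulate(keep, response = response),
      data = data, family = family(model))
}

set.seed(42)  # assumption: illustrative seed and data
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1.5 * d$x1 + rnorm(n)

full <- glm(y ~ x1 + x2 + x3, data = d)
reduced <- refit.significant(full, d)

# Lower AIC suggests a better trade-off between fit and complexity
print(AIC(full, reduced))
```

Using `reformulate` avoids pasting the formula string by hand, and passing `family(model)` keeps the reduced model in the same family as the original.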