Selecting the statistically significant variables in an R glm model

Tags:

r

glm

I have an outcome variable, say Y, and a list of 100 dimensions that could affect Y (say X1...X100).

After running my glm and viewing a summary of my model, I see those variables that are statistically significant. I would like to be able to select those variables and run another model and compare performance. Is there a way I can parse the model summary and select only the ones that are significant?

Pritish Kakodkar asked Apr 22 '13


People also ask

How do you determine significant variables in regression?

The overall F-test determines whether the relationship between the outcome and the full set of predictors is statistically significant. If the p-value for the overall F-test is less than your significance level, you can conclude that the R-squared value is significantly different from zero.
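
For example, with a fitted lm (a Gaussian glm gives the same fit) you can pull the overall F statistic out of the summary and convert it to a p-value yourself. A minimal sketch using R's built-in mtcars data, with fit, fstat and p.overall as illustrative names:

fit <- lm(mpg ~ wt + hp, data = mtcars)
fstat <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
p.overall <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
p.overall                          # p-value of the overall F-test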

How do you choose the best variables for a linear regression?

When building a linear or logistic regression model, you should consider including: variables that are already shown in the literature to be related to the outcome; variables that can be considered a cause of the exposure, the outcome, or both; and interaction terms of variables that have large main effects.

How do you choose the best predictor variable?

Generally, the variable with the highest correlation with the outcome is a good predictor. You can also compare coefficients to select the best predictor (make sure you normalize the data before you run the regression, and take the absolute value of the coefficients). You can also look at the change in the R-squared value.
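
A minimal sketch of that idea, again using the built-in mtcars data with mpg as the outcome (cors and scaled.fit are just illustrative names): rank candidates by absolute correlation with the outcome, then compare coefficients fitted on standardized data.

cors <- sapply(mtcars[, -1], cor, y = mtcars$mpg)   # correlation of each predictor with mpg
sort(abs(cors), decreasing = TRUE)                  # highest correlation first

scaled.fit <- lm(mpg ~ ., data = as.data.frame(scale(mtcars)))
sort(abs(coef(scaled.fit)[-1]), decreasing = TRUE)  # standardized coefficients, intercept dropped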


1 Answer

Although @kith paved the way, there is more that can be done. Actually, the whole process can be automated. First, let's create some data:

# simulated example data
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y  <- rnorm(10)
x4 <- y + 5  # x4 is a linear function of y, so it will be a nicely significant variable to test our code
(mydata <- as.data.frame(cbind(x1, x2, x3, x4, y)))

Our model is then:

model <- glm(formula = y ~ x1 + x2 + x3 + x4, data = mydata)

And the Boolean vector of the coefficients can indeed be extracted by:

toselect.x <- summary(model)$coeff[-1, 4] < 0.05  # column 4 holds the p-values; [-1, ] drops the intercept (credit to kith)

But this is not all! In addition, we can do this:

# names of the significant variables
relevant.x <- names(toselect.x)[toselect.x == TRUE]
# formula containing only the significant variables
sig.formula <- as.formula(paste("y ~", paste(relevant.x, collapse = " + ")))

EDIT: as subsequent posters have pointed out, the paste() call needs collapse = "+" so that all significant variables, not just the first, end up in the formula; the line above already includes that fix.

And run the regression with only the significant variables, as the OP originally wanted:

sig.model <- glm(formula = sig.formula, data = mydata)

In this case the estimate for x4 will be equal to 1, since we defined x4 as y + 5, i.e. a perfect linear relationship.
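
The OP also wanted to compare the performance of the two models. The question does not name a metric, so as one reasonable choice, you can compare AIC and run an analysis-of-deviance test on the nested models:

AIC(model, sig.model)                 # lower AIC indicates the better trade-off
anova(sig.model, model, test = "F")   # does the full model improve the fit significantly?

(With this toy data the reduced model fits perfectly, so the numbers come out degenerate; on real data this is a standard comparison.)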

Maxim.K answered Oct 13 '22