
Stepwise regression using p-values to drop variables with nonsignificant p-values

Q: What is p-value in backward elimination?

The first step in backward elimination is simple: choose a significance level. A 5% level is the usual choice, which means a variable stays in the model only if its p-value is below 0.05. You can change this threshold depending on the project.

Q: What if p-value is high in regression?

A p-value below the significance level means the variable is statistically significant and probably a worthwhile addition to your regression model. On the other hand, a high p-value, one greater than the significance level, indicates that there is insufficient evidence in your sample to conclude that a non-zero relationship exists.
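As a minimal sketch of that comparison, the p-values of a fitted `lm` can be read from the coefficient table and checked against a threshold. The simulated data and the names `d`, `fit`, `alpha` and `significant` are assumptions made for illustration only.

```r
# Simulated data (illustrative assumption): y depends on x1 but not on x2.
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 2 * d$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = d)

alpha <- 0.05                                # chosen significance level
pvals <- coef(summary(fit))[, "Pr(>|t|)"]    # t-test p-value per coefficient
significant <- pvals < alpha                 # TRUE where p-value is below alpha

print(data.frame(p.value = pvals, significant = significant))
```

These are per-coefficient t-tests, so a factor with several levels gets several p-values rather than one.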

Tags: r, statistics, regression, p-value

I want to perform a stepwise linear regression using p-values as the selection criterion, i.e. at each step drop the variable with the highest (most insignificant) p-value, and stop when all remaining p-values are significant at some threshold alpha.

I am fully aware that I should use the AIC (e.g. the commands step or stepAIC) or some other criterion instead, but my boss has no grasp of statistics and insists on using p-values.

If necessary, I could program my own routine, but I am wondering if there is an already implemented version of this.


asked Sep 13 '10 14:09

DainisZ

2 Answers

Show your boss the following :

    set.seed(100)
    x1 <- runif(100,0,1)
    x2 <- as.factor(sample(letters[1:3],100,replace=T))
    y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
    summary(lm(y~x1*x2))

Which gives :

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -0.1525     0.3066  -0.498  0.61995
    x1            1.8693     0.6045   3.092  0.00261 **
    x2b           2.5149     0.4334   5.802 8.77e-08 ***
    x2c           0.3089     0.4475   0.690  0.49180
    x1:x2b       -1.1239     0.8022  -1.401  0.16451
    x1:x2c       -1.0497     0.7873  -1.333  0.18566
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, based on the p-values, which one would you exclude? x2 is at once the most significant term (x2b) and the most non-significant one (x2c).

Edit: To clarify: this example is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the t-test on the coefficients, but on the F-test after removal of one of the variables.
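A minimal sketch of such a term-level F-test, using `drop1()` from base R on the same simulated model. The names `fit` and `tab` are assumptions for illustration, and the exact numbers depend on the R version's random number generator, so none are quoted here.

```r
# Same simulated data as in the example above.
set.seed(100)
x1 <- runif(100, 0, 1)
x2 <- as.factor(sample(letters[1:3], 100, replace = TRUE))
y <- x1 + x1 * (x2 == "a") + 2 * (x2 == "b") + rnorm(100)

fit <- lm(y ~ x1 * x2)

# drop1() refits the model without each droppable term and reports one
# F-test per term, not one t-test per dummy coefficient.
tab <- drop1(fit, test = "F")
print(tab)
```

With its default scope, `drop1()` respects marginality, so only the interaction `x1:x2` is offered for removal here; the main effects become candidates only once the interaction is gone.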

I have a function that does exactly that: backward selection based on the F-test for removing a term. This is a selection on "the p-value", but not the p-value of the t-test on the coefficients or of the anova results. Feel free to use it if it looks useful to you.

    #####################################
    # Automated model selection
    # Author      : Joris Meys
    # version     : 0.2
    # date        : 12/01/09
    #####################################
    #CHANGE LOG
    # 0.2   : check for empty scopevar vector
    #####################################

    # Function has.interaction checks whether x is part of a term in terms
    # terms is a vector with names of terms from a model
    has.interaction <- function(x,terms){
        out <- sapply(terms,function(i){
            sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
        })
        return(sum(out)>0)
    }

    # Function Model.select
    # model is the lm object of the full model
    # keep is a list of model terms to keep in the model at all times
    # sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
    # verbose=T gives the F-tests, dropped var and resulting model after

    model.select <- function(model,keep,sig=0.05,verbose=F){
        counter=1
        # check input
        if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))
        # calculate scope for drop1 function
        terms <- attr(model$terms,"term.labels")
        if(missing(keep)){ # set scopevars to all terms
            scopevars <- terms
        } else{            # select the scopevars if keep is used
            index <- match(keep,terms)
            # check if all is specified correctly
            if(sum(is.na(index))>0){
                novar <- keep[is.na(index)]
                warning(paste(
                    c(novar,"cannot be found in the model",
                    "\nThese terms are ignored in the model selection."),
                    collapse=" "))
                index <- as.vector(na.omit(index))
            }
            scopevars <- terms[-index]
        }

        # Backward model selection :

        while(T){
            # extract the test statistics from drop.
            test <- drop1(model, scope=scopevars,test="F")

            if(verbose){
                cat("-------------STEP ",counter,"-------------\n",
                "The drop statistics : \n")
                print(test)
            }

            pval <- test[,dim(test)[2]]

            names(pval) <- rownames(test)
            pval <- sort(pval,decreasing=T)

            if(sum(is.na(pval))>0) stop(paste("Model",
                deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

            # check if all significant
            if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

            # select var to drop
            i=1
            while(T){
                dropvar <- names(pval)[i]
                check.terms <- terms[-match(dropvar,terms)]
                x <- has.interaction(dropvar,check.terms)
                if(x){i=i+1;next} else {break}
            } # end while(T) drop var

            if(pval[i]<sig) break # stops the loop if var to remove is significant

            if(verbose){
                cat("\n--------\nTerm dropped in step",counter,":",dropvar,"\n--------\n\n")
            }

            #update terms, scopevars and model
            scopevars <- scopevars[-match(dropvar,scopevars)]
            terms <- terms[-match(dropvar,terms)]

            formul <- as.formula(paste(".~.-",dropvar))
            model <- update(model,formul)

            if(length(scopevars)==0) {
                warning("All variables are thrown out of the model.\n",
                "No model could be specified.")
                return()
            }
            counter=counter+1
        } # end while(T) main loop
        return(model)
    }
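For comparison, here is a much shorter sketch of the same backward-elimination idea. It is not the function above: it has no `keep` or `verbose` arguments, and it relies on the default scope of `drop1()` to respect marginality instead of checking interactions itself. The name `backward_p`, its `alpha` argument and the simulated data are assumptions for illustration.

```r
# Hypothetical helper: backward elimination on drop1() F-test p-values.
backward_p <- function(model, alpha = 0.05) {
  repeat {
    # Stop if no terms are left to test (intercept-only model).
    if (length(attr(terms(model), "term.labels")) == 0) break
    tab <- drop1(model, test = "F")
    pvals <- tab[["Pr(>F)"]][-1]             # first row is "<none>"
    names(pvals) <- rownames(tab)[-1]
    if (length(pvals) == 0 || anyNA(pvals)) break
    if (max(pvals) < alpha) break            # every droppable term is significant
    worst <- names(pvals)[which.max(pvals)]  # least significant term
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Simulated data (illustrative assumption): only x1 affects y.
set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 3 * d$x1 + rnorm(100)

full <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- backward_p(full)
print(formula(reduced))
```

The loop stops as soon as every remaining droppable term has a p-value below `alpha`, which is the stopping rule described in the question.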


answered Oct 12 '22 20:10

Joris Meys

Why not try using the step() function specifying your testing method?

For example, backward elimination needs only a single command:

step(FullModel, direction = "backward", test = "F")

and for stepwise selection, simply:

step(FullModel, direction = "both", test = "F")

This displays the F statistics and p-values alongside the AIC values.
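Note that `step()` still chooses terms by AIC; `test = "F"` only adds the test columns to the tables it prints. One way to make it behave roughly like a p-value rule, sketched here as an assumption and not something taken from the answer, is to set the penalty `k` to the chi-square critical value for the chosen alpha. A term with one degree of freedom is then dropped when its likelihood-ratio statistic falls below that critical value. `FullModel` is fitted to simulated data for illustration, and `k_alpha` is a hypothetical name.

```r
# Simulated data (illustrative assumption): only x1 affects y.
set.seed(7)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 3 * d$x1 + rnorm(100)
FullModel <- lm(y ~ x1 + x2 + x3, data = d)

alpha <- 0.05
# Assumption: penalty matching a 1-df likelihood-ratio test at level alpha.
k_alpha <- qchisq(alpha, df = 1, lower.tail = FALSE)

# trace = 0 suppresses the step-by-step printout; test = "F" is passed on
# to drop1() and only affects the tables shown when trace is positive.
sel <- step(FullModel, direction = "backward", test = "F",
            k = k_alpha, trace = 0)

print(formula(sel))
print(sel$anova)   # record of the terms removed at each step
```

This is only an approximation of selection on F-test p-values, and for terms with more than one degree of freedom the correspondence no longer holds.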

answered Oct 12 '22 19:10

leonie
