 

What is the max number of variables one can use in an exhaustive all-subsets regression using glmulti()?

I am using the glmulti package in R to run an all-subsets regression on some data. I have 51 predictors, each with at most 276 observations. The exhaustive and genetic algorithm approaches cannot handle this many variables, as I receive the following warning:

Warning message:
In glmulti(y = "Tons_N", data = MDatEB1_TonsN, level = 1, method = "h",  :
  !Too many predictors.

With data like this (many variables, many observations), how many predictors can I use in a single run of the all-subsets regression? I am looking into variable elimination techniques, but I would like to use as many variables as possible at this stage of the analysis; that is, I want to use the results of this analysis to make variable elimination decisions. Is there another package that can process more variables at a time?

Here is the code I am using. Unfortunately, because the project is confidential, I cannot attach the dataset.

TonsN_AllSubset <- glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 1,
                           method = "h", crit = "aic", confsetsize = 20,
                           plotty = TRUE, report = TRUE, fitfunction = "glm")

I am relatively new to this package and modeling in general. Any direction or advice will be greatly appreciated. Thank you!

asked Sep 19 '25 by user2701157

2 Answers

glmulti is not restricted by the number of predictors as such, but by the number of candidate models. With 51 main effects, an exhaustive search would have to enumerate 2^51 (roughly 2.3 × 10^15) candidate models, which is computationally infeasible.

By setting the argument method = "d", glmulti will only compute and report the number of candidate models instead of fitting them. This takes considerably less time than running glmulti with method = "h" or method = "g".

If the number of predictors is too high, you will get the same error message. This way, you can experimentally determine the maximum number of predictors glmulti can handle within a reasonable computing time.

However, keep in mind that the maximum number of possible predictors depends strongly on whether you allow for interactions or not.
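For example, a quick diagnostic run might look like this (a minimal sketch reusing the Tons_N response and MDatEB1_TonsN data frame from the question):

# Diagnostic run: method = "d" only counts the candidate models
# instead of fitting them, so it returns quickly.
glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 1,
        method = "d", fitfunction = "glm")

# With pairwise interactions (level = 2) the candidate set grows much
# faster, so this diagnostic may already report too many models.
glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 2,
        method = "d", fitfunction = "glm")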

Furthermore, you can limit the number of candidate models by restricting the number of predictors per model (e.g. minsize = 0, maxsize = 1), by excluding specific predictors (exclude = c(...)), or by excluding terms in the model formula (y ~ a + b + c - a:b - 1 excludes the intercept and the interaction a:b). You can find even more options for limiting the number of candidate models in the package documentation, glmulti.pdf.
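As a sketch of a constrained call (the excluded column name SomeVar is hypothetical; substitute one of your own predictors):

# Constrained exhaustive search: at most 5 terms per model, and one
# predictor dropped from all candidate models (SomeVar is a placeholder).
TonsN_small <- glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 1,
                       method = "h", crit = "aic", confsetsize = 20,
                       minsize = 0, maxsize = 5,
                       exclude = c("SomeVar"),
                       fitfunction = "glm")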

answered Sep 21 '25 by MsGISRocker


The glmnet package provides facilities for penalized modeling without the statistically flawed strategy of stepwise selection. (There seems to be widespread acceptance of the fallacious argument that using AIC protects one from problems of multiple comparisons.) It is incredibly easy to "find" statistically significant relations where there are none.
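As a minimal sketch of the penalized route (this assumes all predictors in MDatEB1_TonsN are numeric, since glmnet requires a numeric matrix; the object names are illustrative):

library(glmnet)

# glmnet takes a numeric predictor matrix and a response vector.
x <- as.matrix(MDatEB1_TonsN[, setdiff(names(MDatEB1_TonsN), "Tons_N")])
y <- MDatEB1_TonsN$Tons_N

# Cross-validated lasso (alpha = 1): the penalty shrinks coefficients
# and sets many exactly to zero, so selection and estimation happen in
# one step rather than through a stepwise search.
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.1se")   # surviving (nonzero) predictors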

Here is what happens when stepwise selection (following BabakP's suggestion) is run on a random set of predictors that have no true relationship to the response:

# Simulate a binary "response" and 50 pure-noise predictors: every
# predictor column is standard-normal noise regardless of the response,
# so there is no true relationship to find. No seed is set, so exact
# numbers vary between runs.
pseudodata <- data.frame(matrix(NA, nrow = 276, ncol = 51))
pseudodata[, 1] <- rbinom(nrow(pseudodata), 1, 0.3)

n1 <- length(which(pseudodata[, 1] == 1))
n0 <- length(which(pseudodata[, 1] == 0))
for (i in 2:ncol(pseudodata)) {
    pseudodata[, i] <- ifelse(pseudodata[, 1] == 1, rnorm(n1), rnorm(n0))
}

# Fit the full (gaussian) model, then let stepwise selection pick terms.
model <- glm(pseudodata[, 1] ~ ., data = pseudodata[-1])
stepwise.model <- step(model, direction = "both", trace = FALSE)

> summary(stepwise.model)

Call:
glm(formula = pseudodata[, 1] ~ X4 + X6 + X10 + X17 + X21 + X23 + 
    X25 + X29 + X32 + X37 + X41 + X48 + X50 + X19, data = pseudodata[-1])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6992  -0.2943  -0.1154   0.3663   0.9833  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.25674    0.02561  10.025  < 2e-16 ***
X4          -0.03573    0.02394  -1.493 0.136727    
X6          -0.05045    0.02608  -1.934 0.054141 .  
X10          0.05873    0.02744   2.141 0.033235 *  
X17         -0.06325    0.02520  -2.510 0.012668 *  
X21          0.06420    0.02504   2.564 0.010906 *  
X23         -0.04961    0.02845  -1.744 0.082353 .  
X25          0.03863    0.02517   1.535 0.126035    
X29          0.04889    0.02381   2.054 0.041020 *  
X32         -0.03669    0.02509  -1.462 0.144841    
X37          0.09682    0.02507   3.862 0.000142 ***
X41         -0.05253    0.02676  -1.963 0.050704 .  
X48         -0.06660    0.02279  -2.922 0.003782 ** 
X50         -0.06955    0.02624  -2.651 0.008517 ** 
X19         -0.04090    0.02701  -1.514 0.131137    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1674429)

    Null deviance: 55.072  on 275  degrees of freedom
Residual deviance: 43.703  on 261  degrees of freedom
AIC: 306.59

Number of Fisher Scoring iterations: 2
answered Sep 21 '25 by IRTFM