bestglm alternatives for dataset with many variables

Tags: r

R version 2.15.0 (2012-03-30), RStudio 0.96.316, Win XP (last update)

I have a dataset with 40 variables and 15,000 observations. I would like to use bestglm to search for good candidate models (logistic regression). I've tried bestglm, but it doesn't work for a medium-sized dataset like this. After several trials, I think bestglm fails when there are more than approximately 30 variables, at least on my computer (4 GB RAM, dual core).

You can test the limits of bestglm yourself:

library(bestglm)

bestBIC_test <- function(number_of_vars) {

  # Simulate a data frame with 100 rows; the last column will be the response
  glm_sample <- as.data.frame(matrix(rnorm(100 * number_of_vars), 100))

  # Turn the last column into a 1/0 variable by splitting at its mean
  y <- glm_sample[, number_of_vars]
  glm_sample[, number_of_vars] <- as.integer(y > mean(y))

  # Try to calculate the best model (exhaustive search, BIC)
  bestBIC <- bestglm(glm_sample, IC = "BIC", family = binomial)
  invisible(bestBIC)

}

# Test bestglm with increasing number of variables
bestBIC_test(10) # OK, running
bestBIC_test(20) # OK, running
bestBIC_test(25) # OK, running
bestBIC_test(28) # Error: cannot allocate vector of size 1024.0 Mb
bestBIC_test(30) # Error: cannot allocate vector of size 2.0 Gb
bestBIC_test(40) # Error in rep(-Inf, 2^p) : invalid 'times' argument

Are there any alternatives I can use in R to search for possible good models?

Tomas Greif asked Dec 16 '22 20:12



1 Answer

Well, for starters, an exhaustive search for the best subset of 40 variables means considering 2^40 candidate models, which is over a trillion. That is very likely your issue.

Exhaustive best-subset search is generally not considered practical for more than 20 or so variables.
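To put numbers on that growth, here is a small sketch in base R. It counts the subsets for the values of p tried in the question and, assuming a single double vector of length 2^p is needed (as the `rep(-Inf, 2^p)` in the error message suggests), estimates the memory for that one vector. What else bestglm allocates internally is not covered here.

```r
# Number of candidate subsets for p predictors, and the size of ONE
# double-precision vector of length 2^p (8 bytes per element).
p <- c(10, 20, 25, 28, 30, 40)
n_models <- 2^p
gib_per_vector <- n_models * 8 / 2^30

print(data.frame(p = p, n_models = n_models, gib_per_vector = gib_per_vector))
```

Each extra variable doubles both columns, so a machine with 4 GB of RAM runs out of room somewhere in the high twenties.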

A better bet is something like forward stepwise selection, which fits at most (40^2+40)/2 = 820 models.
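A minimal sketch of forward selection with base R's `step()`, on simulated data. The data, the sample size and the "true" coefficients on V1 and V2 are made up for illustration; `k = log(n)` makes `step()` use BIC instead of its default AIC.

```r
set.seed(1)
n <- 500
p <- 40

# Simulated predictors V1..V40; hypothetical truth: only V1 and V2 matter
X <- as.data.frame(matrix(rnorm(n * p), n))
X$y <- rbinom(n, 1, plogis(1.5 * X$V1 - 1.0 * X$V2))

# Start from the intercept-only model and allow any of the 40 predictors in
null_model <- glm(y ~ 1, data = X, family = binomial)
upper <- reformulate(paste0("V", 1:p))  # one-sided formula ~ V1 + ... + V40

fwd <- step(null_model,
            scope = list(lower = ~ 1, upper = upper),
            direction = "forward",
            k = log(n),   # BIC penalty
            trace = 0)

print(formula(fwd))
```

Keep in mind that stepwise selection is greedy, so it is not guaranteed to find the same model an exhaustive search would.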

Or even better (best, in my opinion): regularized logistic regression using the lasso via the glmnet package.
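A rough sketch of the lasso approach, assuming the glmnet package is installed. The data are simulated and the effects on V1 and V2 are made up; `cv.glmnet()` chooses the penalty by cross-validation, and the predictors with non-zero coefficients are the "selected" ones.

```r
library(glmnet)

set.seed(1)
n <- 500
p <- 40

# glmnet wants a numeric matrix of predictors and a separate response vector
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("V", 1:p)
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))  # hypothetical truth

# alpha = 1 is the lasso penalty; lambda is picked by 10-fold cross-validation
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the more conservative "lambda.1se" choice
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(coefs)[coefs != 0], "(Intercept)")
print(selected)
```

With 15,000 rows and 40 columns this should run in seconds, since no subsets are enumerated at all.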

Good overview here.

Dirk Calloway answered Jan 11 '23 05:01
