R version 2.15.0 (2012-03-30) RStudio 0.96.316 Win XP, last update
I have a dataset with 40 variables and 15,000 observations, and I would like to use bestglm to search for good candidate models (logistic regression). I've tried bestglm, but it doesn't work for a medium-sized dataset like this. After several trials, I think bestglm fails once there are more than roughly 30 variables, at least on my machine (4 GB RAM, dual core).
You can probe bestglm's limits yourself:
library(bestglm)
bestBIC_test <- function(number_of_vars) {
  # Simulate a data frame for logistic regression
  glm_sample <- as.data.frame(matrix(rnorm(100 * number_of_vars), 100))
  # Turn the last column into a 1/0 response (bestglm expects the response last)
  glm_sample[, number_of_vars][glm_sample[, number_of_vars] > mean(glm_sample[, number_of_vars])] <- 1
  glm_sample[, number_of_vars][glm_sample[, number_of_vars] != 1] <- 0
  # Try to find the best model
  bestBIC <- bestglm(glm_sample, IC = "BIC", family = binomial)
}
# Test bestglm with increasing number of variables
bestBIC_test(10) # OK, running
bestBIC_test(20) # OK, running
bestBIC_test(25) # OK, running
bestBIC_test(28) # Error: cannot allocate vector of size 1024.0 Mb
bestBIC_test(30) # Error: cannot allocate vector of size 2.0 Gb
bestBIC_test(40) # Error in rep(-Inf, 2^p) : invalid 'times' argument
Are there any alternatives I can use in R to search for possible good models?
Well, for starters, an exhaustive search for the best subset of 40 variables requires fitting 2^40 models, which is over a trillion. That is likely your issue: the memory bestglm needs grows exponentially with the number of predictors, which is why your allocation errors double in size between 28 and 30 variables.
Exhaustive best-subsets search is generally considered impractical beyond 20 or so variables.
A better bet is something like forward stepwise selection, which fits at most (40^2 + 40)/2 = 820 models.
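As a sketch of forward stepwise selection in base R with step() (on simulated stand-in data, since your real data isn't shown; the names d, y, fwd are mine):

```r
set.seed(1)
# Simulated stand-in: 40 predictors, binary response driven by V1 and V2
d <- as.data.frame(matrix(rnorm(1000 * 40), 1000))
d$y <- rbinom(1000, 1, plogis(d$V1 - d$V2))

null_model <- glm(y ~ 1, data = d, family = binomial)
full_model <- glm(y ~ ., data = d, family = binomial)
# Forward selection from the intercept-only model;
# k = log(n) makes step() use a BIC penalty instead of the default AIC
fwd <- step(null_model, scope = formula(full_model),
            direction = "forward", k = log(nrow(d)), trace = 0)
summary(fwd)
```

With the BIC penalty this should keep the search to a few hundred model fits rather than 2^40.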
Or even better (best, in my opinion): regularized logistic regression using the lasso, via the glmnet package.
Good overview here.
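A minimal glmnet sketch, assuming your predictors are in a numeric matrix x and the 0/1 response in y (simulated here for illustration); cv.glmnet chooses the penalty by cross-validation:

```r
library(glmnet)
set.seed(1)
# Simulated stand-in for the real data: 15,000 observations, 40 predictors
x <- matrix(rnorm(15000 * 40), 15000)
y <- rbinom(15000, 1, plogis(x[, 1] - x[, 2]))
# alpha = 1 is the lasso; cross-validation selects lambda
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# Coefficients at the more parsimonious lambda; zeroed rows are dropped variables
coef(cvfit, s = "lambda.1se")
```

Unlike subset search, the lasso fits a single regularization path, so 40 variables and 15,000 rows are no problem on your hardware.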