I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed effect variables. Fixed effect var A has 3,233 levels, var B has 2,326 levels, and var C has 811 levels, so all in all I have 6,370 fixed effects. The data is cross-sectional. I can't run this regression using the normal glm function because the regression matrix seems too large for my memory (I get the message "Error: cannot allocate vector of size 22.9 Gb"). I am looking for alternative ways to run this regression on my MacBook Air (OS X 10.9.5, 8 GB RAM). I also have access to a server with 16 GB RAM.
I have tried solving the issue in a few different ways, but so far none has led to satisfactory results:
lfe/felm:
Using the felm regression function of the lfe package, which subtracts the fixed effects before running the regression. This works perfectly and allowed me to run the above regression as a normal linear model in just a few minutes. However, lfe does not support logistic regressions or other glms, so felm was great for getting an idea about model fit for different models but doesn't work for the final logistic regression models.
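For reference, a minimal sketch of the felm() call pattern I mean (y, x, A, B and C are placeholder column names standing in for my actual outcome, regressor and fixed effect variables):
library(lfe)
# A, B and C go after the "|" so felm() absorbs the fixed effects
# instead of estimating thousands of dummy coefficients.
lpm_fit <- felm(y ~ x | A + B + C, data = df)
summary(lpm_fit)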
biglm/bigglm:
I thought about using bigglm to break my data into more manageable chunks. However, several sources (e.g. link1, link2, link3) mention that for this to work, factor levels need to be consistent across chunks, i.e. each chunk must contain at least one observation of each level of each factor variable. Factors A and B contain levels that appear only once, so I can't split the data into chunks with consistent levels. If I delete 10 levels of fixed effect A and 8 levels of B (a minor change), only levels with 4+ observations will remain, and splitting my data into 4 chunks would already make it a lot more manageable. However, I still need to figure out how to sort my data frame so that the 480,000 entries end up in 4 chunks in which every level of each of the 3 factors appears at least once.
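For what it's worth, here is a minimal sketch of how I understand the chunk-feeding side of bigglm() to work, assuming the data frame df has already been ordered so that every chunk contains every factor level (make_chunk_reader is just an illustrative helper name, not part of biglm):
library(biglm)

# Returns a function that hands bigglm() successive chunks of df,
# NULL when the data is exhausted, and rewinds when reset = TRUE.
make_chunk_reader <- function(df, chunk_size = 120000) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)
    idx <- (pos + 1):min(pos + chunk_size, nrow(df))
    pos <<- max(idx)
    df[idx, ]
  }
}

fit <- bigglm(y ~ A + B + C, data = make_chunk_reader(df),
              family = binomial())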
GlmmGS/glmmGS:
The glmmGS function in the package with the same name performs a fixed-effects subtraction like the lfe package, but for logistic regressions, using a Gauss-Seidel algorithm. Unfortunately, the package is no longer being developed. Being relatively new to R and having no deep background in statistics, I can't make sense of the output and have no idea how to transform it into the normal "effect size", "model fit", "significance interval" indicators that glm regression summaries provide.
I sent a message to the package's authors. They kindly responded as follows:
The package provides no output in the same format as a glm object. However, you can easily calculate most of the fit statistics (standard error of the estimates, goodness of fit) given the current output (in the CRAN version, I believe that the current output is a vector of estimates of the coefficients and the associated vector of standard errors; same for the covariance components, but you need not worry about them if you are fitting a model without random effects). Only beware that the covariance matrices used to calculate the standard errors are the inverse of the diagonal blocks of the precision matrix associated with the Gauss-Seidel algorithm, and so they tend to underestimate the standard errors of the joint likelihood. I am not maintaining the package any longer and I do not have time to get into the specific details; the seminal theory behind the package can be found in the paper referenced in the manual, everything else needs to be worked out by you with pen and paper :).
If anyone can explain how to "easily calculate most of the fit statistics" in a way that someone without any formal education in statistics can understand (might be impossible), or provide R code that shows an example of how this could be done, I would be much obliged!
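In case it helps, here is a minimal sketch of the kind of arithmetic the authors seem to be describing, assuming you have already extracted a vector of coefficient estimates beta and a matching vector of standard errors se from the glmmGS output (the object names are placeholders, and the authors' caveat about underestimated standard errors still applies):
z <- beta / se                      # Wald z-statistics
p <- 2 * pnorm(-abs(z))             # two-sided p-values
or <- exp(beta)                     # effect sizes as odds ratios
ci_low  <- exp(beta - 1.96 * se)    # approximate 95% confidence interval
ci_high <- exp(beta + 1.96 * se)    # on the odds-ratio scale
round(cbind(beta, se, z, p, or, ci_low, ci_high), 4)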
Revolution Analytics:
I installed Revolution Analytics Enterprise on a virtual machine running Windows 7 on my Mac. The program has a function called rxLogit that is optimized for large logistic regressions. Using rxLogit I get the error (Failed to allocate 326554568 bytes. Error in rxCall("RxLogit", params) : bad allocation), so that function also seems to run into memory issues. However, the software enables me to run my regression on a distributed computing cluster, so I could just "kill the problem" by purchasing computing time on a cluster with lots of memory. Still, I wonder whether Revolution Analytics provides any formulas or methods that I don't know of that would allow me to do some kind of lfe-like fixed-effects subtraction or a bigglm-like chunking operation that takes the factors into account.
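For completeness, the pattern I understand RevoScaleR to intend for data that does not fit in memory is to work from an on-disk .xdf file rather than an in-memory data frame; a rough sketch (the file name and variable names are placeholders, and I have not verified that this avoids the allocation error above):
# Write the data to RevoScaleR's on-disk .xdf format, then point the
# regression at the file so it is processed in blocks.
rxImport(inData = df, outFile = "reg_data.xdf", overwrite = TRUE)
fit <- rxLogit(y ~ A + B + C, data = "reg_data.xdf")
summary(fit)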
MatrixModels/glm4:
One person suggested I use the glm4 function of the MatrixModels package with the sparse = TRUE argument to speed up the calculation. If I run a glm4 regression with all fixed effects, I get an "Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed" error. If I run it with only fixed effect variable B, or with only A and C, the calculation works and returns a "glpModel" object. As with glmmGS, I have trouble turning that output into a form that makes sense to me, since the standard summary() method does not seem to work on it.
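For anyone hitting the same wall, a hedged sketch of pulling the raw numbers out of the "glpModel" object, based on the resp/pred slot layout described in the MatrixModels class documentation (slot names may differ across package versions, and this gives no standard errors):
library(MatrixModels)

fit <- glm4(y ~ B, data = df, family = binomial(), sparse = TRUE)

beta <- fit@pred@coef     # coefficient estimates on the log-odds scale
exp(beta)                 # odds ratios
head(fit@resp@mu)         # fitted probabilities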
I would be happy for advice on any of the issues mentioned above, or for completely different approaches to running logistic regressions with multiple large fixed effects in R under memory constraints.
Fixed effects logistic regression is a conditional model, also referred to as a subject-specific model, as opposed to a population-averaged model. Fixed effects logistic regression models can control for all time-invariant characteristics of the individuals.
The fixed effects panel model can be estimated using dummy variables for each entity (for cross-sectional fixed effects) or for each time period (for time fixed effects); this is known as the least squares dummy variable (LSDV) approach.
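A minimal sketch of the LSDV idea in R, just to make the mechanics concrete (entity, period, y and x are illustrative names; with thousands of levels this explicit-dummy formulation is exactly what exhausts memory, which is why absorption-based approaches exist):
# One dummy per entity and per time period, estimated explicitly:
lsdv_fit <- glm(y ~ x + factor(entity) + factor(period),
                data = panel_df, family = binomial)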
Researchers should feel secure using either fixed- or random-effects models under standard conditions, as dictated by the practical and theoretical aspects of a given application. Either way, both approaches are strictly preferable to the pooled model.
Check out glmmboot{glmmML}:
http://cran.r-project.org/web/packages/glmmML/glmmML.pdf
There is also a nice document by Brostrom and Holmberg (http://cran.r-project.org/web/packages/eha/vignettes/glmmML.pdf).
Here is the example from their document:
dat <- data.frame(y = rbinom(5000, size = 1, prob = 0.5),
                  x = rnorm(5000),
                  group = rep(1:1000, each = 5))
# Standard glm with one dummy per group: slow and memory-hungry
fit1 <- glm(y ~ factor(group) + x, data = dat, family = binomial)
# glmmboot conditions the group effects out instead of estimating dummies
require(glmmML)
fit2 <- glmmboot(y ~ x, cluster = group, data = dat)
The computing time difference is "huge"!
For posterity, I'd also like to recommend the package speedglm, which I have found useful when trying to perform logistic regression on large data sets. It seems to use about half as much memory and finishes a lot quicker than glm.
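A minimal sketch of the call pattern, using placeholder names for the outcome and the three fixed-effect factors:
library(speedglm)
fit <- speedglm(y ~ A + B + C, data = df, family = binomial())
summary(fit)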