 

Large fixed effects binomial regression in R

I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed effect variables. Fixed effect var A has 3233 levels, var B has 2326 levels, var C has 811 levels. So all in all I have 6370 fixed effects. The data is cross-sectional. I can't run this regression using the normal glm function because the regression matrix seems too large for my memory (I get the message "Error: cannot allocate vector of size 22.9 Gb"). I am looking for alternative ways to run this regression on my MacBook Air (OS X 10.9.5, 8 GB RAM). I also have access to a server with 16 GB RAM.

I have tried solving the issue in a few different ways, but so far none has led to satisfactory results:

lfe/felm: Using the felm regression function of the lfe package, which subtracts the fixed effects before running the regression. This works perfectly and allowed me to run the above regression as a normal linear model in just a few minutes. However, lfe does not support logistic regressions or other GLMs. So felm was great for getting an idea about model fit for different models, but it doesn't work for the final logistic regression models.
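For reference, a minimal sketch of what such a felm call looks like, assuming a data frame df with outcome y, a continuous covariate x, and factor columns A, B and C (all names are placeholders, not the actual data):

library(lfe)
# the part of the formula after | lists the factors whose fixed effects
# felm projects out before estimating the remaining coefficients
fit_lpm <- felm(y ~ x | A + B + C, data = df)
summary(fit_lpm)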

biglm/bigglm: I thought about using bigglm to break my data into more manageable chunks. However, several sources (e.g. link1, link2, link3) mention that for this to work, factor levels need to be consistent across chunks, i.e. each chunk must contain at least one observation of each level of each factor variable. Factors A and B contain levels that only appear once, so I can't split the data into chunks with consistent levels. If I delete 10 levels of fixed effect A and 8 levels of B (a minor change), I will only have factor levels that appear 4+ times left, and splitting my data into 4 chunks would already make it a lot more manageable. However, I still need to figure out how to sort my df in a way that ensures that my 480,000 entries end up in 4 chunks in which each level of each of the 3 factors appears at least once.
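A rough sketch of how one could at least test whether a given chunk assignment meets that requirement (df, the factor columns A, B, C and the chunk column are hypothetical names, and the random assignment is only a starting point, not a solution to the sorting problem):

# randomly assign rows to 4 chunks, then verify that every level of every
# factor appears in every chunk before passing the chunks to bigglm()
df$chunk <- sample(rep(1:4, length.out = nrow(df)))
covers_all_levels <- function(fac, chunk) {
  all(sapply(split(fac, chunk), function(x) all(unique(fac) %in% x)))
}
sapply(df[c("A", "B", "C")], covers_all_levels, chunk = df$chunk)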

GlmmGS/glmmgs: The glmmgs function in the package of the same name performs a fixed-effects subtraction, similar to the lfe package, for logistic regressions using a "Gauss-Seidel" algorithm. Unfortunately, the package is no longer being developed. Being relatively new to R and having no deep background in statistics, I can't make sense of the output and have no idea how to transform it into something that would give me the usual "effect size", "model fit" and "significance interval" indicators that glm regression summaries provide.

I sent a message to the package's authors. They kindly responded as follows:

The package provides no output in the same format as a glm object. However, you can easily calculate most of the fit statistics (standard errors of the estimates, goodness of fit) from the current output (in the CRAN version, I believe the current output is a vector of coefficient estimates and the associated vector of standard errors; the same goes for the covariance components, but you need not worry about them if you are fitting a model without random effects). Only beware that the covariance matrices used to calculate the standard errors are the inverse of the diagonal blocks of the precision matrix associated with the Gauss-Seidel algorithm, and so they tend to underestimate the standard errors of the joint likelihood. I am not maintaining the package any longer and I do not have time to get into the specific details; the seminal theory behind the package can be found in the paper referenced in the manual, everything else needs to be worked out by you with pen and paper :).

If anyone can explain how to "easily calculate most of the fit statistics" in a way that someone without any education in statistics can understand (might be impossible), or can provide R code showing an example of how this could be done, I would be much obliged!
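Not an answer from the package authors, but a rough illustration of what such a calculation could look like: assuming est and se are the vectors of coefficient estimates and standard errors returned by glmmgs (the object and element names are placeholders), the usual Wald-type summary quantities can be computed by hand. Keep in mind the authors' caveat above that these standard errors tend to be underestimated.

z  <- est / se                               # Wald z-statistics
p  <- 2 * pnorm(-abs(z))                     # two-sided p-values
ci <- cbind(lower = est - 1.96 * se,         # approximate 95% confidence
            upper = est + 1.96 * se)         # intervals on the logit scale
or <- exp(est)                               # effect sizes as odds ratios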

Revolution Analytics: I installed Revolution Analytics Enterprise on a virtual machine that simulates Windows 7 on my Mac. The program has a function called rxLogit that is optimized for large logistic regressions. Using the rxLogit function I get the error (Failed to allocate 326554568 bytes. Error in rxCall("RxLogit", params) : bad allocation), so that function also seems to run into memory issues. However, the software enables me to run my regression on a distributed computing cluster, so I could just "kill the problem" by purchasing computing time on a cluster with lots of memory. However, I wonder if the Revolution Analytics program provides any formulas or methods that I don't know of that would allow me to do some kind of lfe-like fixed-effects subtraction or bigglm-like chunking that takes the factors into account.
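For completeness, the call that produced the allocation error looked roughly like this (df and the column names are placeholders; no RevoScaleR-specific options are shown because I am not certain of them):

library(RevoScaleR)
# plain rxLogit call with all fixed effects entered as factors
fit <- rxLogit(y ~ x + A + B + C, data = df)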

MatrixModels/glm4: One person suggested I use the glm4 function of the MatrixModels package with the sparse = TRUE argument to speed up the calculation. If I run a glm4 regression with all fixed effects, I get an "Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed" error. If I run it with only fixed effect variable B, or with A and C, the calculation works and returns a "glpModel" object. As with glmmGS, I have trouble turning that output into a form that makes sense to me, since the standard summary() method does not seem to work on it.
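For illustration, the kind of call meant here, plus one way to pull the coefficients out of the resulting glpModel object; the column names are placeholders, and the slot access is based on my reading of the glpModel class definition, so treat it as an assumption rather than a documented interface:

library(MatrixModels)
# sparse = TRUE builds a sparse model matrix, which keeps the memory
# footprint of the thousands of dummy columns manageable
fit <- glm4(y ~ x + A + B + C, family = binomial, data = df, sparse = TRUE)
beta <- fit@pred@coef   # coefficient vector stored in the prediction module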

I would be grateful for advice on any of the issues mentioned above, or on completely different approaches for running logistic regressions with multiple large fixed effects in R under memory constraints.

asked Feb 20 '15 by Phil




2 Answers

Check out glmmboot in the glmmML package:

http://cran.r-project.org/web/packages/glmmML/glmmML.pdf

There is also a nice document by Brostrom and Holmberg (http://cran.r-project.org/web/packages/eha/vignettes/glmmML.pdf)

Here is the example from their document:

dat <- data.frame(y = rbinom(5000, size = 1, prob = 0.5),
                  x = rnorm(5000), group = rep(1:1000, each = 5))
fit1 <- glm(y ~ factor(group) + x, data = dat, family = binomial)

require(glmmML)
fit2 <- glmmboot(y ~ x, cluster = group, data = dat)

The computing time difference is "huge"!

answered Sep 25 '22 by Hamed


For posterity, I'd also like to recommend the package speedglm, which I have found useful when trying to perform logistic regression on large data sets. It seems to use about half as much memory and finishes a lot quicker than glm.
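A minimal sketch of the usage, assuming the same hypothetical data frame df with outcome y, covariate x and factors A, B, C:

library(speedglm)
# drop-in replacement for glm(); the family must be passed as a family object
fit <- speedglm(y ~ x + A + B + C, data = df, family = binomial())
summary(fit)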

answered Sep 23 '22 by ecksc