I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.
I am trying to fit a logistic regression with approximately 1500 parameters.
R is using 7% CPU and has 60+GB memory and is still taking a very long time.
Here is the code:
glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))),
family = binomial(logit), data = df[1:150000,])
Any suggestions to speed this up by a significant amount?
Although a bit late but I can only encourage dickoa's suggestion to generate a sparse model matrix using the Matrix package and then feeding this to the speedglm.wfit function. That works great ;-) This way, I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
There are a couple packages to speed up glm
fitting. fastglm has benchmarks showing it to be even faster than speedglm
.
You could also install a more performant BLAS library on your computer (as Ben Bolker suggests in comments), which will help any method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With