
Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10,000 rows) gives a rather complex model with 5 interaction terms: Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5. The glm() function could fit this model on the 10,000-row sample, but not on the whole dataset (320,000 rows).

Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():

fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5, 
       data=sqlQuery(myconn,train_dat),family=binomial(link="logit"), 
       chunksize=1000, maxit=10)

Error in coef.bigqr(object$qr) : 
NA/NaN/Inf in foreign function call (arg 3)

> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D, 
    bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar), 
    ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)

bigglm was able to fit a smaller model with fewer interaction terms, but it could not fit the full model even on the small dataset (10,000 rows).

Has anyone run into this problem before? Is there another approach to fitting a complex logistic model on big data?

ybeybe asked Jun 19 '14 22:06


2 Answers

I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.

bigglm processes data in chunks, and the default chunk size is 5000. If your categorical variable has, say, 5 levels (a, b, c, d, e) and your first chunk (rows 1:5000) contains only (a, b, c, d) but no "e", you will get this error.

What you can do is increase the "chunksize" argument and/or reorder your data frame so that each chunk contains ALL the levels.
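To see whether this is what is happening, you can check the chunks yourself before calling bigglm. The following is only a minimal sketch in base R: `chunks_missing_levels` is a hypothetical helper (it is not part of biglm), and the toy data frame and its column names are made up for illustration. It assumes the data fits in memory as a data frame and that chunks are taken as consecutive blocks of rows.

```r
# Hypothetical helper (not part of biglm): return the indices of the
# chunks that are missing at least one level of a factor column.
chunks_missing_levels <- function(df, factor_col, chunksize = 5000) {
  f <- droplevels(as.factor(df[[factor_col]]))
  # Assign consecutive blocks of `chunksize` rows to chunk 1, 2, ...
  chunk_id <- ceiling(seq_len(nrow(df)) / chunksize)
  # For each chunk, count how many levels never occur in it
  n_missing <- tapply(f, chunk_id, function(x) sum(table(x) == 0))
  as.integer(names(n_missing)[n_missing > 0])
}

set.seed(1)
# Toy data sorted by the factor, so each chunk lacks some levels
toy <- data.frame(g = factor(rep(c("a", "b", "c", "d", "e"), each = 2000)),
                  y = rbinom(10000, 1, 0.5))

# Chunks that would be a problem with the data in its sorted order
bad_sorted <- chunks_missing_levels(toy, "g", chunksize = 5000)

# Shuffling the rows makes it very likely that every chunk sees every level
shuffled <- toy[sample(nrow(toy)), ]
bad_shuffled <- chunks_missing_levels(shuffled, "g", chunksize = 5000)
```

With rare levels, shuffling alone may not be enough, and a larger chunksize is the safer choice.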

Hope this helps (at least somebody).

Jaroslaw Piskorski answered Nov 14 '22 12:11


OK, so we were able to find the cause of this problem:

For one category in one of the interaction terms, there are no observations. The glm() function was able to run and report NA as the estimated coefficient, but bigglm doesn't like it. bigglm was able to fit the model once I dropped this interaction term.
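A quick way to spot this kind of empty cell is to cross-tabulate the factors in each interaction and to look for NA coefficients in a glm() fit on a sample. The sketch below uses base R only; the toy data, the factor levels, and the deliberately emptied (c, v) cell are all assumptions made for illustration and are not the original data.

```r
set.seed(42)
n <- 2000
# Toy data with two factors; names mirror the question, values are made up
toy <- data.frame(
  X2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  X3 = factor(sample(c("u", "v"), n, replace = TRUE))
)
toy$Y <- rbinom(n, 1, 0.4)
# Remove every row in the (c, v) cell to mimic a combination with no data
toy <- toy[!(toy$X2 == "c" & toy$X3 == "v"), ]

# Cross-tabulate the two factors; a zero marks a combination with no data
cells <- table(toy$X2, toy$X3)
empty_cells <- which(cells == 0, arr.ind = TRUE)

# glm() still fits and reports NA for the coefficient it cannot estimate
fit <- glm(Y ~ X2 * X3, data = toy, family = binomial(link = "logit"))
na_terms <- names(coef(fit))[is.na(coef(fit))]
```

Any term listed in `na_terms` is a candidate to drop, or its sparse levels a candidate to merge, before passing the formula to bigglm.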

I'll do more research on how to deal with this kind of situation.

ybeybe answered Nov 14 '22 14:11