When running the cv.glmnet function from the R glmnet package on large sparse datasets, I often get the following error:
# Error: Matrices must have same number of columns in .local(x, y, ...)
I have replicated the error with randomly generated data:
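library(glmnet)
set.seed(10)
# 1000 x 5 binary design matrix; the first column has a single nonzero entry
X <- matrix(rbinom(5000, 1, 0.1), nrow=1000, ncol=5)
X[, 1] <- 0
X[1, 1] <- 1
# Response: the first 20 observations are positives
Y <- rep(0, 1000)
Y[c(1:20)] <- 1
model <- cv.glmnet(x=X, y=Y, family="binomial", alpha=0.9, standardize=TRUE,
                   nfolds=4)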
This might be related to the initial variable screening (based on the inner product of X and Y). Instead of fixing the coefficient to zero, glmnet drops the variable from the X matrix, and this is done separately for each cross-validation fold. If the variable is dropped in some folds and kept in others, the error appears.
Sometimes increasing nfolds helps. This is in line with the hypothesis, since a higher nfolds means larger training subsets and a smaller chance of the variable being dropped in any of them.
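One way to probe this hypothesis without calling glmnet is to check, fold by fold, whether any column becomes constant once that fold is held out. The following is only a sketch in base R: the fold assignment `foldid` is drawn here for illustration and is not the one cv.glmnet would generate internally, and "constant in the training rows" is a stand-in for whatever screening glmnet actually applies.

```r
# Same simulated design matrix as in the question.
set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1

# Illustrative fold assignment (an assumption, not cv.glmnet's own draw).
nfolds <- 4
foldid <- sample(rep(seq_len(nfolds), length.out = nrow(X)))

# For each fold k, look at the training rows (everything except fold k)
# and flag the columns that take a single value there.
constant_in_training <- sapply(seq_len(nfolds), function(k) {
  train <- X[foldid != k, , drop = FALSE]
  apply(train, 2, function(col) length(unique(col)) == 1)
})
dimnames(constant_in_training) <- list(paste0("V", 1:5),
                                       paste0("fold", seq_len(nfolds)))
print(constant_in_training)
```

The first column is flagged only for the fold that holds out observation 1, so the training designs differ across folds. Passing the same vector through the foldid argument of cv.glmnet should make it possible to reproduce the error with a known fold layout.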
A few additional notes:
The error appears only for alpha close to 1 (alpha=1 is equivalent to L1 regularization) and only when standardization is used. It does not appear for family="gaussian".
What do you think could be happening?
This example is problematic, because one variable has a single 1 and the rest are zero. This is a case where logistic regression can diverge (if not regularized), since driving that coefficient to infinity (plus or minus depending on the response) will predict that observation perfectly, and not impact anything else.
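The divergence can be seen with a plain, unpenalized logistic regression on that one column. This is a minimal sketch in base R using glm rather than glmnet; the variable names `x1`, `fit` and `slope` are introduced here for illustration only.

```r
# Same simulated data as in the question.
set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1
Y <- rep(0, 1000)
Y[1:20] <- 1

# Unpenalized logistic regression on the first column alone.
# glm() typically warns that fitted probabilities of 0 or 1 occurred;
# the warning is suppressed here so the script runs quietly.
x1 <- X[, 1]
fit <- suppressWarnings(glm(Y ~ x1, family = binomial))

# The slope is not a meaningful estimate: it is simply where the
# iterations stopped on their way toward +Inf.
slope <- unname(coef(fit)["x1"])
print(slope)
```

The reported slope is large and depends on when the fitting iterations stop, which is the usual symptom of a coefficient whose maximum likelihood estimate is infinite.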
Now, the model is regularized, so this should not happen, but it does cause problems. I found that making alpha smaller (toward ridge; 0.5 for this example) made the problem go away.
The real problem here has to do with the lambda sequence used for each fold, but this gets a little technical. I will try to make a fix to cv.glmnet that makes this problem go away.
Trevor Hastie (glmnet maintainer)