I am having a problem running glmnet lasso with a wide dataset. My data has N = 50 but p > 49000, all factors. To run glmnet I have to create a model matrix, BUT I run out of memory when I call model.matrix(formula, data), where formula = Class ~ .
As a worked example, I will generate a dataset:
data <- matrix(rep(0, 50*49000), nrow = 50)
for (i in 1:50) {
    x <- rep(letters[2:8], 7000)
    y <- sample(x = 1:49000, size = 49000)
    data[i, ] <- x[y]
}
data <- as.data.frame(data)
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x = 1:50, size = 50)
class <- x[y]
data <- cbind(data, class)
After that, I tried to create a model matrix to pass to glmnet:
formula <- as.formula(class ~ .)
X <- model.matrix(formula, data)
model <- cv.glmnet(X, class, standardize = FALSE, family = 'multinomial', alpha = 1, nfolds = 10)
In the second step (X <- model.matrix(...)) I run out of memory. What can I do?
I asked Professor Trevor Hastie and received the following advice:
"Hello Flavio
model.matrix is killing you. You will have 49K factors, and model matrix is trying to represent them as contrasts which will be 6 column matrices, so 49*6 approx 300K columns. Why not make binary dummy variables (7 per factor), and simply construct this directly without using model.matrix. You could save 1/7th the space by storing this via sparseMatrix (glmnet accepts sparse matrix formats)"
I did exactly that and it worked perfectly. I think this can be useful to others.
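Hastie's suggestion of building the indicator columns directly, without model.matrix, can be sketched as follows. This is a minimal illustration on toy data; the helper name one_hot_sparse is mine, not from the original advice:

```r
library(Matrix)

## Toy data: three factor columns, five observations
df <- data.frame(f1 = factor(c("b", "c", "b", "d", "c")),
                 f2 = factor(c("e", "e", "f", "f", "e")),
                 f3 = factor(c("g", "h", "g", "h", "g")))

## One-hot encode every factor column straight into a sparse matrix,
## one dummy column per level (no contrasts), as Hastie suggests.
one_hot_sparse <- function(df) {
  blocks <- lapply(df, function(f) {
    f <- as.factor(f)
    sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
                 dims = c(length(f), nlevels(f)),
                 dimnames = list(NULL, levels(f)))
  })
  do.call(cbind, unname(blocks))
}

X <- one_hot_sparse(df)
dim(X)  ## 5 rows, 3 + 2 + 2 = 7 dummy columns
```

Because each factor contributes exactly one 1 per row, the matrix stays extremely sparse, and glmnet accepts it directly as its x argument.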
An article, with code, that came from this problem: http://www.rmining.net/2014/02/25/genetic-data-large-matrices-glmnet/
To avoid broken links, I will post part of it here:
The problem with the formula approach is that, in general, genomic data has more columns than observations. The data I worked with in that case had 40,000 columns and only 73 observations. To create a small test dataset, run the following code:
data <- matrix(rep(0, 50*49000), nrow = 50)
for (i in 1:50) {
    x <- rep(letters[2:8], 7000)
    y <- sample(x = 1:49000, size = 49000)
    data[i, ] <- x[y]
}
data <- as.data.frame(data)
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x = 1:50, size = 50)
class <- x[y]
data <- cbind(data, class)
So, with this dataset we will try to fit a model with glmnet():
formula <- as.formula(class ~ .)
X <- model.matrix(formula, data)
model <- cv.glmnet(X, class, standardize = FALSE, family = 'multinomial', alpha = 1, nfolds = 10)
And if you do not have a computer with more RAM than mine, you will probably run out of memory and crash R. The solution? My first idea was to try sparse.model.matrix(), which creates a sparse model matrix from the same formula. Unfortunately it did not work, because even as a sparse matrix the final model matrix is still too big! Interestingly, this dataset occupies only 24 MB of RAM, but the result of model.matrix is a matrix of more than 1 GB.
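To see how much the storage format alone matters, you can compare object sizes on a scaled-down version of the dataset (2,000 columns instead of 49,000 so it runs quickly; the exact sizes will vary with your R and Matrix versions, so no specific numbers are claimed):

```r
library(Matrix)

set.seed(1)
## Smaller version of the dataset: 50 observations, 2000 factor columns
df <- as.data.frame(replicate(2000, sample(letters[2:8], 50, replace = TRUE)),
                    stringsAsFactors = TRUE)

dense  <- model.matrix(~ ., df)         # contrast coding, dense storage
sparse <- sparse.model.matrix(~ ., df)  # same coding, sparse storage

print(object.size(dense),  units = "MB")
print(object.size(sparse), units = "MB")
```

Both matrices have the same dimensions; only the storage differs, and the sparse version is a small fraction of the dense one's size.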
The solution I found was to build the matrix by hand. To do this we dummy-code the data, column by column, and store the result in a sparse matrix. Then we use that matrix as input to the model and see whether memory holds up:
library(Matrix)

## Create a sparse matrix from the first column
X <- sparse.model.matrix(~ data[, 1] - 1)
for (i in 2:(ncol(data) - 1)) {  ## skip the last column, which holds the class
    ## Check whether the column has more than one level
    if (nlevels(data[, i]) > 1) {
        ## More than one level: apply dummy coding
        coluna <- sparse.model.matrix(~ data[, i] - 1)
        X <- cBind(X, coluna)
    } else {
        ## Single level: transform the factor to numeric
        coluna <- as.numeric(as.factor(data[, i]))
        X <- cBind(X, coluna)
    }
}
NOTE: Pay attention to the fact that we are using a sparse matrix, so the Matrix package is required. Also note that the columns are combined using cBind() instead of cbind(). (In recent versions of Matrix, cBind() is deprecated and plain cbind() works directly on sparse matrices.)
The matrix generated this way was much smaller: less than 70 MB when I tested it. Fortunately, glmnet() supports sparse matrices and you can run the model:
mod.lasso <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)
So you can build models with this type of data without blowing up memory and without using R packages for large datasets like bigmemory and ff.
To whom it may concern: I have developed an R package called biglasso that fits lasso-type models with big data. It works with a memory-mapped (big) design matrix based on the bigmemory package and can seamlessly handle data-larger-than-RAM cases. Moreover, it is more computation- and memory-efficient than glmnet thanks to newly proposed feature screening rules and a better implementation. Please check the GitHub page for details, and feel free to provide any suggestions/comments.
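A minimal usage sketch, assuming biglasso and bigmemory are installed from CRAN (note that biglasso does not cover every glmnet family; this example uses binomial, and the simulated data here is mine, not from the package):

```r
library(bigmemory)
library(biglasso)

set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)

## biglasso fits on a memory-mapped big.matrix, so the design
## matrix can live on disk and exceed available RAM.
X.bm <- as.big.matrix(X)
cvfit <- cv.biglasso(X.bm, y, family = "binomial", nfolds = 5)
cvfit$lambda.min  ## lambda selected by cross-validation
```

For truly out-of-core data, the big.matrix would be file-backed (see bigmemory's filebacked.big.matrix) rather than converted from an in-memory matrix as in this toy example.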