
Big matrix to run glmnet()

I am having a problem running a glmnet lasso with a wide data set. My data has N = 50 but p > 49000, all factors. So to run glmnet I have to create a model.matrix, BUT I run out of memory when I call model.matrix(formula, data), where formula = Class ~ .

As a worked example, I will generate a data set:

data <- matrix(rep(0, 50*49000), nrow = 50)
for (i in 1:50) {
    x <- rep(letters[2:8], 7000)
    y <- sample(x = 1:49000, size = 49000)
    data[i,] <- x[y]
}

data <- as.data.frame(data, stringsAsFactors = TRUE)  # columns must be factors; explicit since R >= 4.0 no longer does this by default
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x = 1:50, size = 50)
class <- x[y]
data <- cbind(data, class)

After that I tried to create a model.matrix to pass to glmnet:

formula <- as.formula(class ~ .)
X <- model.matrix(formula, data)
model <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

In the model.matrix step (X = model.matrix ...) I run out of memory. What can I do?

asked Jun 10 '13 by Flavio Barros

People also ask

What is alpha in cv.glmnet()?

The cv.glmnet() function only optimizes over lambda; it assumes that alpha, the parameter that specifies the mix of the ridge and lasso penalties, is fixed. The glmnetUtils package provides a function called cva.glmnet() that will cross-validate over both alpha and lambda simultaneously.

How does cv.glmnet() work?

cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted using nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping groups (folds) of approximately equal size. Each fold in turn serves as the validation set while the model is fit on the other 9 folds.
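As a concrete illustration, the fold assignment can be sketched in base R (a minimal sketch of what cv.glmnet() does internally; cv.glmnet() also accepts a foldid argument if you want to fix the folds yourself):

```r
set.seed(1)
n <- 50
nfolds <- 10

# Assign each of the n observations a fold id 1..10, in random order
foldid <- sample(rep(seq_len(nfolds), length.out = n))

# For fold k: rows with foldid == k form the validation set,
# and the remaining 9 folds form the training set.
for (k in seq_len(nfolds)) {
    val   <- which(foldid == k)
    train <- which(foldid != k)
    # fit the model on `train`, evaluate it on `val` ...
}

table(foldid)  # five observations per fold
```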

How does glmnet choose lambda?

By default cv.glmnet() reports lambda.1se: the largest λ at which the CV error (e.g. MSE) is within one standard error of the minimal error. This usually reduces overfitting by selecting a simpler model (fewer non-zero terms) whose error is still close to that of the best-scoring model.
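The lambda.1se rule can be illustrated with made-up CV results (the lambda grid and the error values cvm/cvsd below are invented for the sketch; cv.glmnet() returns the real ones under the same names):

```r
# Decreasing lambda grid, as glmnet uses (log scale)
lambda <- exp(seq(log(1), log(0.001), length.out = 10))
cvm  <- c(1.00, 0.80, 0.60, 0.50, 0.45, 0.44, 0.46, 0.50, 0.55, 0.60)  # CV error
cvsd <- rep(0.05, 10)                                                  # its standard error

i.min <- which.min(cvm)                  # index of lambda.min: smallest CV error
threshold <- cvm[i.min] + cvsd[i.min]    # one standard error above the minimum

# lambda.1se: the largest lambda whose CV error is within one SE of the minimum
# (the grid is sorted from largest lambda down, so take the first qualifying index)
i.1se <- min(which(cvm <= threshold))
lambda.1se <- lambda[i.1se]
```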

What package is glmnet in?

glmnet is a package that fits generalized linear and similar models via penalized maximum likelihood. The regularization path is computed for the lasso or elastic net penalty at a grid of values (on the log scale) for the regularization parameter lambda.


2 Answers

I asked Professor Trevor Hastie and received the following advice:

"Hello Flavio

model.matrix is killing you. You will have 49K factors, and model.matrix is trying to represent them as contrasts, which will be 6-column matrices, so 49K × 6 ≈ 300K columns. Why not make binary dummy variables (7 per factor) and simply construct this directly without using model.matrix? You could save 1/7th of the space by storing this via sparseMatrix (glmnet accepts sparse matrix formats)."

I did exactly that and it worked perfectly fine. I think it can be useful to others.
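For readers who want the gist of that direct construction, here is a minimal sketch on a tiny made-up data frame (the sizes are illustrative; the real problem has p = 49,000 columns). It builds the one-hot matrix straight from the factor codes with Matrix::sparseMatrix(), never calling model.matrix():

```r
library(Matrix)

set.seed(1)
n <- 5; p <- 3   # tiny, illustrative sizes
toy <- as.data.frame(matrix(sample(letters[2:4], n * p, replace = TRUE),
                            nrow = n), stringsAsFactors = TRUE)

codes  <- sapply(toy, as.integer)   # n x p matrix of factor level codes
nlev   <- sapply(toy, nlevels)      # number of levels per variable
offset <- c(0, cumsum(nlev))[1:p]   # where each variable's dummy block starts

# One nonzero per (row, variable) pair: row i gets a 1 in the dummy
# column corresponding to its level of variable j.
i <- rep(seq_len(n), times = p)
j <- as.vector(codes) + rep(offset, each = n)
X <- sparseMatrix(i = i, j = j, x = 1, dims = c(n, sum(nlev)))
```

The resulting X can then be passed to cv.glmnet() as-is, since glmnet accepts sparse matrices.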

An article, with code, that came from this problem: http://www.rmining.net/2014/02/25/genetic-data-large-matrices-glmnet/

In order to avoid broken links, I will reproduce part of the post here:

The problem with the formula approach is that, in general, genomic data has more columns than observations. The data I worked with in that case had 40,000 columns and only 73 observations. To create a small test data set, run the following code:

data <- matrix(rep(0, 50*49000), nrow = 50)   # initialize the matrix (this line was missing from the original post)
for (i in 1:50) {
    x <- rep(letters[2:8], 7000)
    y <- sample(x = 1:49000, size = 49000)
    data[i,] <- x[y]
}

data <- as.data.frame(data, stringsAsFactors = TRUE)  # columns must be factors; explicit since R >= 4.0 no longer does this by default
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x = 1:50, size = 50)
class <- x[y]
data <- cbind(data, class)

So, with this data set, we will try to fit a model with glmnet():

formula <- as.formula(class ~ .)
X <- model.matrix(formula, data)
model <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

And if you do not have a computer with more RAM than mine, you will probably run out of memory and R will crash. The solution? My first idea was to try sparse.model.matrix(), which creates a sparse model matrix from the same formula. Unfortunately it did not work, because even as a sparse matrix, the final model matrix is still too big! Interestingly, this data set occupies only 24 MB of RAM, but the result of model.matrix() is a matrix of more than 1 GB.
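To get a feel for the storage difference, here is a small comparison of dense versus sparse storage for a one-hot coded matrix (a sketch with far fewer columns than the post's 40,000; the exact sizes are machine-dependent):

```r
library(Matrix)

set.seed(1)
n <- 50; k <- 7; p <- 1000   # n rows, p factor variables with k levels each

# Dense one-hot coding: one 1 per (row, variable) pair, everything else 0
lev <- sample.int(k, n * p, replace = TRUE)   # level code for each (row, variable)
dense <- matrix(0, n, p * k)
dense[cbind(rep(seq_len(n), p),
            lev + rep((seq_len(p) - 1L) * k, each = n))] <- 1

# The same matrix in sparse form stores only the nonzero entries
sparse <- Matrix(dense, sparse = TRUE)

object.size(dense)    # roughly n * p * k * 8 bytes of doubles
object.size(sparse)   # roughly proportional to the number of nonzeros
```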

The solution I found was to build the matrix by hand. To do this we encode the data frame with dummy variables, column by column, and store the result in a sparse matrix. Then we use this matrix as input to the model and see whether it still exhausts memory:

library(Matrix)

## Create the matrix using the first column
X <- sparse.model.matrix(~ data[, 1] - 1)

## Encode the remaining predictor columns one by one
## (stop before the last column, which is the response `class`)
for (i in 2:(ncol(data) - 1)) {
    ## If the column has more than one level, apply dummy coding
    ## (nlevels() returns 0 for character vectors, so the columns must be factors)
    if (nlevels(data[, i]) > 1) {
        coluna <- sparse.model.matrix(~ data[, i] - 1)
        X <- cbind(X, coluna)   # the original post used cBind(), now deprecated in Matrix
    }
    ## Otherwise transform the factor to numeric
    else {
        coluna <- as.numeric(as.factor(data[, i]))
        X <- cbind(X, coluna)
    }
}

NOTE: pay attention to the fact that we are using sparse matrices, so the Matrix package is required. The original post combined the columns with cBind(); in current versions of Matrix, cBind() is deprecated and base cbind() dispatches correctly on sparse matrices.

The matrix generated this way was much smaller: less than 70 MB when I tested it. Fortunately glmnet() supports sparse matrices, so you can run the model:

mod.lasso <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

So you can create models with this type of data without blowing up memory and without using R packages for large data sets such as bigmemory and ff.

answered Sep 29 '22 by Flavio Barros


To whoever might be interested: I have developed an R package called biglasso that fits lasso-type models to big data. It works with a memory-mapped (big) design matrix based on the bigmemory package, and can seamlessly handle data-larger-than-RAM cases. Moreover, it is more computation- and memory-efficient than glmnet, thanks to newly proposed feature screening rules as well as a better implementation. Please check the GitHub page for details, and feel free to provide any suggestions/comments.

answered Sep 29 '22 by SixSigma