
Is cv.glmnet overfitting the data by using the full lambda sequence?

cv.glmnet is widely used in research papers and in industry. While building a similar cross-validation function for glmnet.cr (a related package that implements the lasso for continuation-ratio ordinal regression), I came across the following problem in cv.glmnet.

`cv.glmnet` first fits the model:



glmnet.object = glmnet(x, y, weights = weights, offset = offset,
                       lambda = lambda, ...)

After the glmnet object is created from the complete data, the next step is to extract the lambda sequence from that full-data fit:

lambda = glmnet.object$lambda

Next, it checks that the number of folds is at least 3:

if (nfolds < 3)
    stop("nfolds must be bigger than 3; nfolds=10 recommended")

A list is created to store the cross-validated fits, one per fold:

outlist = as.list(seq(nfolds))

A for loop then fits the model on each training subset, as cross-validation requires. Note that every fold reuses the lambda sequence computed from the complete data:

for (i in seq(nfolds)) {
  which = foldid == i
  if (is.matrix(y))
    y_sub = y[!which, ]
  else y_sub = y[!which]
  if (is.offset)
    offset_sub = as.matrix(offset)[!which, ]
  else offset_sub = NULL
  # using the lambda sequence from the complete data
  outlist[[i]] = glmnet(x[!which, , drop = FALSE],
                        y_sub, lambda = lambda, offset = offset_sub,
                        weights = weights[!which], ...)
}

So what happens? After the model is fitted to the complete data, cross-validation is carried out using the lambda sequence derived from the complete data. Can someone tell me how this can possibly not be over-fitting? In cross-validation we want the models fitted on the training folds to have no information about the left-out part of the data, yet cv.glmnet appears to cheat on exactly this point.
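To make the concern concrete, here is a minimal sketch (the simulated data and fold assignment are my own, purely for illustration) showing that the lambda sequence derived from the full data differs from the sequence a single training fold would generate for itself:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # 100 observations, 20 predictors
y <- rnorm(100)

# Lambda sequence derived from the complete data (what cv.glmnet reuses)
full_lambda <- glmnet(x, y)$lambda

# Lambda sequence the first training fold would derive on its own
foldid <- sample(rep(1:10, length.out = nrow(x)))
fold_lambda <- glmnet(x[foldid != 1, ], y[foldid != 1])$lambda

# The sequences differ, because lambda_max depends on the data used
head(full_lambda)
head(fold_lambda)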

Chamberlain Mbah asked Nov 06 '25

2 Answers

You're correct that using a cross-validated measure of fit to pick the "best" value of a tuning parameter introduces an optimistic bias into that measure when viewed as an estimate of the out-of-sample performance of the model with that "best" value. Any statistic has a sampling variance. But to talk of over-fitting seems to imply that optimization over the tuning parameter results in a degradation of out-of-sample performance compared to keeping it at a pre-specified value (say zero). That's unusual, in my experience: the optimization is very constrained (over a single parameter) compared to many other methods of feature selection.

In any case it's a good idea to validate the whole procedure, including the choice of tuning parameter, on a hold-out set, or with an outer cross-validation loop, or by bootstrapping. See Cross Validation (error generalization) after model selection.
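As a minimal sketch of such an outer cross-validation loop (the simulated data and the choice of five outer folds are assumptions for illustration, not part of the answer above):

library(glmnet)

set.seed(42)
x <- matrix(rnorm(200 * 30), 200, 30)
y <- rnorm(200)

# Outer loop: hold out each outer fold, run cv.glmnet (inner CV plus
# lambda selection) on the rest, then score the selected model on the
# held-out fold. This validates the whole procedure, including the
# choice of lambda.
outer_fold <- sample(rep(1:5, length.out = nrow(x)))
outer_mse <- numeric(5)

for (k in 1:5) {
  test <- outer_fold == k
  cvfit <- cv.glmnet(x[!test, ], y[!test])   # inner CV picks lambda
  pred <- predict(cvfit, x[test, ], s = "lambda.1se")
  outer_mse[k] <- mean((y[test] - pred)^2)
}

mean(outer_mse)   # honest estimate of out-of-sample error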

Scortchi answered Nov 08 '25


No, this is not overfitting.

cv.glmnet() does build the entire solution path for the lambda sequence, but you never pick the last entry in that path. You typically pick lambda == lambda.1se (or lambda == lambda.min), as @Fabians said:

lambda == lambda.min: the value of lambda at which the mean cross-validated error cvm is minimized

lambda == lambda.1se: the largest value of lambda whose cvm is within one standard error (cvsd) of that minimum. This is your optimal lambda

See the documentation for cv.glmnet() and coef(..., s = 'lambda.1se').
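A minimal usage sketch (the simulated data are an assumption for illustration):

library(glmnet)

set.seed(7)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)

cvfit <- cv.glmnet(x, y)

cvfit$lambda.min   # lambda minimizing the mean CV error cvm
cvfit$lambda.1se   # largest lambda within one SE of that minimum

# Coefficients of the model actually selected, not the last
# (smallest-lambda) entry of the solution path:
coef(cvfit, s = "lambda.1se")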

smci answered Nov 08 '25