Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Automatic caret parameter tuning fails in glmnet

Tags:

r

r-caret

glmnet

Context and error message

I try to fit a two-class prediction model using glmnet within caret. I incur an error when using the caret default tune grids. I don't think it is due to wrongly formatted data because, when specifying my own tuning grid, there is no problem. The error message is:

Error in loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)] : 
replacement has length zero

When checking the line at which the error occurs, one sees that R tries to find a maximum which.na() over a vector np of NA (the lambda values chosen by caret/glmnet?). I failed to debug this properly because I cannot find a way to step through each line of code after calling train(). I hope somebody with more experience can help me out.

Minimal working example

I created a minimal working example by making my dataset as small as possible (it started with ~200 rows and ~40 columns) while preserving the error. Note that manualModelFit works fine but modelFit cannot be computed:

library(caret)
library(glmnet)
# create data frame of features
var1 <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)
var2 <- c(1,1,1,1,1,0,1,1,1,1,1,0,1,1,0,1,1)
trainData <- data.frame(v1 = var1, v2 = var2)
# create fature vector of outcomes
trainClass <- as.factor(c('event','event','event','event','event','event','event','event','event','event','nonEvent','event','event','event','event','event','nonEvent'))
# set k for k-fold CV
kInner = 5
# set randomization seed
mySeed = 1622017
# set options for caret in fitControl
fitControl <- trainControl( method = 'cv', number = kInner, classProbs = TRUE, allowParallel = FALSE, summaryFunction = twoClassSummary, verboseIter = FALSE)
# run parameter tuning with a user-specified tuning grid
set.seed(mySeed)
myTuneGrid <- expand.grid(alpha = c(0,0.5,1), lambda = c(0,0.5,1))
manualModelFit <- train(x = trainData, y = trainClass, method = 'glmnet' , trControl = fitControl, metric = 'ROC', tuneGrid = myTuneGrid)
# run default parameter tuning
set.seed(mySeed)
modelFit <- train(x = trainData, y = trainClass, method = 'glmnet' , trControl = fitControl, metric = 'ROC')

The questions

What causes the failure? Is this a bug within caret/glmnet or is this due to a property of the dataset that I overlooked? This error occurs in multiple datasets that I analyze.

like image 729
workingonit Avatar asked Feb 27 '17 14:02

workingonit


2 Answers

Indeed, the problem is with tuneGrid. At line 225 of train.default there is the code

tuneGrid <- models$grid(x = x, y = y, len = tuneLength, 
            search = trControl$search)

which for your example gives me

  alpha lambda
1  0.10     NA
2  0.55     NA
3  1.00     NA
Warning messages:
1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
2: from glmnet Fortran code (error code -2); Convergence for 2th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned 

Obviously the NA's for lambda result in the loop later on. models$grid is the following function:

findGrid <- function (x, y, len = NULL, search = "grid") {
    if (search == "grid") {
        numLev <- if (is.character(y) | is.factor(y)) 
            length(levels(y))
        else NA
        if (!is.na(numLev)) {
            fam <- ifelse(numLev > 2, "multinomial", "binomial")
        }
        else fam <- "gaussian"
        init <- glmnet(as.matrix(x), y, family = fam, nlambda = len + 
                        2, alpha = 0.5)
        lambda <- unique(init$lambda)
        lambda <- lambda[-c(1, length(lambda))]
        lambda <- lambda[1:min(length(lambda), len)]
        out <- expand.grid(alpha = seq(0.1, 1, length = len), 
                           lambda = lambda)
    }
    else {
        out <- data.frame(alpha = runif(len, min = 0, 1), lambda = 2^runif(len, 
                                                                           min = -10, 3))
    }
    out
}

which I renamed to findGrid. If you run it with findGrid(trainData, trainClass, 3) you should get the same warning and faulty grid back. In this binary scenario, all it does is:

init <- glmnet(as.matrix(x), y, family = "binomial", nlambda = len + 2, alpha = 0.5)
lambda <- unique(init$lambda) # contains one value, 
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
out <- expand.grid(alpha = seq(0.1, 1, length = len), 
                   lambda = lambda)

Now, after lambda <- unique(init$lambda), lambda contains only one value which is 9.9e+35. So whatever was intended afterwards with the indices isn't working anymore and will create NA's instead. Increasing the number of iterations in glmnet did not avoid the error. So let's just skip those lines and use the grid obtained, to see if that fixes the problems.

init <- glmnet(as.matrix(x), y, family = "binomial", nlambda = len + 2, alpha = 0.5)
lambda <- unique(init$lambda) # contains one value, 
out <- expand.grid(alpha = seq(0.1, 1, length = len), lambda = lambda)
modelFit <- train(x = trainData, y = trainClass, method = 'glmnet' , trControl = fitControl, metric = 'ROC', 
                  tuneGrid = out) # <-- use the tuneGrid we made

Which runs but also gives me 17 warnings, all of the form:

Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Fold1: alpha=0.10, lambda=9.9e+35 Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  one multinomial or binomial class has 1 or 0 observations; not allowed

So you're going to have to find a way to make a proper grid. This can be done by somehow fixing glmnet or by making some guesses / trial and error. However, I'm hesitant in looking into a method for a tune grid in this answer because it could very well be a data specific issue. A starting point would be to see if your complete dataset also has few observations in some categories.

Also, to debug this yourself, it is easiest to call View(caret:::train.default) to see the function. ::: imports it from the hidden namespace. Next, you can copy all code into a train2 function and use browser statements to debug the code line by line (at least, that's what I did). Any other functions that R cannot find also have to be prefixed with caret:::.

like image 134
Vandenman Avatar answered Nov 14 '22 02:11

Vandenman


I was having this same issue, I thought I'd share my solution. As @Vandenman mentioned, you need a way of making a proper grid. This worked for me. Basically if you increase the number of lambdas you try in the init <- glmnet(...) step, you will get at least some that do not fail. I just picked 52 (I bet this number will work for you, but you could always change it, and computation time was negligible for my case). Then you choose len of them evenly spaced over those that did not fail.

my_glmnet <- getModelInfo("glmnet") %>% magrittr::extract2("glmnet")
my_glmnet$grid <- function (x, y, len = NULL, search = "grid") {
  if (search == "grid") {
    numLev <- if (is.character(y) | is.factor(y)) 
      length(levels(y))
    else NA
    if (!is.na(numLev)) {
      fam <- ifelse(numLev > 2, "multinomial", "binomial")
    }
    else fam <- "gaussian"
    init <- glmnet(as.matrix(x), y, family = fam, nlambda = 52, alpha = 0.5)
    lambda <- unique(init$lambda)
    lambda <- lambda[-c(1, length(lambda))]
    l_seq <- seq(1, length(lambda), length = len) %>% round %>% unique
    lambda <- lambda[l_seq]
    out <- expand.grid(alpha = seq(0.1, 1, length = len), 
                       lambda = lambda)
  }
  else {
    out <- data.frame(alpha = runif(len, min = 0, 1), lambda = 2^runif(len, 
                                                                       min = -10, 3))
  }
  out
}

Then you can run train with method = my_glmnet.

like image 26
Jonathan Gellar Avatar answered Nov 14 '22 03:11

Jonathan Gellar