I try to fit a two-class prediction model using glmnet within caret. I incur an error when using the caret default tune grids. I don't think it is due to wrongly formatted data because, when specifying my own tuning grid, there is no problem. The error message is:
Error in loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)] :
replacement has length zero
When checking the line at which the error occurs, one sees that R tries to find a maximum which.na()
over a vector np
of NA (the lambda values chosen by caret/glmnet?). I failed to debug this properly because I cannot find a way to step through each line of code after calling train()
. I hope somebody with more experience can help me out.
I created a minimal working example by making my dataset as small as possible (it started with ~200 rows and ~40 columns) while preserving the error. Note that manualModelFit
works fine but modelFit
cannot be computed:
library(caret)
library(glmnet)
# create data frame of features
var1 <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)
var2 <- c(1,1,1,1,1,0,1,1,1,1,1,0,1,1,0,1,1)
trainData <- data.frame(v1 = var1, v2 = var2)
# create fature vector of outcomes
trainClass <- as.factor(c('event','event','event','event','event','event','event','event','event','event','nonEvent','event','event','event','event','event','nonEvent'))
# set k for k-fold CV
kInner = 5
# set randomization seed
mySeed = 1622017
# set options for caret in fitControl
fitControl <- trainControl( method = 'cv', number = kInner, classProbs = TRUE, allowParallel = FALSE, summaryFunction = twoClassSummary, verboseIter = FALSE)
# run parameter tuning with a user-specified tuning grid
set.seed(mySeed)
myTuneGrid <- expand.grid(alpha = c(0,0.5,1), lambda = c(0,0.5,1))
manualModelFit <- train(x = trainData, y = trainClass, method = 'glmnet' , trControl = fitControl, metric = 'ROC', tuneGrid = myTuneGrid)
# run default parameter tuning
set.seed(mySeed)
modelFit <- train(x = trainData, y = trainClass, method = 'glmnet' , trControl = fitControl, metric = 'ROC')
What causes the failure? Is this a bug within caret/glmnet or is this due to a property of the dataset that I overlooked? This error occurs in multiple datasets that I analyze.
Indeed, the problem is with tuneGrid
. At line 225 of train.default
there is the code
tuneGrid <- models$grid(x = x, y = y, len = tuneLength,
search = trControl$search)
which for your example gives me
alpha lambda
1 0.10 NA
2 0.55 NA
3 1.00 NA
Warning messages:
1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has fewer than 8 observations; dangerous ground
2: from glmnet Fortran code (error code -2); Convergence for 2th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
Obviously the NA
's for lambda result in the loop later on. models$grid
is the following function:
findGrid <- function (x, y, len = NULL, search = "grid") {
if (search == "grid") {
numLev <- if (is.character(y) | is.factor(y))
length(levels(y))
else NA
if (!is.na(numLev)) {
fam <- ifelse(numLev > 2, "multinomial", "binomial")
}
else fam <- "gaussian"
init <- glmnet(as.matrix(x), y, family = fam, nlambda = len +
2, alpha = 0.5)
lambda <- unique(init$lambda)
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
out <- expand.grid(alpha = seq(0.1, 1, length = len),
lambda = lambda)
}
else {
out <- data.frame(alpha = runif(len, min = 0, 1), lambda = 2^runif(len,
min = -10, 3))
}
out
}
which I renamed to findGrid
. If you run it with findGrid(trainData, trainClass, 3)
you should get the same warning and faulty grid back. In this binary scenario, all it does is:
init <- glmnet(as.matrix(x), y, family = "binomial", nlambda = len + 2, alpha = 0.5)
lambda <- unique(init$lambda) # contains one value,
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
out <- expand.grid(alpha = seq(0.1, 1, length = len),
lambda = lambda)
Now, after lambda <- unique(init$lambda)
, lambda
contains only one value which is 9.9e+35
. So whatever was intended afterwards with the indices isn't working anymore and will create NA
's instead. Increasing the number of iterations in glmnet
did not avoid the error. So let's just skip those lines and use the grid obtained, to see if that fixes the problems.
init <- glmnet(as.matrix(x), y, family = "binomial", nlambda = len + 2, alpha = 0.5)
lambda <- unique(init$lambda) # contains one value,
out <- expand.grid(alpha = seq(0.1, 1, length = len), lambda = lambda)
modelFit <- train(x = trainData, y = trainClass, method = 'glmnet' , trControl = fitControl, metric = 'ROC',
tuneGrid = out) # <-- use the tuneGrid we made
Which runs but also gives me 17 warnings, all of the form:
Warning messages:
1: In eval(expr, envir, enclos) :
model fit failed for Fold1: alpha=0.10, lambda=9.9e+35 Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has 1 or 0 observations; not allowed
So you're going to have to find a way to make a proper grid. This can be done by somehow fixing glmnet
or by making some guesses / trial and error. However, I'm hesitant in looking into a method for a tune grid in this answer because it could very well be a data specific issue. A starting point would be to see if your complete dataset also has few observations in some categories.
Also, to debug this yourself, it is easiest to call View(caret:::train.default)
to see the function. :::
imports it from the hidden namespace. Next, you can copy all code into a train2
function and use browser statements to debug the code line by line (at least, that's what I did). Any other functions that R cannot find also have to be prefixed with caret:::
.
I was having this same issue, I thought I'd share my solution. As @Vandenman mentioned, you need a way of making a proper grid. This worked for me. Basically if you increase the number of lambdas you try in the init <- glmnet(...)
step, you will get at least some that do not fail. I just picked 52 (I bet this number will work for you, but you could always change it, and computation time was negligible for my case). Then you choose len
of them evenly spaced over those that did not fail.
my_glmnet <- getModelInfo("glmnet") %>% magrittr::extract2("glmnet")
my_glmnet$grid <- function (x, y, len = NULL, search = "grid") {
if (search == "grid") {
numLev <- if (is.character(y) | is.factor(y))
length(levels(y))
else NA
if (!is.na(numLev)) {
fam <- ifelse(numLev > 2, "multinomial", "binomial")
}
else fam <- "gaussian"
init <- glmnet(as.matrix(x), y, family = fam, nlambda = 52, alpha = 0.5)
lambda <- unique(init$lambda)
lambda <- lambda[-c(1, length(lambda))]
l_seq <- seq(1, length(lambda), length = len) %>% round %>% unique
lambda <- lambda[l_seq]
out <- expand.grid(alpha = seq(0.1, 1, length = len),
lambda = lambda)
}
else {
out <- data.frame(alpha = runif(len, min = 0, 1), lambda = 2^runif(len,
min = -10, 3))
}
out
}
Then you can run train
with method = my_glmnet
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With