
Columns not available for when training lasso model using caret

I am getting an odd error message when using caret to train a glmnet model:

Error in `[.data.frame`(data, , lvls[1]) : undefined columns selected

I have used basically the same code and the same predictors for an ordinal model (just with a different factor as y) and it worked fine (it took 400 core hours to compute, so I can't show it here, though).

# Load the required packages
library(caret)
library(dplyr)

# Source a small subset of data
source("https://gist.githubusercontent.com/FredrikKarlssonSpeech/ebd9fccf1de6789a3f529cafc496a90c/raw/efc130e41c7d01d972d1c69e59bf8f5f5fea58fa/voice.R")
trainIndex <- createDataPartition(notna$RC, p = .75, 
                                  list = FALSE, 
                                  times = 1)


training <- notna[ trainIndex[,1],] %>%
  select(RC,FCoM_envel:ATrPS_freq,`Jitter->F0_abs_dif`:RPDE)
testing  <- notna[-trainIndex[,1],] %>%
  select(RC,FCoM_envel:ATrPS_freq,`Jitter->F0_abs_dif`:RPDE)

fitControl <- trainControl(## 10-fold CV
  method = "CV",
  number = 10,
  allowParallel=TRUE,
  savePredictions="final",
  summaryFunction=twoClassSummary)

vtCVFit <- train(x=training[-1],y=training[,"RC"], 
                  method = "glmnet", 
                  trControl = fitControl,
                  preProcess=c("center", "scale"),
                  metric="Kappa"
)

I can't find anything obviously wrong with the data. There are no NAs:

table(is.na(training))

FALSE 
43166

and I don't see why it would try to index outside the number of columns.

Any suggestions?
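(For context, this is just an illustration of where the message itself comes from, not my actual data: base R's `[.data.frame` raises exactly this error whenever a requested column name does not exist.)

```r
df <- data.frame(a = 1:3)

# Indexing a column that is not in the data frame reproduces the message:
msg <- tryCatch(df[, "b"], error = function(e) conditionMessage(e))
msg
# "undefined columns selected"
```

So something inside caret's summary machinery must be asking the prediction data frame for a column it does not contain.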

Asked Sep 04 '18 by Fredrik Karlsson

2 Answers

You have to remove summaryFunction=twoClassSummary from your trainControl(). It works for me.

fitControl <- trainControl(## 10-fold CV
  method = "CV",
  number = 10,
  allowParallel = TRUE,
  savePredictions = "final")

vtCVFit <- train(x = training[-1], y = training[, "RC"],
                 method = "glmnet",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 metric = "Kappa")

print(vtCVFit)

#glmnet 

#113 samples
#381 predictors
#  2 classes: 'NVT', 'VT' 

#Pre-processing: centered (381), scaled (381) 
#Resampling: Bootstrapped (25 reps) 
#Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
#Resampling results across tuning parameters:

#  alpha  lambda      Accuracy   Kappa    
#  0.10   0.01113752  0.5778182  0.1428393
#  0.10   0.03521993  0.5778182  0.1428393
#  0.10   0.11137520  0.5778182  0.1428393
#  0.55   0.01113752  0.5778182  0.1428393
#  0.55   0.03521993  0.5748248  0.1407333
#  0.55   0.11137520  0.5749980  0.1136131
#  1.00   0.01113752  0.5815391  0.1531280
#  1.00   0.03521993  0.5800217  0.1361240
#  1.00   0.11137520  0.5939621  0.1158007

#Kappa was used to select the optimal model using the largest value.
#The final values used for the model were alpha = 1 and lambda = 0.01113752.
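A note on why removing it helps (my reading of caret's behaviour, not confirmed by the asker): twoClassSummary computes ROC/Sens/Spec from per-class probability columns, which it looks up by class level, i.e. data[, lvls[1]]. Those probability columns are only created when classProbs = TRUE is set in trainControl(), so without it the lookup fails with "undefined columns selected". If you actually want ROC-based model selection rather than Kappa, a sketch that keeps twoClassSummary (assuming the same training data as in the question):

```r
library(caret)

fitControl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,               # creates the per-class probability columns
  summaryFunction = twoClassSummary,
  savePredictions = "final",
  allowParallel = TRUE)

# metric = "ROC" matches what twoClassSummary actually reports
vtCVFit <- train(x = training[-1], y = training[, "RC"],
                 method = "glmnet",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 metric = "ROC")
```

Note that classProbs = TRUE requires the factor levels of y to be valid R variable names, which 'NVT' and 'VT' already are.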
Answered Sep 21 '22 by Alex Yahiaoui Martinez

Change your factors to character with the following code and see if it works:

training <- data.frame(lapply(training, as.character), stringsAsFactors = FALSE)

I would have left this suggestion as a comment, but I wasn't able to since I have less than 50 reputation!

Answered Sep 23 '22 by Shirin Yavari