I'm trying to run an elastic net, starting with LASSO and going from there. I can get it to run directly, but it fails when I try to run the same parameters using train in the caret package. I'd like to get train working so that I can use it to evaluate model parameters.
# Works
test <- enet( x=x, y=y, lambda=0, trace=TRUE, normalize=FALSE, intercept=FALSE )
# Doesn't work
enetGrid <- data.frame(.lambda=0,.fraction=c(.01,.001,.0005,.0001))
ctrl <- trainControl( method="repeatedcv", repeats=5 )
> test2 <- train( x, y, method="enet", tuneGrid=enetGrid, trControl=ctrl, preProc=NULL )
fraction lambda RMSE Rsquared RMSESD RsquaredSD
1 1e-04 0 NaN NaN NA NA
2 5e-04 0 NaN NaN NA NA
3 1e-03 0 NaN NaN NA NA
4 1e-02 0 NaN NaN NA NA
Error in train.default(x, y, method = "enet", tuneGrid = enetGrid, trControl = ctrl, :
final tuning parameters could not be determined
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
...
50: In eval(expr, envir, enclos) :
model fit failed for Fold10.Rep5: lambda=0, fraction=0.01 Error in enet(as.matrix(trainX), trainY, lambda = lmbda) :
Some of the columns of x have zero variance
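That error would make sense if a very sparse predictor ends up constant within a resampling fold. A rough check of that idea (a sketch only, assuming x and y are the sample data loaded above; createFolds is from caret):
library(caret)
set.seed(1)
folds <- createFolds(y, k = 10, returnTrain = TRUE)
# For each training fold, count predictors whose variance collapses to zero
sapply(folds, function(idx) sum(apply(x[idx, , drop = FALSE], 2, var) == 0))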
Note that any collinearity in the above example is just a result of subsetting down for a reproducible example (1,000 rows vs. 208,000 in the real dataset).
I've checked the full dataset in various ways, including findLinearCombos. Note that a few hundred of the variables are dummied out from clinical diagnoses and thus are binary with a low proportion of 1's.
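For reference, a minimal sketch of those checks (findLinearCombos and nearZeroVar are caret functions; x here stands in for the full predictor matrix):
library(caret)
# Exact linear dependencies among the predictors
lin <- findLinearCombos(as.matrix(x))
lin$remove                       # columns that could be dropped, if any
# Zero- and near-zero-variance predictors
nzv <- nearZeroVar(x, saveMetrics = TRUE)
sum(nzv$zeroVar)                 # strictly zero-variance columns
sum(nzv$nzv)                     # near-zero-variance columns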
How do I get train(..., method="enet") to run using the exact same settings as enet()?
Data for reproducibility, sessionInfo, etc.
Sample data x and y are available here.
Results of sessionInfo():
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C LC_PAPER=C
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] parallel splines grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] scales_0.2.3 elasticnet_1.1 fscaret_0.8.5.3 gsubfn_0.6-5 proto_0.3-10 lars_1.2 caret_5.17-7
[8] foreach_1.4.1 cluster_1.14.4 lubridate_1.3.0 HH_2.3-37 reshape_0.8.4 latticeExtra_0.6-24 leaps_2.9
[15] multcomp_1.2-18 perturb_2.05 Zelig_4.2-0 sandwich_2.2-10 zoo_1.7-10 survey_3.29-5 Hmisc_3.12-2
[22] survival_2.37-4 lme4_0.999999-2 bayesm_2.2-5 stargazer_4.0 pscl_1.04.4 vcd_1.2-13 colorspace_1.2-2
[29] mvtnorm_0.9-9995 car_2.0-18 nnet_7.3-7 gdata_2.13.2 gtools_3.0.0 spBayes_0.3-7 Formula_1.1-1
[36] magic_1.5-4 abind_1.4-0 MapGAM_0.6-2 gam_1.08 fields_6.7.6 maps_2.3-2 spam_0.29-3
[43] FNN_1.0 spatstat_1.31-3 mgcv_1.7-24 rgeos_0.2-19 RArcInfo_0.4-12 automap_1.0-12 gstat_1.0-16
[50] SDMTools_1.1-13 rgdal_0.8-10 spdep_0.5-60 coda_0.16-1 deldir_0.0-22 maptools_0.8-25 nlme_3.1-110
[57] MASS_7.3-27 Matrix_1.0-12 lattice_0.20-15 boot_1.3-9 data.table_1.8.8 xtable_1.7-1 RCurl_1.95-4.1
[64] bitops_1.0-5 RColorBrewer_1.0-5 testthat_0.7.1 codetools_0.2-8 devtools_1.3 stringr_0.6.2 foreign_0.8-54
[71] ggplot2_0.9.3.1 sp_1.0-11 taRifx_1.0.5 reshape2_1.2.2 plyr_1.8 functional_0.4 R.utils_1.25.2
[78] R.oo_1.13.9 R.methodsS3_1.4.4
loaded via a namespace (and not attached):
[1] LearnBayes_2.12 compiler_3.0.1 dichromat_2.0-0 digest_0.6.3 evaluate_0.4.4 gtable_0.1.2 httr_0.2 intervals_0.14.0 iterators_1.0.6
[10] labeling_0.2 memoise_0.1 munsell_0.4.2 rpart_4.1-1 spacetime_1.0-5 stats4_3.0.1 tcltk_3.0.1 tools_3.0.1 whisker_0.3-2
[19] xts_0.9-5
Update
Run on a 15% sample of the dataset:
Warning in eval(expr, envir, enclos) :
model fit failed for Fold10.Rep1: lambda=0, fraction=0.005
... (more of the same warning messages) ...
Warning in nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, :
There were missing values in resampled performance measures.
Error in if (lambda > 0) { : argument is of length zero
Calls: train ... train.default -> system.time -> createModel -> enet
There are 806 columns in the X matrix, 801 of them dummies. Many of these dummies are extremely sparse (1-3 observations out of roughly 25,000 rows); others have 0.1-5% of their values as TRUE. In total there are 108,867 TRUEs and about 21 million FALSEs.
Just to bring some resolution to this, I have it working now. I dropped all columns with fewer than 20 TRUEs (remember, this is out of almost 200k observations) as simply having insufficient information to contribute. This wound up being about half of them.
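A minimal sketch of that filtering step (assumptions: x is the full predictor matrix, the dummies are coded 0/1, and the cutoff is the 20-TRUE threshold described above):
# Identify the binary 0/1 dummy columns
is_dummy <- apply(x, 2, function(col) all(col %in% c(0, 1)))
# Count the TRUEs (1's) in each dummy column
true_counts <- colSums(x[, is_dummy, drop = FALSE])
# Drop dummies with fewer than 20 TRUEs; non-dummy columns are kept
sparse_dummies <- names(true_counts)[true_counts < 20]
x_filtered <- x[, !(colnames(x) %in% sparse_dummies), drop = FALSE]
dim(x_filtered)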
I will have to be cautious that these sparse columns don't introduce too much bias as I move forward, but I am hoping that using a method that does variable selection (lasso, RF, etc.) will make that less of a problem.
Thanks to @O_Devinyak for the help.