
Error tuning SVM in R

I'm tuning an SVM in R and I receive the following error:

#Error in if (any(co)) { : missing value where TRUE/FALSE needed

I'm using the caret package:

svmRTune <- train(x=dataTrain[,predModelContinuous],y=dataTrain[,outcome],method = "svmRadial", tuneLength = 14, trControl = trCtrl)

The training set structure is:

str(dataTrain)
'data.frame':   40001 obs. of  42 variables:
 $ PolNum     : num  2e+08 2e+08 2e+08 2e+08 2e+08 ...
 $ sex        : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 1 1 1 2 ...
 $ type       : Factor w/ 6 levels "A","B","C","D",..: 3 1 1 2 2 4 3 3 3 2 ...
 $ catgry     : Ord.factor w/ 3 levels "Large"<"Medium"<..: 2 2 2 3 3 3 3 2 2 2 ...
 $ occup      : Factor w/ 5 levels "Employed","Housewife",..: 2 1 1 1 5 4 1 1 4 2 ...
 $ age        : num  48 23 23 39 24 39 28 43 45 38 ...
 $ group      : Factor w/ 20 levels "1","2","3","4",..: 15 16 12 16 14 8 16 9 12 8 ...
 $ bonus      : Ord.factor w/ 21 levels "-50"<"-40"<"-30"<..: 14 8 4 3 5 2 5 5 1 15 ...
 $ poldur     : num  7 1 1 14 2 4 11 2 8 5 ...
 $ value      : num  1120 21755 18430 11930 24850 ...
 $ adind      : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 2 2 1 1 ...
 $ Pcode      : chr  "SC22" "CT109" "MA1" "SA12" ...
 $ Area       : Factor w/ 10 levels "CT","JU","MA",..: 7 1 3 6 6 6 6 4 1 2 ...
 $ Density    : num  270.5 57.3 43.2 167.9 169.8 ...
 $ Prem       : num  1159 532 527 197 908 ...
 $ Premad     : num  53.1 413.7 410.7 61.6 824.6 ...
 $ numclm     : num  0 1 0 1 0 0 0 1 0 0 ...
 $ Invite     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Renewaltp  : num  1302 928 632 291 960 ...
 $ Renewalad  : num  58.4 599 440.4 71.3 682 ...
 $ Markettp   : num  1110 884 565 253 833 ...
 $ Marketad   : num  53.4 611.4 431.6 55.5 587 ...
 $ Premtot    : num  1212 532 527 259 908 ...
 $ Renewaltot : num  1361 928 632 362 960 ...
 $ Markettot  : num  1163 884 565 309 833 ...
 $ Renew      : Ord.factor w/ 2 levels "No"<"Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Premchng   : num  1.12 1.74 1.2 1.4 1.06 ...
 $ Compmeas   : num  1.17 1.05 1.12 1.17 1.15 ...
 $ numclmRec  : Ord.factor w/ 3 levels "None"<"One"<"Two or more": 1 2 1 2 1 1 1 2 1 1 ...
 $ PremChngRec: Factor w/ 20 levels "[0.546,0.758)",..: 16 20 18 19 14 3 7 19 17 11 ...
 $ ageRec     : Factor w/ 20 levels "[19,22)","[22,25)",..: 14 2 2 9 2 9 4 11 12 9 ...
 $ valueRec   : Factor w/ 20 levels "[ 1005, 3290)",..: 1 15 13 9 17 5 12 12 19 1 ...
 $ densityRec : Factor w/ 20 levels "[ 14.4, 25.0)",..: 19 6 5 15 15 13 15 1 5 11 ...
 $ CompmeasRec: Factor w/ 20 levels "[0.716,0.869)",..: 12 6 10 13 12 18 11 16 18 14 ...
 $ poldurRec  : Ord.factor w/ 16 levels "1"<"2"<"3"<"4"<..: 7 1 1 14 2 4 11 2 8 5 ...
 $ ageST      : num  0.407 -1.34 -1.34 -0.222 -1.27 ...
 $ numclmST   : num  -0.433 1.627 -0.433 1.627 -0.433 ...
 $ PremchngST : num  0.591 3.709 0.98 1.985 0.265 ...
 $ valueST    : num  -1.462 0.499 0.183 -0.434 0.793 ...
 $ DensityST  : num  1.918 -0.748 -0.924 0.636 0.659 ...
 $ CompmeasST : num  0.224 -0.539 -0.098 0.248 0.113 ...
 $ poldurST   : num  0.097 -1.2 -1.2 1.61 -0.984 ...

and the session info is:

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:

[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252  

attached base packages:

 [1] parallel  splines   grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base  

other attached packages:

 [1] C50_0.1.0-16       kernlab_0.9-19     nnet_7.3-8         plyr_1.8.1        
 [5] gbm_2.1            randomForest_4.6-7 rpart_4.1-8        klaR_0.6-10       
 [9] MASS_7.3-31        doParallel_1.0.8   iterators_1.0.6    foreach_1.4.1     
[13] pROC_1.7.1         mda_0.4-4          class_7.3-10       earth_3.2-7       
[17] plotrix_3.5-5      plotmo_1.3-3       Formula_1.1-1      survival_2.37-7   
[21] caret_6.0-24       ggplot2_0.9.3.1    lattice_0.20-29    rj_1.1.3-1        

loaded via a namespace (and not attached):

 [1] car_2.0-19          cluster_1.15.2      codetools_0.2-8    
 [4] colorspace_1.2-4    combinat_0.0-8      compiler_3.0.2     
 [7] dichromat_2.0-0     digest_0.6.4        gtable_0.1.2       
[10] Hmisc_3.14-3        labeling_0.2        latticeExtra_0.6-26
[13] munsell_0.4.2       proto_0.3-10        RColorBrewer_1.0-5 
[16] Rcpp_0.11.1         reshape2_1.2.2      rj.gd_1.1.3-1      
[19] scales_0.2.3        stringr_0.6.2       tools_3.0.2    
Giorgio Spedicato asked Apr 06 '14 11:04



2 Answers

Just posting in case anyone else runs across this problem. It appears to be caused by including a factor or character variable in your training data set.

I do not know why the SVM cannot take a factor variable. I replaced my factors with hand-coded dummy variables and it worked fine, but the approach was too inelegant to document.
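To check whether this applies to your data, you can list the predictor columns that are not numeric before calling train. This is only a minimal sketch in base R: the small `predictors` data frame below is made up for illustration, modelled on a few columns of the question's dataTrain.

```r
# Hypothetical stand-in for dataTrain[, predModelContinuous]
predictors <- data.frame(
  age   = c(48, 23, 39),
  sex   = factor(c("Male", "Male", "Female")),
  Pcode = c("SC22", "CT109", "MA1"),
  stringsAsFactors = FALSE
)

# Flag every column that is not numeric (factor, ordered factor, character, ...)
is_num <- vapply(predictors, is.numeric, logical(1))
non_numeric <- names(predictors)[!is_num]
print(non_numeric)

# Keep only the numeric columns; drop = FALSE preserves the data frame
numeric_only <- predictors[, is_num, drop = FALSE]
```

If `non_numeric` is not empty, those columns need to be dropped or recoded as numeric dummies before being passed as `x`.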

Dan Brown answered Nov 10 '22 11:11



I can confirm Dan Brown's answer: the error seems to be caused by having factors in the data. I wrote the following code to turn factors into dummy variables. It is not especially pretty, but it does the job.

library("foreach")

# Helper function; call factor_to_dummy() below instead.
# Takes a column name (pointing to a factor variable) and a dataset,
# returns a data frame containing a 1-in-K coding for this factor variable.
col_to_dummy <- function(colname, data) {
  # dummy is a data frame of K columns, where K is the number of levels of the factor in colname;
  # it is a 1-in-K dummy variable coding
  levelnames <- levels(data[[colname]])
  dummy <- foreach(i = seq_along(levelnames), .combine = cbind) %do% {
    as.numeric(as.numeric(data[[colname]]) == i)
  }
  dummy <- as.data.frame(dummy)
  names(dummy) <- paste0(colname, ":", levelnames)
  dummy
}

factor_to_dummy <- function(obsdata) {

  # find the columns containing a factor variable
  col_factor <- unlist(lapply(FUN = is.factor, obsdata))

  # if there are none, there is nothing to do
  if (!any(col_factor)) {
    return(obsdata)
  }
  # otherwise, convert each factor column to dummy variables using col_to_dummy;
  # each resulting data frame is column-bound to the dataset without factors
  # (drop = FALSE keeps a data frame even if only one non-factor column remains)
  foreach(colname = names(which(col_factor)), .combine = cbind,
          .init = obsdata[, !col_factor, drop = FALSE]) %do% {
    col_to_dummy(colname, obsdata)
  }
}

Some solutions out there use model.matrix, but be aware that by default model.matrix includes an intercept and then encodes each factor with K-1 columns relative to a reference level. You will need to adjust the contrasts (the contrasts.arg argument) to get a full 1-of-K coding.
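For comparison, here is a rough sketch of the model.matrix route in base R. The toy data frame `df` is invented for illustration. Passing full (non-reduced) contrast matrices through contrasts.arg and dropping the intercept should give one indicator column per factor level, but treat it as an illustration rather than a drop-in replacement for the functions above.

```r
# Toy data: one numeric column and two factors (3 and 2 levels)
df <- data.frame(
  x = c(1.5, 2.0, 3.5),
  f = factor(c("a", "b", "c")),
  g = factor(c("u", "v", "u"))
)

# Default behaviour: intercept plus K-1 columns per factor
default_mm <- model.matrix(~ ., data = df)

# Full 1-of-K coding: no intercept, and an identity contrast matrix per factor
factor_cols <- df[sapply(df, is.factor)]
full_mm <- model.matrix(
  ~ . - 1,
  data = df,
  contrasts.arg = lapply(factor_cols, contrasts, contrasts = FALSE)
)

print(colnames(default_mm))
print(colnames(full_mm))
```

Note that model.matrix returns a numeric matrix rather than a data frame, which is usually fine for the `x` argument of train.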

This code is really easy to use. Once the function definitions have been run, you can simply do:

df_with_dummy_vars <- factor_to_dummy(original_df)

All factor columns will be converted to dummy variables.

asachet answered Nov 10 '22 12:11
