
Caret method = "rf" warning message: invalid mtry: reset to within valid range

I am working on a Coursera Machine Learning project. The goal is to build a predictive model for the following dataset.

> summary(training)
   roll_belt        pitch_belt          yaw_belt       total_accel_belt  gyros_belt_x      
 Min.   :-28.90   Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00    Min.   :-1.040000  
 1st Qu.:  1.10   1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00    1st Qu.:-0.030000  
 Median :113.00   Median :  5.2800   Median : -13.00   Median :17.00    Median : 0.030000  
 Mean   : 64.41   Mean   :  0.3053   Mean   : -11.21   Mean   :11.31    Mean   :-0.005592  
 3rd Qu.:123.00   3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00    3rd Qu.: 0.110000  
 Max.   :162.00   Max.   : 60.3000   Max.   : 179.00   Max.   :29.00    Max.   : 2.220000  
  gyros_belt_y       gyros_belt_z      accel_belt_x       accel_belt_y     accel_belt_z     magnet_belt_x  
 Min.   :-0.64000   Min.   :-1.4600   Min.   :-120.000   Min.   :-69.00   Min.   :-275.00   Min.   :-52.0  
 1st Qu.: 0.00000   1st Qu.:-0.2000   1st Qu.: -21.000   1st Qu.:  3.00   1st Qu.:-162.00   1st Qu.:  9.0  
 Median : 0.02000   Median :-0.1000   Median : -15.000   Median : 35.00   Median :-152.00   Median : 35.0  
 Mean   : 0.03959   Mean   :-0.1305   Mean   :  -5.595   Mean   : 30.15   Mean   : -72.59   Mean   : 55.6  
 3rd Qu.: 0.11000   3rd Qu.:-0.0200   3rd Qu.:  -5.000   3rd Qu.: 61.00   3rd Qu.:  27.00   3rd Qu.: 59.0  
 Max.   : 0.64000   Max.   : 1.6200   Max.   :  85.000   Max.   :164.00   Max.   : 105.00   Max.   :485.0  
 magnet_belt_y   magnet_belt_z       roll_arm         pitch_arm          yaw_arm          total_accel_arm
 Min.   :354.0   Min.   :-623.0   Min.   :-180.00   Min.   :-88.800   Min.   :-180.0000   Min.   : 1.00  
 1st Qu.:581.0   1st Qu.:-375.0   1st Qu.: -31.77   1st Qu.:-25.900   1st Qu.: -43.1000   1st Qu.:17.00  
 Median :601.0   Median :-320.0   Median :   0.00   Median :  0.000   Median :   0.0000   Median :27.00  
 Mean   :593.7   Mean   :-345.5   Mean   :  17.83   Mean   : -4.612   Mean   :  -0.6188   Mean   :25.51  
 3rd Qu.:610.0   3rd Qu.:-306.0   3rd Qu.:  77.30   3rd Qu.: 11.200   3rd Qu.:  45.8750   3rd Qu.:33.00  
 Max.   :673.0   Max.   : 293.0   Max.   : 180.00   Max.   : 88.500   Max.   : 180.0000   Max.   :66.00  
  gyros_arm_x        gyros_arm_y       gyros_arm_z       accel_arm_x       accel_arm_y    
 Min.   :-6.37000   Min.   :-3.4400   Min.   :-2.3300   Min.   :-404.00   Min.   :-318.0  
 1st Qu.:-1.33000   1st Qu.:-0.8000   1st Qu.:-0.0700   1st Qu.:-242.00   1st Qu.: -54.0  
 Median : 0.08000   Median :-0.2400   Median : 0.2300   Median : -44.00   Median :  14.0  
 Mean   : 0.04277   Mean   :-0.2571   Mean   : 0.2695   Mean   : -60.24   Mean   :  32.6  
 3rd Qu.: 1.57000   3rd Qu.: 0.1400   3rd Qu.: 0.7200   3rd Qu.:  84.00   3rd Qu.: 139.0  
 Max.   : 4.87000   Max.   : 2.8400   Max.   : 3.0200   Max.   : 437.00   Max.   : 308.0  
  accel_arm_z       magnet_arm_x     magnet_arm_y     magnet_arm_z    roll_dumbbell     pitch_dumbbell   
 Min.   :-636.00   Min.   :-584.0   Min.   :-392.0   Min.   :-597.0   Min.   :-153.71   Min.   :-149.59  
 1st Qu.:-143.00   1st Qu.:-300.0   1st Qu.:  -9.0   1st Qu.: 131.2   1st Qu.: -18.49   1st Qu.: -40.89  
 Median : -47.00   Median : 289.0   Median : 202.0   Median : 444.0   Median :  48.17   Median : -20.96  
 Mean   : -71.25   Mean   : 191.7   Mean   : 156.6   Mean   : 306.5   Mean   :  23.84   Mean   : -10.78  
 3rd Qu.:  23.00   3rd Qu.: 637.0   3rd Qu.: 323.0   3rd Qu.: 545.0   3rd Qu.:  67.61   3rd Qu.:  17.50  
 Max.   : 292.00   Max.   : 782.0   Max.   : 583.0   Max.   : 694.0   Max.   : 153.55   Max.   : 149.40  
  yaw_dumbbell      total_accel_dumbbell gyros_dumbbell_x    gyros_dumbbell_y   gyros_dumbbell_z 
 Min.   :-150.871   Min.   : 0.00        Min.   :-204.0000   Min.   :-2.10000   Min.   : -2.380  
 1st Qu.: -77.644   1st Qu.: 4.00        1st Qu.:  -0.0300   1st Qu.:-0.14000   1st Qu.: -0.310  
 Median :  -3.324   Median :10.00        Median :   0.1300   Median : 0.03000   Median : -0.130  
 Mean   :   1.674   Mean   :13.72        Mean   :   0.1611   Mean   : 0.04606   Mean   : -0.129  
 3rd Qu.:  79.643   3rd Qu.:19.00        3rd Qu.:   0.3500   3rd Qu.: 0.21000   3rd Qu.:  0.030  
 Max.   : 154.952   Max.   :58.00        Max.   :   2.2200   Max.   :52.00000   Max.   :317.000  
 accel_dumbbell_x  accel_dumbbell_y  accel_dumbbell_z  magnet_dumbbell_x magnet_dumbbell_y
 Min.   :-419.00   Min.   :-189.00   Min.   :-334.00   Min.   :-643.0    Min.   :-3600    
 1st Qu.: -50.00   1st Qu.:  -8.00   1st Qu.:-142.00   1st Qu.:-535.0    1st Qu.:  231    
 Median :  -8.00   Median :  41.50   Median :  -1.00   Median :-479.0    Median :  311    
 Mean   : -28.62   Mean   :  52.63   Mean   : -38.32   Mean   :-328.5    Mean   :  221    
 3rd Qu.:  11.00   3rd Qu.: 111.00   3rd Qu.:  38.00   3rd Qu.:-304.0    3rd Qu.:  390    
 Max.   : 235.00   Max.   : 315.00   Max.   : 318.00   Max.   : 592.0    Max.   :  633    
 magnet_dumbbell_z  roll_forearm       pitch_forearm     yaw_forearm      total_accel_forearm
 Min.   :-262.00   Min.   :-180.0000   Min.   :-72.50   Min.   :-180.00   Min.   :  0.00     
 1st Qu.: -45.00   1st Qu.:  -0.7375   1st Qu.:  0.00   1st Qu.: -68.60   1st Qu.: 29.00     
 Median :  13.00   Median :  21.7000   Median :  9.24   Median :   0.00   Median : 36.00     
 Mean   :  46.05   Mean   :  33.8265   Mean   : 10.71   Mean   :  19.21   Mean   : 34.72     
 3rd Qu.:  95.00   3rd Qu.: 140.0000   3rd Qu.: 28.40   3rd Qu.: 110.00   3rd Qu.: 41.00     
 Max.   : 452.00   Max.   : 180.0000   Max.   : 89.80   Max.   : 180.00   Max.   :108.00     
 gyros_forearm_x   gyros_forearm_y     gyros_forearm_z    accel_forearm_x   accel_forearm_y 
 Min.   :-22.000   Min.   : -7.02000   Min.   : -8.0900   Min.   :-498.00   Min.   :-632.0  
 1st Qu.: -0.220   1st Qu.: -1.46000   1st Qu.: -0.1800   1st Qu.:-178.00   1st Qu.:  57.0  
 Median :  0.050   Median :  0.03000   Median :  0.0800   Median : -57.00   Median : 201.0  
 Mean   :  0.158   Mean   :  0.07517   Mean   :  0.1512   Mean   : -61.65   Mean   : 163.7  
 3rd Qu.:  0.560   3rd Qu.:  1.62000   3rd Qu.:  0.4900   3rd Qu.:  76.00   3rd Qu.: 312.0  
 Max.   :  3.970   Max.   :311.00000   Max.   :231.0000   Max.   : 477.00   Max.   : 923.0  
 accel_forearm_z   magnet_forearm_x  magnet_forearm_y magnet_forearm_z classe  
 Min.   :-446.00   Min.   :-1280.0   Min.   :-896.0   Min.   :-973.0   A:5580  
 1st Qu.:-182.00   1st Qu.: -616.0   1st Qu.:   2.0   1st Qu.: 191.0   B:3797  
 Median : -39.00   Median : -378.0   Median : 591.0   Median : 511.0   C:3422  
 Mean   : -55.29   Mean   : -312.6   Mean   : 380.1   Mean   : 393.6   D:3216  
 3rd Qu.:  26.00   3rd Qu.:  -73.0   3rd Qu.: 737.0   3rd Qu.: 653.0   E:3607  
 Max.   : 291.00   Max.   :  672.0   Max.   :1480.0   Max.   :1090.0           

For training the model, I did the following:

trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfModel <- train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE)

The model trained successfully. However, the warning message "invalid mtry: reset to within valid range" was repeated up to 20 times. A few Google searches did not turn up anything useful. In case it matters, there were no NA values in the dataset; they were removed in a prior step.

I also timed the call with system.time(). The elapsed time was about 7,044 seconds, nearly two hours.

> system.time(train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE))
    user   system  elapsed 
6478.113  302.281 7044.483 

If you can help explain what this warning message means and why it appears, that would be great. I would also welcome any comments on the long processing time.

Thank you!

useryk asked Mar 09 '18 03:03


1 Answer

The caret rf method uses the randomForest function from the randomForest package. If you set the mtry argument of randomForest to a value greater than the number of predictor variables, you'll get the warning you posted (for example, try rf = randomForest(mpg ~ ., mtry=15, data=mtcars)). The model still runs, but randomForest sets mtry to a lower, valid value.
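The behaviour described above can be checked directly. The following is a minimal sketch, assuming the randomForest package is installed; the names n_predictors and warning_text are introduced here only for illustration. It captures the warning instead of printing it and then inspects the mtry value stored on the fitted object.

```r
library(randomForest)

set.seed(1)

# mpg is the response, so every other column of mtcars is a predictor.
n_predictors <- ncol(mtcars) - 1

# Fit with an mtry larger than the number of predictors and record the
# warning text rather than letting it print.
warning_text <- NULL
rf <- withCallingHandlers(
  randomForest(mpg ~ ., data = mtcars, mtry = 15),
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

warning_text   # the message raised by randomForest
rf$mtry        # the mtry value the forest actually used
n_predictors   # upper bound for a valid mtry
```

Comparing rf$mtry with n_predictors shows what value the request for 15 was reset to.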

The question is, why is train (or one of the functions it calls) feeding randomForest an mtry value that's too large? I'm not sure, but here's a guess: Setting preProcess="pca" reduces the number of features being fed to randomForest (relative to the number of features in the raw data), because the least important principal components are discarded to reduce the dimensionality of the feature set. However, when doing cross-validation, it's possible that train nevertheless sets the maximum mtry value for randomForest based on the larger number of features in the raw data, rather than based on the pre-processed data set that's actually fed to randomForest. Circumstantial evidence for this is that the warning goes away if you remove the preProcess="pca" argument, but I didn't check any further than that.
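If that guess is right, one possible workaround is to keep PCA but make sure every candidate mtry fits the reduced feature set. The sketch below is an assumption-laden illustration, not a confirmed fix: it first counts how many components caret's preProcess keeps for the mtcars predictors, then pins the number of components with the pcaComp option (passed through preProcOptions in trainControl) and supplies a tuneGrid whose mtry values never exceed it. The value n_comp = 3 and the 5-fold setting are arbitrary choices for the example.

```r
library(caret)

set.seed(1)

# How many columns survive PCA at preProcess's default variance threshold?
x   <- mtcars[, setdiff(names(mtcars), "mpg")]
pcs <- predict(preProcess(x, method = "pca"), x)
c(raw = ncol(x), pca = ncol(pcs))

# Hypothetical workaround: fix the number of components and keep mtry
# at or below that number.
n_comp <- 3  # illustrative choice, not a recommendation
ctrl <- trainControl(method = "cv", number = 5,
                     preProcOptions = list(pcaComp = n_comp))
grid <- expand.grid(mtry = seq_len(n_comp))

# Collect any warnings raised while training.
seen <- character(0)
fit <- withCallingHandlers(
  train(mpg ~ ., data = mtcars, method = "rf", preProcess = "pca",
        trControl = ctrl, tuneGrid = grid),
  warning = function(w) {
    seen <<- c(seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

any(grepl("invalid mtry", seen))  # FALSE if every mtry was within range
```

Fixing the component count trades the variance-based cutoff for a predictable upper bound on mtry; whether that trade is acceptable depends on the data.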

Reproducible code showing that the warning appears with preProcess = "pca" and goes away without it:

library(caret)
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# With PCA pre-processing: the mtry warning appears
rfModel <- train(mpg ~ ., method = "rf", trControl = trainCtrl, preProcess = "pca", data = mtcars, prox = TRUE)
# Without PCA pre-processing: no warning
rfModel <- train(mpg ~ ., method = "rf", trControl = trainCtrl, data = mtcars, prox = TRUE)
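On the run time raised in the question, one plausible contributor (not something measured here) is prox = TRUE. It asks randomForest to build a proximity matrix with one row and one column per training row, and train refits the forest for every fold and every candidate mtry. The sketch below uses the built-in iris data and a small ntree purely for illustration. It shows the size of that matrix and then computes how many entries it would need for a training set the size of the one in the question, using the class counts from summary(training).

```r
library(randomForest)

set.seed(1)

# With proximity = TRUE the fitted object carries an n x n matrix.
rf_prox <- randomForest(Species ~ ., data = iris, ntree = 50, proximity = TRUE)
dim(rf_prox$proximity)   # one row and one column per training row

# Without it, no such matrix is stored.
rf_plain <- randomForest(Species ~ ., data = iris, ntree = 50)
is.null(rf_plain$proximity)

# Entries such a matrix would need for the question's training set,
# based on the class counts printed by summary(training).
n_train <- sum(c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607))
n_train^2
```

If the proximities are not needed, dropping prox = TRUE from the train call is a cheap thing to try before looking at other options such as a smaller ntree or fewer folds.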
eipi10 answered Nov 17 '22 08:11