
Variable importance using the caret package (error); RandomForest algorithm

I am trying to obtain the variable importance of a random forest (rf) model by any means available. This is the approach I have tried so far, but alternative suggestions are very welcome.

I have trained a model in R:

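library(caret)
library(randomForest)
# 5-fold cross-validation; note that 'repeats' only takes effect with method='repeatedcv'
myControl = trainControl(method='cv', number=5, repeats=2, returnResamp='none')
# Fit a random forest, tuning mtry over caret's default grid
model2 = train(increaseInAssessedLevel~., data=trainData, method='rf', trControl=myControl)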

The dataset is fairly large, but the model runs fine. I can access its parts and run commands such as:

> model2[3]
$results
  mtry      RMSE  Rsquared      RMSESD RsquaredSD
1    2 0.1901304 0.3342449 0.004586902 0.05089500
2   61 0.1080164 0.6984240 0.006195397 0.04428158
3  120 0.1084201 0.6954841 0.007119253 0.04362755

But I get the following error:

> varImp(model2)
Error in varImp[, "%IncMSE"] : subscript out of bounds

Apparently there is supposed to be a wrapper for randomForest objects (cf. http://www.inside-r.org/packages/cran/caret/docs/varImp), but calling it directly fails:

varImp.randomForest(model2)
Error: could not find function "varImp.randomForest"

But this is particularly odd:

> traceback()
No traceback available 

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] elasticnet_1.1     lars_1.2           klaR_0.6-9         MASS_7.3-26       
 [5] kernlab_0.9-18     nnet_7.3-6         randomForest_4.6-7 doMC_1.3.0        
 [9] iterators_1.0.6    caret_5.17-7       reshape2_1.2.2     plyr_1.8          
[13] lattice_0.20-15    foreach_1.4.1      cluster_1.14.4    

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_3.0.1  grid_3.0.1      stringr_0.6.2  
[5] tools_3.0.1  
Jakub Langr asked Sep 02 '13 17:09






2 Answers

The importance scores can take a while to compute and train won't automatically get randomForest to create them. Add importance = TRUE to the train call and it should work.
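As a minimal sketch of that suggestion: the built-in mtcars data and the names ctrl, fit and imp are stand-ins chosen for illustration, not the asker's data or objects, and the exact output may differ between caret versions. Extra arguments given to train() are passed through to randomForest(), which is how importance = TRUE reaches it.

```r
library(caret)
library(randomForest)

set.seed(1)  # reproducible folds and forest

# Plain 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# importance = TRUE is forwarded through train()'s ... to randomForest(),
# so the permutation-based scores (%IncMSE for regression) are computed
fit <- train(mpg ~ ., data = mtcars, method = "rf",
             trControl = ctrl, importance = TRUE)

# varImp() on the train object; scores are scaled to 0-100 by default
imp <- varImp(fit)
print(imp)
```

Calling plot(imp) should then draw the same scores as a dot plot.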

Max

topepo answered Sep 21 '22 12:09


That is because the object returned by train() is not a pure random forest model, but a list of different objects (containing the final model itself as well as the cross-validation results, etc.). You can see them with ls(model2) or names(model2). So to use the final model, just call varImp(model2$finalModel).
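A rough illustration of this approach, again assuming mtcars as stand-in data and hypothetical object names (fit, rf, node_purity, vi). It was written with a recent caret in mind, so behaviour on older versions such as the asker's may differ. Without importance = TRUE, a regression forest should only carry the node-purity measure.

```r
library(caret)
library(randomForest)

set.seed(1)  # reproducible folds and forest

fit <- train(mpg ~ ., data = mtcars, method = "rf",
             trControl = trainControl(method = "cv", number = 5))

# The train object is a list; finalModel holds the fitted randomForest
print(names(fit))
rf <- fit$finalModel
print(class(rf))

# randomForest's own accessor: a matrix with one row per predictor
node_purity <- importance(rf)
print(node_purity)

# caret's varImp() applied directly to the randomForest object
vi <- varImp(rf)
print(vi)
```

Here importance() comes from the randomForest package and returns the raw scores, while varImp() on the final model returns them as a data frame.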

O_Devinyak answered Sep 20 '22 12:09