I am trying to obtain the variable importance of a rf model in any way. This is the approach I have tried so far, but alternate suggestions are very welcome.
I have trained a model in R:
require(caret)
require(randomForest)
myControl = trainControl(method='cv',number=5,repeats=2,returnResamp='none')
model2 = train(increaseInAssessedLevel~., data=trainData, method = 'rf', trControl=myControl)
The dataset is fairly large, but the model runs fine. I can access its parts and run commands such as:
> model2[3]
$results
mtry RMSE Rsquared RMSESD RsquaredSD
1 2 0.1901304 0.3342449 0.004586902 0.05089500
2 61 0.1080164 0.6984240 0.006195397 0.04428158
3 120 0.1084201 0.6954841 0.007119253 0.04362755
But I get the following error:
> varImp(model2)
Error in varImp[, "%IncMSE"] : subscript out of bounds
Apparently there is supposed to be a wrapper, but that does not seem to be the case: (cf:http://www.inside-r.org/packages/cran/caret/docs/varImp)
varImp.randomForest(model2)
Error: could not find function "varImp.randomForest"
But this is particularly odd:
> traceback()
No traceback available
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] elasticnet_1.1 lars_1.2 klaR_0.6-9 MASS_7.3-26
[5] kernlab_0.9-18 nnet_7.3-6 randomForest_4.6-7 doMC_1.3.0
[9] iterators_1.0.6 caret_5.17-7 reshape2_1.2.2 plyr_1.8
[13] lattice_0.20-15 foreach_1.4.1 cluster_1.14.4
loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_3.0.1 grid_3.0.1 stringr_0.6.2
[5] tools_3.0.1
Partial Least Squares: the variable importance measure here is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components and are computed separately for each outcome.
Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected to split on during the tree building process, and how much the squared error (over all trees) improved (decreased) as a result.
The varImp function tracks the changes in model statistics, such as the GCV, for each predictor and accumulates the reduction in the statistic when each predictor's feature is added to the model. This total reduction is used as the variable importance measure.
The default method to compute variable importance is the mean decrease in impurity (or gini importance) mechanism: At each split in each tree, the improvement in the split-criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each ...
The importance scores can take a while to compute, and train won't automatically get randomForest to create them. Add importance = TRUE to the train call and it should work.
Max
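A minimal sketch of that suggestion, assuming the built-in mtcars data as a stand-in for trainData (the original dataset is not shown) and mpg as a hypothetical outcome. Extra arguments given to train are passed on to the underlying randomForest call, so importance = TRUE should reach it:

```r
library(caret)
library(randomForest)

set.seed(1)

# 5-fold cross-validation; mtcars is only a stand-in for the asker's trainData
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "none")

# importance = TRUE is forwarded by train() to randomForest()
fit <- train(mpg ~ ., data = mtcars, method = "rf",
             trControl = ctrl, importance = TRUE)

# varImp() on the train object now has importance scores to work with
imp <- varImp(fit)
print(imp)
```

The scores are also available as a data frame in imp$importance if you want to sort or plot them yourself.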
That is because the object returned by train() is not a pure Random Forest model, but a list of different objects (containing the final model itself as well as the cross-validation results etc.). You can see them with ls(model2). So to use the final model, just call varImp(model2$finalModel).
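To see why the "%IncMSE" column can be missing in the first place, here is a small sketch that fits randomForest directly. The two fits stand in for model2$finalModel trained without and with importance = TRUE; mtcars and mpg are assumptions used only for illustration:

```r
library(randomForest)

set.seed(1)

# Regression forests fitted without and with permutation importance
rf_default <- randomForest(mpg ~ ., data = mtcars)
rf_imp     <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# Without importance = TRUE only the node-purity measure is stored
print(colnames(importance(rf_default)))

# With importance = TRUE the permutation measure "%IncMSE" is stored as well
print(colnames(importance(rf_imp)))
```

If the final model was fitted without importance = TRUE, any code that indexes the "%IncMSE" column has nothing to find, which would be consistent with the "subscript out of bounds" error in the question.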