I do not understand the difference between the varImp function (caret package) and the importance function (randomForest package) for a random forest model: I computed a simple RF classification model and, when computing variable importance, I found that the "ranking" of predictors was not the same for both functions.
Here is my code:
rfImp <- randomForest(Origin ~ ., data = TAll_CS,
ntree = 2000,
importance = TRUE)
importance(rfImp)
BREAST LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3 -1.44116806 2.8918537 1.0929302 0.3712622
Contrast_GLCM_R1SC4NG3 -2.61146974 1.5848150 -0.4455327 0.2446930
Entropy_GLCM_R1SC4NG3 -3.42017102 3.8839464 0.9779201 0.4170445
...
varImp(rfImp)
BREAST LUNG
Energy_GLCM_R1SC4NG3 0.72534283 0.72534283
Contrast_GLCM_R1SC4NG3 -0.51332737 -0.51332737
Entropy_GLCM_R1SC4NG3 0.23188771 0.23188771
...
I thought they used the same "algorithm" but I am not sure now.
EDIT
In order to reproduce the problem, the ionosphere dataset (kknn package) can be used:
library(kknn)
data(ionosphere)
rfImp <- randomForest(class ~ ., data = ionosphere[,3:35],
ntree = 2000,
importance = TRUE)
importance(rfImp)
b g MeanDecreaseAccuracy MeanDecreaseGini
V3 21.3106205 42.23040 42.16524 15.770711
V4 10.9819574 28.55418 29.28955 6.431929
V5 30.8473944 44.99180 46.64411 22.868543
V6 11.1880372 33.01009 33.18346 6.999027
V7 13.3511887 32.22212 32.66688 14.100210
V8 11.8883317 32.41844 33.03005 7.243705
V9 -0.5020035 19.69505 19.54399 2.501567
V10 -2.9051578 22.24136 20.91442 2.953552
V11 -3.9585608 14.68528 14.11102 1.217768
V12 0.8254453 21.17199 20.75337 3.298964
...
varImp(rfImp)
b g
V3 31.770511 31.770511
V4 19.768070 19.768070
V5 37.919596 37.919596
V6 22.099063 22.099063
V7 22.786656 22.786656
V8 22.153388 22.153388
V9 9.596522 9.596522
V10 9.668101 9.668101
V11 5.363359 5.363359
V12 10.998718 10.998718
...
I think I am missing something...
EDIT 2
I figured out that if you take the mean of each row of the first two columns of importance(rfImp), you get the results of varImp(rfImp):
impRF <- importance(rfImp)[,1:2]
apply(impRF, 1, function(x) mean(x))
V3 V4 V5 V6 V7 V8 V9
31.770511 19.768070 37.919596 22.099063 22.786656 22.153388 9.596522
V10 V11 V12
9.668101 5.363359 10.998718 ...
# Same result as in both columns of varImp(rfImp)
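As an aside, the apply call above is just a row-wise mean, so rowMeans gives the same result. A quick check on a toy matrix (made-up numbers standing in for the two class columns of importance(rfImp)):

```r
# Toy stand-in for the first two columns of importance(rfImp):
# rows are predictors, columns are the per-class importance scores
impRF <- matrix(c(21.31, 42.23,
                  10.98, 28.55,
                  30.85, 44.99),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("V3", "V4", "V5"), c("b", "g")))

# apply(impRF, 1, mean) and rowMeans(impRF) compute the same thing
all.equal(apply(impRF, 1, mean), rowMeans(impRF))  # TRUE
```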
I do not know why this is happening, but there has to be an explanation for that.
randomForest::importance() aggregates the class-specific importance scores using a weighted mean before rescaling them by their "standard error", and reports the result as MeanDecreaseAccuracy. varImp() instead takes the (by default) already-scaled class-specific scores and averages them without weighting.
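A minimal base-R sketch of why the two aggregation orders give different numbers. All values here are invented for illustration, and the weighting scheme is simplified (randomForest actually accumulates per-tree, out-of-bag decreases per class); the point is only that "weight, then scale" and "scale, then average" are not the same pipeline:

```r
# Hypothetical per-class raw accuracy decreases for one predictor,
# with made-up class sizes (weights) and standard errors
raw <- c(b = 0.002, g = 0.010)   # raw mean decrease in accuracy per class
n   <- c(b = 126,   g = 225)     # class sizes, used here as weights
se  <- c(b = 0.0015, g = 0.0020) # standard errors of the per-class means

# importance()-style: weighted mean of the raw decreases first
overall_raw <- sum(raw * n) / sum(n)

# varImp()-style: scale each class score first, then unweighted mean
scaled       <- raw / se
varimp_style <- mean(scaled)

# The two pipelines disagree, which is why the rankings can differ
c(weighted_first = overall_raw, scaled_then_averaged = varimp_style)
```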
The default method to compute variable importance is the mean decrease in impurity (or Gini importance) mechanism: at each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable.
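That accumulation can be sketched in a few lines of base R. The split records below are invented; a real forest sums the actual impurity improvements achieved at every split:

```r
# Each row records one split somewhere in the forest: which variable
# was split on, and the decrease in Gini impurity that split achieved
splits <- data.frame(
  var      = c("V3", "V5", "V3", "V4", "V5", "V5"),
  decrease = c(0.12, 0.30, 0.08, 0.05, 0.22, 0.10)
)

# Gini-importance-style score: accumulate the decreases per variable
# (randomForest additionally normalizes by the number of trees)
tapply(splits$decrease, splits$var, sum)
```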
If we walk through the method for varImp:
Check the object:
> getFromNamespace('varImp','caret')
function (object, ...)
{
UseMethod("varImp")
}
Get the S3 Method:
> getS3method('varImp','randomForest')
function (object, ...)
{
code <- varImpDependencies("rf")
code$varImp(object, ...)
}
<environment: namespace:caret>
code <- caret:::varImpDependencies('rf')
> code$varImp
function(object, ...){
varImp <- randomForest::importance(object, ...)
if(object$type == "regression")
varImp <- data.frame(Overall = varImp[,"%IncMSE"])
else {
retainNames <- levels(object$y)
if(all(retainNames %in% colnames(varImp))) {
varImp <- varImp[, retainNames]
} else {
varImp <- data.frame(Overall = varImp[,1])
}
}
out <- as.data.frame(varImp)
if(dim(out)[2] == 2) {
tmp <- apply(out, 1, mean)
out[,1] <- out[,2] <- tmp
}
out
}
So this is not strictly returning randomForest::importance: it starts by computing that, but then keeps only the columns named after the class levels present in the data.
Then it does something interesting, it checks if we only have two columns:
if(dim(out)[2] == 2) {
tmp <- apply(out, 1, mean)
out[,1] <- out[,2] <- tmp
}
According to the varImp man page:
Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.
This is clearly not the case.
As to why...
If we have only two columns, the importance of the variable as a predictor can be represented as one value: if the variable is a predictor of g, then it must also be a predictor of b.
That does make sense, but it doesn't match the documentation of what the function does, so I would report this as unexpected behavior. The function appears to be trying to help in the case where you would otherwise compute that relative value yourself.