 

Difference between varImp (caret) and importance (randomForest) for Random Forest

I do not understand what the difference is between the varImp function (caret package) and the importance function (randomForest package) for a random forest model.

I computed a simple RF classification model, and when computing the variable importance I found that the "ranking" of the predictors was not the same for the two functions.

Here is my code:

library(randomForest)
library(caret)

rfImp <- randomForest(Origin ~ ., data = TAll_CS,
                      ntree = 2000,
                      importance = TRUE)

importance(rfImp)

                                 BREAST       LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3        -1.44116806  2.8918537            1.0929302        0.3712622
Contrast_GLCM_R1SC4NG3      -2.61146974  1.5848150           -0.4455327        0.2446930
Entropy_GLCM_R1SC4NG3       -3.42017102  3.8839464            0.9779201        0.4170445
...

varImp(rfImp)
                                 BREAST        LUNG
Energy_GLCM_R1SC4NG3         0.72534283  0.72534283
Contrast_GLCM_R1SC4NG3      -0.51332737 -0.51332737
Entropy_GLCM_R1SC4NG3        0.23188771  0.23188771
...

I thought they used the same "algorithm" but I am not sure now.

EDIT

In order to reproduce the problem, the ionosphere dataset (kknn package) can be used:

library(kknn)
library(randomForest)
library(caret)

data(ionosphere)
rfImp <- randomForest(class ~ ., data = ionosphere[,3:35],
                      ntree = 2000,
                      importance = TRUE)
importance(rfImp)
             b        g MeanDecreaseAccuracy MeanDecreaseGini
V3  21.3106205 42.23040             42.16524        15.770711
V4  10.9819574 28.55418             29.28955         6.431929
V5  30.8473944 44.99180             46.64411        22.868543
V6  11.1880372 33.01009             33.18346         6.999027
V7  13.3511887 32.22212             32.66688        14.100210
V8  11.8883317 32.41844             33.03005         7.243705
V9  -0.5020035 19.69505             19.54399         2.501567
V10 -2.9051578 22.24136             20.91442         2.953552
V11 -3.9585608 14.68528             14.11102         1.217768
V12  0.8254453 21.17199             20.75337         3.298964
...

varImp(rfImp)
            b         g
V3  31.770511 31.770511
V4  19.768070 19.768070
V5  37.919596 37.919596
V6  22.099063 22.099063
V7  22.786656 22.786656
V8  22.153388 22.153388
V9   9.596522  9.596522
V10  9.668101  9.668101
V11  5.363359  5.363359
V12 10.998718 10.998718
...

I think I am missing something...

EDIT 2

I figured out that if you take the mean of each row of the first two columns of importance(rfImp), you get the results of varImp(rfImp):

impRF <- importance(rfImp)[,1:2]
apply(impRF, 1, mean)
       V3        V4        V5        V6        V7        V8        V9 
31.770511 19.768070 37.919596 22.099063 22.786656 22.153388  9.596522 
      V10       V11       V12 
 9.668101  5.363359 10.998718     ...

# Same result as in both columns of varImp(rfImp)

I do not know why this is happening, but there has to be an explanation for that.

Asked by Rafa OR on Jun 17 '16




1 Answer

If we walk through the method for varImp:

Check the generic:

> getFromNamespace('varImp','caret')
function (object, ...) 
{
    UseMethod("varImp")
}

Get the S3 Method:

> getS3method('varImp','randomForest')
function (object, ...) 
{
    code <- varImpDependencies("rf")
    code$varImp(object, ...)
}
<environment: namespace:caret>


Pull out the code that caret runs for the "rf" model:

code <- caret:::varImpDependencies('rf')

> code$varImp
function(object, ...){
                    varImp <- randomForest::importance(object, ...)
                    if(object$type == "regression")
                      varImp <- data.frame(Overall = varImp[,"%IncMSE"])
                    else {
                      retainNames <- levels(object$y)
                      if(all(retainNames %in% colnames(varImp))) {
                        varImp <- varImp[, retainNames]
                      } else {
                        varImp <- data.frame(Overall = varImp[,1])
                      }
                    }

                    out <- as.data.frame(varImp)
                    if(dim(out)[2] == 2) {
                      tmp <- apply(out, 1, mean)
                      out[,1] <- out[,2] <- tmp  
                    }
                    out
                  }

So varImp is not simply returning the output of randomForest::importance.

It starts by computing that, but then keeps only the columns that correspond to the levels of the outcome (here b and g), dropping the MeanDecreaseAccuracy and MeanDecreaseGini columns.
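For the two-class ionosphere model from the question, that column selection amounts to the following (a small sketch, not from the original answer, assuming the rfImp model fitted above is in the workspace):

imp <- importance(rfImp)
imp[, levels(rfImp$y)]   # keep only the "b" and "g" columns, using the same levels(object$y) trick as caret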

Then it does something interesting: it checks whether exactly two columns are left, and if so it overwrites both with their row mean:

if(dim(out)[2] == 2) {
   tmp <- apply(out, 1, mean)
   out[,1] <- out[,2] <- tmp  
}
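In other words, the row-averaging only kicks in for two-class problems; with three or more classes the per-class columns are returned untouched. A minimal sketch to illustrate, using the built-in iris data (not part of the original post):

library(randomForest)
library(caret)

set.seed(1)  # importance values vary slightly from run to run
rfIris <- randomForest(Species ~ ., data = iris,
                       ntree = 500, importance = TRUE)

importance(rfIris)  # per-class columns plus MeanDecreaseAccuracy and MeanDecreaseGini
varImp(rfIris)      # three per-class columns are kept, no row-averaging this time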

According to the varImp man page:

Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.

This is clearly not the case.


As to why...

If there are only two classes, the importance of a variable as a predictor can be represented by a single value: if a variable helps predict g, it necessarily also helps predict b.

That does make sense, but it does not match the documentation of what the function does, so I would probably report it as unexpected behavior. The function is trying to help by doing an averaging step that you might have expected to do yourself.
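If what you actually want is a single overall score per predictor computed on the randomForest side, rather than caret's unweighted average of the per-class columns, you can request it from importance() directly. A hedged sketch, assuming the rfImp model fitted above:

importance(rfImp, type = 1)                 # scaled permutation importance (MeanDecreaseAccuracy)
importance(rfImp, type = 1, scale = FALSE)  # raw permutation importance, not divided by its standard errors
importance(rfImp, type = 2)                 # MeanDecreaseGini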

Answered by Shape on Oct 05 '22