 

Obtain importance of individual trees in a RandomForest

Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object?

rf_mod$forest doesn't seem to have this information, and the docs don't mention it.


In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).

library(randomForest)

df <- mtcars

set.seed(1)
rf_mod = randomForest(mpg ~ ., 
                      data = df, 
                      importance = TRUE, 
                      ntree = 200)

importance(rf_mod)

       %IncMSE IncNodePurity
cyl  6.0927875     111.65028
disp 8.7730959     261.06991
hp   7.8329831     212.74916
drat 2.9529334      79.01387
wt   7.9015687     246.32633
qsec 0.7741212      26.30662
vs   1.6908975      31.95701
am   2.5298261      13.33669
gear 1.5512788      17.77610
carb 3.2346351      35.69909
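
As an aside, the randomForest package also ships a built-in dot chart of these forest-level scores:

# plot both importance measures for the whole forest
varImpPlot(rf_mod)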

We can also extract individual tree structure with getTree. Here's the first tree.

head(getTree(rf_mod, k = 1, labelVar = TRUE))
  left daughter right daughter split var split point status prediction
1             2              3        wt        2.15     -3   18.91875
2             0              0      <NA>        0.00     -1   31.56667
3             4              5        wt        3.16     -3   17.61034
4             6              7      drat        3.66     -3   21.26667
5             8              9      carb        3.50     -3   15.96500
6             0              0      <NA>        0.00     -1   19.70000
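
To pull out every tree at once, the same call can be mapped over the whole forest (a small sketch; all_trees is just an illustrative name):

# structure of each tree in the forest
library(purrr)
all_trees <- map(seq_len(rf_mod$ntree),
                 ~ getTree(rf_mod, k = .x, labelVar = TRUE))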

One workaround is to grow many CARTs (i.e., forests with ntree = 1), get the variable importance of each tree, and average the resulting %IncMSE values:

# number of trees to grow
nn <- 200

# function to fit a single CART model (ntree = 1) for a given seed
run_rf <- function(rand_seed){
  set.seed(rand_seed)
  one_tr = randomForest(mpg ~ ., 
                        data = df, 
                        importance = TRUE, 
                        ntree = 1)
  return(one_tr)
}

# list storing the output of each model
l <- lapply(1:nn, run_rf)

The extraction, averaging, and comparison step.

# extract importance of each CART model 
library(dplyr); library(purrr)
map(l, importance) %>% 
  map(as.data.frame) %>% 
  map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>% 
  bind_rows() %>% 
  group_by(var) %>% 
  summarise(`%IncMSE` = mean(`%IncMSE`)) %>% 
  arrange(-`%IncMSE`)

    # A tibble: 10 x 2
   var   `%IncMSE`
   <chr>     <dbl>
 1 wt        8.52 
 2 cyl       7.75 
 3 disp      7.74 
 4 hp        5.53 
 5 drat      1.65 
 6 carb      1.52 
 7 vs        0.938
 8 qsec      0.824
 9 gear      0.495
10 am        0.355

# compare to the RF model above
importance(rf_mod)

       %IncMSE IncNodePurity
cyl  6.0927875     111.65028
disp 8.7730959     261.06991
hp   7.8329831     212.74916
drat 2.9529334      79.01387
wt   7.9015687     246.32633
qsec 0.7741212      26.30662
vs   1.6908975      31.95701
am   2.5298261      13.33669
gear 1.5512788      17.77610
carb 3.2346351      35.69909

I'd like to be able to extract the variable importance of each tree directly from a randomForest object, without this roundabout method that completely re-runs the RF, so that I can build reproducible cumulative variable importance plots like this one, and the one shown below for mtcars. Minimal example here.

I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for visualization, to communicate that as the number of trees in a forest increases, the variable importance measures jump around before stabilizing.

[Image: cumulative variable importance plot]
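
For illustration, here is one way such a cumulative plot could be built from the list l of single-tree models above (a sketch; the running-mean construction and plotting choices are my own):

# running mean of each variable's %IncMSE as single trees accumulate
library(dplyr); library(purrr); library(ggplot2)

imp_df <- map_dfr(l, ~ as.data.frame(importance(.x)) %>% 
                    tibble::rownames_to_column("var"),
                  .id = "tree") %>% 
  mutate(tree = as.integer(tree)) %>% 
  group_by(var) %>% 
  arrange(tree, .by_group = TRUE) %>% 
  mutate(cum_imp = cummean(`%IncMSE`))

ggplot(imp_df, aes(tree, cum_imp, colour = var)) +
  geom_line() +
  labs(x = "Number of trees", y = "Running mean %IncMSE")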

asked May 04 '19 by Rich Pauloo


3 Answers

When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.

Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:

## Starting with a random forest having a single tree,
##   grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
                          rep(1,9), randomForest::grow )

## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )

## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
#      cyl  disp    hp  drat    wt   qsec    vs     am    gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl>
#  1 0      18.8  8.63 1.05   0     1.17  0     0       0      0.194
#  2 0      10.0 46.4  0.561  0    -0.299 0     0       0.543  2.05 
#  3 0      22.4 31.2  0.955  0    -0.199 0     0       0.362  5.1
#  4 1.55   24.1 23.4  0.717  0    -0.150 0     0       0.272  5.28
#  5 1.24   22.8 23.6  0.573  0    -0.178 0     0      -0.0259 4.98
#  6 1.03   26.2 22.3  0.478  1.25  0.775 0     0      -0.0216 4.1
#  7 0.887  22.5 22.5  0.406  1.79 -0.101 0     0      -0.0185 3.56
#  8 0.776  19.7 21.3  0.944  1.70  0.105 0     0.0225 -0.0162 3.11
#  9 0.690  18.4 19.1  0.839  1.51  1.24  1.01  0.02   -0.0144 2.77
# 10 0.621  18.4 21.2  0.937  1.32  1.11  0.910 0.0725 -0.114  2.49

The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
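
For instance, the tree structures of that final forest can be listed with getTree (a small sketch; final_rf and trees are illustrative names):

# the fully grown forest is the last element of the accumulated list
final_rf <- dplyr::last(rfs)

# structure of every tree it contains
trees <- purrr::map(seq_len(final_rf$ntree),
                    ~ getTree(final_rf, k = .x, labelVar = TRUE))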

answered Oct 27 '22 by Artem Sokolov

Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.

While I (think I) understand your question, to be honest I am unsure whether your question makes sense from a statistics/ML point-of-view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.

Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):

At each split in each tree, the improvement in the split-criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable. [...] Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.

So the variable importance measure in RF is defined as a measure accumulated over all trees.
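
In symbols (my notation, not the book's): for a forest of $T$ trees, the split-improvement importance of a variable $X_j$ is

$$\mathrm{Imp}(X_j) \;=\; \frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{n \,\in\, t \\ n \text{ splits on } X_j}} \Delta i(n),$$

where $\Delta i(n)$ is the improvement in the split criterion (e.g. the decrease in Gini impurity or MSE) at node $n$, and the per-tree sums are averaged over the forest, as randomForest reports them.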


In traditional single classification trees (CARTs), variable importance is characterised by the Gini index, which measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis).

More complex measures to characterise variable importance in CART-like models exist; for example in rpart:

An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness * (adjusted agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum to 100 and the rounded values are shown, omitting any variable whose proportion is less than 1%.
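
For example, a single rpart tree exposes this overall measure directly:

library(rpart)

# goodness-of-split based importance of one CART
cart <- rpart(mpg ~ ., data = mtcars)
cart$variable.importance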


So the bottom line here is the following: at the very least, it won't be easy (and in the worst case it won't make sense) to compare variable importance measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.

Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.

answered Oct 27 '22 by Maurits Evers

We can simplify the workaround with purrr::map and reduce:

library(tidyverse)
out <- map(seq_len(nn),  ~ 
          run_rf(.x) %>%        # fit each single-tree model
          importance) %>%       # and extract its importance matrix
       reduce(`+`) %>%          # sum the nn matrices element-wise
       magrittr::divide_by(nn)  # then average them
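
out is then a single matrix holding the averaged %IncMSE and IncNodePurity columns; for example, it can be ordered to match the tibble above:

# sort the averaged matrix by %IncMSE, descending
out[order(-out[, "%IncMSE"]), ]
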
answered Oct 27 '22 by akrun