 

Obtain importance of individual trees in a RandomForest

Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object?

rf_mod$forest doesn't seem to have this information, and the docs don't mention it.


In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).

library(randomForest)

df <- mtcars

set.seed(1)
rf_mod = randomForest(mpg ~ ., 
                      data = df, 
                      importance = TRUE, 
                      ntree = 200)

importance(rf_mod)

       %IncMSE IncNodePurity
cyl  6.0927875     111.65028
disp 8.7730959     261.06991
hp   7.8329831     212.74916
drat 2.9529334      79.01387
wt   7.9015687     246.32633
qsec 0.7741212      26.30662
vs   1.6908975      31.95701
am   2.5298261      13.33669
gear 1.5512788      17.77610
carb 3.2346351      35.69909
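
As an aside, the randomForest package also ships a built-in dot chart of these forest-level scores:

# plot both importance measures for the whole forest
varImpPlot(rf_mod)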

We can also extract individual tree structure with getTree. Here's the first tree.

head(getTree(rf_mod, k = 1, labelVar = TRUE))
  left daughter right daughter split var split point status prediction
1             2              3        wt        2.15     -3   18.91875
2             0              0      <NA>        0.00     -1   31.56667
3             4              5        wt        3.16     -3   17.61034
4             6              7      drat        3.66     -3   21.26667
5             8              9      carb        3.50     -3   15.96500
6             0              0      <NA>        0.00     -1   19.70000
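
To pull out every tree at once, the same call can be mapped over the whole forest (a small sketch; all_trees is just an illustrative name):

# structure of each tree in the forest
library(purrr)
all_trees <- map(seq_len(rf_mod$ntree),
                 ~ getTree(rf_mod, k = .x, labelVar = TRUE))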

One workaround is to grow many CARTs (i.e., forests with ntree = 1), get the variable importance of each tree, and average the resulting %IncMSE values:

# number of trees to grow
nn <- 200

# function to fit a single CART model (ntree = 1) for a given seed
run_rf <- function(rand_seed){
  set.seed(rand_seed)
  one_tr = randomForest(mpg ~ ., 
                        data = df, 
                        importance = TRUE, 
                        ntree = 1)
  return(one_tr)
}

# list storing the output of each model
l <- lapply(1:nn, run_rf)

The extraction, averaging, and comparison step.

# extract importance of each CART model 
library(dplyr); library(purrr)
map(l, importance) %>% 
  map(as.data.frame) %>% 
  map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>% 
  bind_rows() %>% 
  group_by(var) %>% 
  summarise(`%IncMSE` = mean(`%IncMSE`)) %>% 
  arrange(-`%IncMSE`)

    # A tibble: 10 x 2
   var   `%IncMSE`
   <chr>     <dbl>
 1 wt        8.52 
 2 cyl       7.75 
 3 disp      7.74 
 4 hp        5.53 
 5 drat      1.65 
 6 carb      1.52 
 7 vs        0.938
 8 qsec      0.824
 9 gear      0.495
10 am        0.355

# compare to the RF model above
importance(rf_mod)

       %IncMSE IncNodePurity
cyl  6.0927875     111.65028
disp 8.7730959     261.06991
hp   7.8329831     212.74916
drat 2.9529334      79.01387
wt   7.9015687     246.32633
qsec 0.7741212      26.30662
vs   1.6908975      31.95701
am   2.5298261      13.33669
gear 1.5512788      17.77610
carb 3.2346351      35.69909

I'd like to be able to extract the variable importance of each tree directly from a randomForest object, without this roundabout method that completely re-runs the RF, so that I can build reproducible cumulative variable importance plots like this one, and the one shown below for mtcars. Minimal example here.

I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for visualization, to communicate that as the number of trees in a forest increases, the variable importance measures jump around before stabilizing.

[Image: cumulative variable importance plot]
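
For illustration, here is one way such a cumulative plot could be built from the list l of single-tree models above (a sketch; the running-mean construction and plotting choices are my own):

# running mean of each variable's %IncMSE as single trees accumulate
library(dplyr); library(purrr); library(ggplot2)

imp_df <- map_dfr(l, ~ as.data.frame(importance(.x)) %>% 
                    tibble::rownames_to_column("var"),
                  .id = "tree") %>% 
  mutate(tree = as.integer(tree)) %>% 
  group_by(var) %>% 
  arrange(tree, .by_group = TRUE) %>% 
  mutate(cum_imp = cummean(`%IncMSE`))

ggplot(imp_df, aes(tree, cum_imp, colour = var)) +
  geom_line() +
  labs(x = "Number of trees", y = "Running mean %IncMSE")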

asked May 04 '19 by Rich Pauloo


3 Answers

When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.

Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:

## Starting with a random forest having a single tree,
##   grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
                          rep(1,9), randomForest::grow )

## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )

## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
#      cyl  disp    hp  drat    wt   qsec    vs     am    gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl>
#  1 0      18.8  8.63 1.05   0     1.17  0     0       0      0.194
#  2 0      10.0 46.4  0.561  0    -0.299 0     0       0.543  2.05 
#  3 0      22.4 31.2  0.955  0    -0.199 0     0       0.362  5.1
#  4 1.55   24.1 23.4  0.717  0    -0.150 0     0       0.272  5.28
#  5 1.24   22.8 23.6  0.573  0    -0.178 0     0      -0.0259 4.98
#  6 1.03   26.2 22.3  0.478  1.25  0.775 0     0      -0.0216 4.1
#  7 0.887  22.5 22.5  0.406  1.79 -0.101 0     0      -0.0185 3.56
#  8 0.776  19.7 21.3  0.944  1.70  0.105 0     0.0225 -0.0162 3.11
#  9 0.690  18.4 19.1  0.839  1.51  1.24  1.01  0.02   -0.0144 2.77
# 10 0.621  18.4 21.2  0.937  1.32  1.11  0.910 0.0725 -0.114  2.49

The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
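
For instance, the tree structures of that final forest can be listed with getTree (a small sketch; final_rf and trees are illustrative names):

# the fully grown forest is the last element of the accumulated list
final_rf <- dplyr::last(rfs)

# structure of every tree it contains
trees <- purrr::map(seq_len(final_rf$ntree),
                    ~ getTree(final_rf, k = .x, labelVar = TRUE))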

answered Oct 27 '22 by Artem Sokolov

Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.

While I (think I) understand your question, to be honest I am unsure whether your question makes sense from a statistics/ML point-of-view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.

Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):

At each split in each tree, the improvement in the split-criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable. [...] Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.

So the variable importance measure in RF is defined as a measure accumulated over all trees.
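
In symbols (my notation, not the book's): for a forest of $T$ trees, the split-improvement importance of a variable $X_j$ is

$$\mathrm{Imp}(X_j) \;=\; \frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{n \,\in\, t \\ n \text{ splits on } X_j}} \Delta i(n),$$

where $\Delta i(n)$ is the improvement in the split criterion (e.g. the decrease in Gini impurity or MSE) at node $n$, and the per-tree sums are averaged over the forest, as randomForest reports them.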


In traditional single classification trees (CARTs), variable importance is characterised by the Gini index, which measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis).

More complex measures to characterise variable importance in CART-like models exist; for example in rpart:

An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness * (adjusted agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum to 100 and the rounded values are shown, omitting any variable whose proportion is less than 1%.
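
For example, a single rpart tree exposes this overall measure directly:

library(rpart)

# goodness-of-split based importance of one CART
cart <- rpart(mpg ~ ., data = mtcars)
cart$variable.importance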


So the bottom line here is the following: at the very least, it won't be easy (and in the worst case it won't make sense) to compare variable importance measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.

Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.

answered Oct 27 '22 by Maurits Evers

We can simplify the workaround with purrr::map and reduce:

library(tidyverse)
out <- map(seq_len(nn),  ~ 
          run_rf(.x) %>%        # fit each single-tree model
          importance) %>%       # and extract its importance matrix
       reduce(`+`) %>%          # sum the nn matrices element-wise
       magrittr::divide_by(nn)  # then average them
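
out is then a single matrix holding the averaged %IncMSE and IncNodePurity columns; for example, it can be ordered to match the tibble above:

# sort the averaged matrix by %IncMSE, descending
out[order(-out[, "%IncMSE"]), ]
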
answered Oct 27 '22 by akrun