I have the following code, which basically tries to predict Species from the iris data using randomForest. What I'm really interested in is finding which features (variables) best explain the species classification. I found that the package randomForestExplainer serves this purpose well.
library(randomForest)
library(randomForestExplainer)
# localImp = TRUE stores the local importance values that randomForestExplainer needs
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)
# compute the importance measures (mean_min_depth, times_a_root, no_of_nodes, ...)
importance_frame <- randomForestExplainer::measure_importance(forest)
# by default this plots mean_min_depth (x) against times_a_root (y); point size shows no_of_nodes
randomForestExplainer::plot_multi_way_importance(importance_frame, size_measure = "no_of_nodes")
Running this code produces the following plot:
Based on the plot, the key measures that explain why Petal.Length and Petal.Width are the best predictors are these (the descriptions are taken from the vignette):

mean_min_depth – mean minimal depth, calculated in one of three ways specified by the parameter mean_sample

times_a_root – total number of trees in which Xj is used for splitting the root node (i.e., the whole sample is divided into two based on the value of Xj)

no_of_nodes – total number of nodes that use Xj for splitting (it is usually equal to no_of_trees if trees are shallow)

It's not entirely clear to me why a high times_a_root and a high no_of_nodes are better, and why a low mean_min_depth is better.
What is the intuitive explanation for that?
The vignette doesn't help me here.
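For reference, the raw numbers behind the plot can be printed straight from importance_frame (a quick sketch, assuming the column names returned by measure_importance match the measures discussed above; the exact values differ between runs because the forest is grown on random bootstrap samples):

# inspect the three measures for each predictor
importance_frame[, c("variable", "mean_min_depth", "times_a_root", "no_of_nodes")]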
You would like a statistical model or measure to strike a balance between "power" and "parsimony". The randomForest is designed internally to do penalization as its statistical strategy for achieving parsimony. Furthermore, the number of variables selected in any given sample will be less than the total number of predictors. This allows model building even when the number of predictors exceeds the number of cases (rows) in the dataset. Early splitting or classification rules can be applied relatively easily, but subsequent splits become increasingly difficult to meet the criteria of validity. "Power" is the ability to correctly classify items that were not in the subsample, for which the so-called OOB or "out-of-bag" items are used as a proxy. The randomForest strategy is to do this many times to build up a representative set of rules that classify items, under the assumption that the out-of-bag samples will be a fair representation of the "universe" from which the whole dataset arose.
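As a quick illustration of the out-of-bag idea, you can look at the OOB components stored on the fitted object (a sketch using components documented in ?randomForest; exact numbers vary from run to run):

# the printed object reports the OOB estimate of the error rate
print(forest)
# err.rate holds the cumulative OOB (and per-class) error after each additional tree
tail(forest$err.rate)
# oob.times counts how often each observation was left out of a bootstrap sample
head(forest$oob.times)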
The times_a_root measure would fall into the category of measuring the "relative power" of a variable compared to its "competitors". The times_a_root statistic counts the number of times a variable is "at the top" of a decision tree, i.e., how likely it is to be chosen first in the process of selecting split criteria. The no_of_nodes statistic counts the number of times the variable is chosen at all as a splitting criterion, across all of the subsampled trees.
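You can see this ordering directly by sorting the importance frame on those columns (a sketch; column names are as returned by measure_importance):

# rank the predictors by how often they provide the root split
importance_frame[order(-importance_frame$times_a_root),
                 c("variable", "times_a_root", "no_of_nodes")]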
From:
?randomForest # to find the names of the components of the fitted object
forest$ntree
[1] 500
... we can get a denominator for assessing the meaning of the roughly 200 values on the y-axis of the plot. About 2/5ths of the sampled trees had Petal.Length as the top split criterion, while another 2/5ths had Petal.Width as the top variable, i.e., selected as the most important variable. About 75 of the 500 trees had Sepal.Length at the root, while only about 8 or 9 had Sepal.Width (remember it's a log scale). In the case of the iris dataset, each split considers only a random subset of the variables (mtry), so a given variable is not even a candidate at the root of every tree, and the maximum achievable times_a_root is effectively less than 500. Scores of around 200 are pretty good in this situation, and we can see that both of these variables have comparable explanatory ability.
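To make that denominator explicit, you can turn the root counts into proportions of the forest (a sketch; the counts are stochastic, so expect values near, not exactly at, 2/5):

# fraction of the forest's 500 trees in which each variable provided the root split
setNames(round(importance_frame$times_a_root / forest$ntree, 2),
         importance_frame$variable)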
The no_of_nodes statistic totals up the number of nodes, across all of the trees, at which that variable was used for splitting, remembering that the number of nodes in any one tree is constrained by the penalization rules.
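If you want to see where that count comes from at the level of a single tree, you can tabulate the split variables in one of the forest's trees (a sketch using randomForest::getTree; tree 1 here is arbitrary and its splits will differ between runs):

# pull out the first tree with variable names attached to each node
tree1 <- randomForest::getTree(forest, k = 1, labelVar = TRUE)
# count how many nodes in this tree split on each variable;
# summing such counts over all trees is what no_of_nodes reports
table(tree1[, "split var"])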