I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm provides.
For example:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]

clf = DecisionTreeClassifier()
clf.fit(X, y)

feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))

export_graphviz(clf, out_file='test/tree.dot')
results in feature importance:
feat importance = [0.25 0.08333333 0.04166667]
and gives the following decision tree:
Now, this answer to a similar question suggests the importance is calculated as
where G is the node impurity, in this case the Gini impurity. This is the impurity reduction, as far as I understand it. However, for feature 1 this should be:
This answer suggests the importance is weighted by the probability of reaching the node (which is approximated by the proportion of samples reaching that node). Again, for feature 1 this should be:
Both formulas provide the wrong result. How is the feature importance calculated correctly?
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is computed for each node in the tree simply by dividing the number of samples that reach the node by the total number of observations in the dataset (4 in the example above). The higher the value, the more important the feature.
scikit-learn implements this CART-style feature importance in its DecisionTreeRegressor and DecisionTreeClassifier classes. After being fit, the model exposes a feature_importances_ property that can be accessed to retrieve the relative importance score of each input feature.
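As a minimal sketch using the toy data from the question, accessing feature_importances_ looks like this (random_state=0 is an assumption added here only to make the tie-breaking between equally good splits reproducible):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Normalized importance scores, one per input feature; they sum to 1
print(clf.feature_importances_)
```

Note that feature_importances_ is the normalized version of the values computed by clf.tree_.compute_feature_importances(normalize=False) in the question.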
Variable importance calculation (GBM & DRF): variable importance is determined by calculating the relative influence of each variable, i.e. whether that variable was selected to split on during the tree-building process, and how much the squared error (over all trees) improved (decreased) as a result.
I think feature importance depends on the implementation so we need to look at the documentation of scikit-learn.
The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
That reduction, or weighted impurity decrease, is defined as follows:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Since each feature is used only once in your case, the feature importance must be equal to the equation above.
For X[2] :
feature_importance = (4 / 4) * (0.375 - (0.75 * 0.444)) = 0.042
For X[1] :
feature_importance = (3 / 4) * (0.444 - (2/3 * 0.5)) = 0.083
For X[0] :
feature_importance = (2 / 4) * (0.5) = 0.25
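The three hand calculations above can be verified with plain Python. The impurity values and sample counts are the ones read off the tree in the question; weighted_decrease is just a hypothetical helper wrapping the scikit-learn equation:

```python
def weighted_decrease(N, N_t, impurity, N_t_L, left_imp, N_t_R, right_imp):
    # N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
    return N_t / N * (impurity - N_t_R / N_t * right_imp - N_t_L / N_t * left_imp)

N = 4  # total number of samples

# X[2]: split at the root (4 samples, gini 0.375) into 3 samples @ 0.444 and 1 sample @ 0.0
print(round(weighted_decrease(N, 4, 0.375, 3, 0.444, 1, 0.0), 3))  # 0.042

# X[1]: split of 3 samples @ 0.444 into 2 samples @ 0.5 and 1 sample @ 0.0
print(round(weighted_decrease(N, 3, 0.444, 2, 0.5, 1, 0.0), 3))    # 0.083

# X[0]: split of 2 samples @ 0.5 into two pure leaves
print(round(weighted_decrease(N, 2, 0.5, 1, 0.0, 1, 0.0), 3))      # 0.25
```

The three results match the output [0.25 0.08333333 0.04166667] from the question up to the truncation of the impurity values.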
A single feature can be used in different branches of the tree; the feature importance is then its total contribution to reducing the impurity:
feature_importance += number_of_samples_at_parent_where_feature_is_used * impurity_at_parent
                      - left_child_samples * impurity_left
                      - right_child_samples * impurity_right

where impurity is the Gini/entropy value, and

normalized_importance = feature_importance / number_of_samples_at_root_node (total number of samples)
In the above example:

feature_2_importance = 0.375*4 - 0.444*3 - 0*1 = 0.168, normalized = 0.168/4 (total number of samples) = 0.042
If feature_2 was used in other branches, calculate its importance at each such parent node and sum up the values.
There is a difference between the feature importance calculated here and the values returned by the library, because we are using the truncated impurity values shown in the graph.
Instead, we can access all the required data through the tree_ attribute of the classifier, which exposes the features used, threshold values, impurity, number of samples at each node, etc.

For example, clf.tree_.feature gives the feature used at each node; a negative value indicates a leaf node. Similarly, clf.tree_.children_left and clf.tree_.children_right give, for each node, the indices of its left and right children in those same arrays. Using these, traverse the tree and use the same indices in clf.tree_.impurity and clf.tree_.weighted_n_node_samples to get the Gini/entropy value and the number of samples at each node and at its children.
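As a quick illustration with the question's toy data (random_state=0 is an assumption added for reproducibility), these arrays can be inspected node by node:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
for idx in range(t.node_count):
    if t.feature[idx] >= 0:  # internal node: splits on a feature
        print(f"node {idx}: X[{t.feature[idx]}] <= {t.threshold[idx]:.2f}, "
              f"gini={t.impurity[idx]:.3f}, "
              f"samples={t.weighted_n_node_samples[idx]:.0f}, "
              f"children=({t.children_left[idx]}, {t.children_right[idx]})")
    else:                    # leaf node: feature is negative
        print(f"node {idx}: leaf, gini={t.impurity[idx]:.3f}, "
              f"samples={t.weighted_n_node_samples[idx]:.0f}")
```

For this data the tree has 3 internal nodes and 4 leaves, and the root impurity is the Gini of y = [1, 0, 1, 1], i.e. 0.375.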
import numpy as np

def dt_feature_importance(model, normalize=True):
    left_c = model.tree_.children_left
    right_c = model.tree_.children_right
    impurity = model.tree_.impurity
    node_samples = model.tree_.weighted_n_node_samples

    # Initialize the feature importances; features that are never used remain zero
    feature_importance = np.zeros((model.tree_.n_features,))

    for idx, node in enumerate(model.tree_.feature):
        if node >= 0:
            # Accumulate the importance over all the nodes where the feature is used
            feature_importance[node] += impurity[idx] * node_samples[idx] \
                - impurity[left_c[idx]] * node_samples[left_c[idx]] \
                - impurity[right_c[idx]] * node_samples[right_c[idx]]

    # Divide by the number of samples at the root node
    feature_importance /= node_samples[0]

    if normalize:
        normalizer = feature_importance.sum()
        if normalizer > 0:
            feature_importance /= normalizer

    return feature_importance
This function will return the exact same values as returned by clf.tree_.compute_feature_importances(normalize=...)
To sort the features based on their importance:
features = clf.tree_.feature[clf.tree_.feature >= 0]  # negative values indicate leaf nodes
sorted(zip(features, dt_feature_importance(clf, False)[features]), key=lambda x: x[1], reverse=True)