 

sklearn min_impurity_decrease explanation

The definition of min_impurity_decrease in sklearn is:

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

Using the Iris dataset and setting min_impurity_decrease = 0.0, we obtain this tree:

[Figure: How the tree looks when min_impurity_decrease = 0.0]

Setting min_impurity_decrease = 0.1, we obtain this:

[Figure: How the tree looks when min_impurity_decrease = 0.1]

Looking at the green square where the gini index (impurity) = 0.2041, why was it not split when we set min_impurity_decrease = 0.1, although the left child's gini index (impurity) = 0.0 and the right child's gini index (impurity) = 0.375?

Does this mean that all child nodes are pruned whenever, after pruning, their parent node's gini index becomes less than 0.1? Because, if this is the case, then why did we not prune the second-level node with gini = 0.487, which is bigger than 0.1?

Stev Allen asked Feb 21 '19 16:02



1 Answer

Steve, this reply is late, but I am posting it here in case others run into this problem and would like to know more about min_impurity_decrease.

The formula behind min_impurity_decrease is given in the scikit-learn documentation for the decision tree estimators. The weighted impurity decrease is defined as:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

Therefore, in your example:

N_t = 26
N = 90
N_t_R = 4
N_t_L = 22
impurity = 0.2041
right_impurity = 0.375
left_impurity = 0

Plugging these values into the formula gives an impurity decrease of about 0.042, which does not meet the threshold of 0.1 that you specified. In essence, the formula takes into account both how large a share of the total samples the parent node holds (N_t / N) and the weighted impurity decrease from the parent to its child nodes. If the resulting impurity decrease is less than the min_impurity_decrease parameter, the split is not performed.
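To make the arithmetic easy to check, here is a minimal sketch that evaluates the formula with the numbers listed above. The function name weighted_impurity_decrease and its argument names are my own and are not part of the sklearn API; the code only mirrors the documented formula and does not call sklearn.

```python
def weighted_impurity_decrease(n_total, n_node, n_left, n_right,
                               impurity, left_impurity, right_impurity):
    """Hypothetical helper: evaluates
    N_t / N * (impurity - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)."""
    return n_node / n_total * (
        impurity
        - n_right / n_node * right_impurity
        - n_left / n_node * left_impurity
    )


# Values read off the node with gini = 0.2041 in the question.
decrease = weighted_impurity_decrease(
    n_total=90, n_node=26, n_left=22, n_right=4,
    impurity=0.2041, left_impurity=0.0, right_impurity=0.375,
)

min_impurity_decrease = 0.1  # threshold used in the question

print(f"impurity decrease: {decrease:.4f}")
# The split is only performed if the decrease reaches the threshold.
print("split performed:", decrease >= min_impurity_decrease)
```

The same helper can be applied to the second-level node with gini = 0.487 by plugging in the sample counts and child impurities shown for it in the tree plot. Because that node holds a much larger share of the samples, its weighted decrease can exceed 0.1 even though the node discussed here falls short.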

ThreeTrickPony answered Oct 31 '22 09:10
