Decision Tree Learning and Impurity

There are three ways to measure impurity:

Entropy

Gini Index

Classification Error

What are the differences and appropriate use cases for each method?
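For concreteness, here is a minimal sketch of the three measures computed from a node's class probabilities (Python with NumPy; the helper names are my own, not taken from any particular library):

import numpy as np

def entropy(p):
    # Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping zero probabilities
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # Gini index: 1 - sum(p_i^2)
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    # Classification error: 1 - max(p_i)
    return 1.0 - np.max(np.asarray(p, dtype=float))

# A 50/50 node is maximally impure; a 90/10 node is much less so
for probs in ([0.5, 0.5], [0.9, 0.1]):
    print(probs, entropy(probs), gini(probs), classification_error(probs))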

asked Feb 08 '11 by Jony


People also ask

What is the impurity of a decision tree?

The node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance). Here f_i is the frequency of label i at a node and C is the number of unique labels.
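For reference, the usual definitions behind this description (with f_i and C as above; N, y_i and \mu apply to the regression case) are:

\mathrm{Gini} = \sum_{i=1}^{C} f_i (1 - f_i), \qquad \mathrm{Entropy} = \sum_{i=1}^{C} -f_i \log f_i, \qquad \mathrm{Variance} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2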

How does a decision tree algorithm learn what are the measures of impurity used?

In a decision tree chart, each internal node has a decision rule that splits the data. The Gini value shown at a node (sometimes called the Gini ratio) measures the impurity of that node. A node is pure when all of its records belong to the same class; such nodes are known as leaf nodes. Here, the resulting tree is unpruned.
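As an illustration of the kind of chart and rules described here, a minimal sketch using scikit-learn (assuming it is installed; the iris dataset is just an arbitrary example):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# criterion="gini" makes the tree split on Gini impurity; the tree is left unpruned
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(iris.data, iris.target)

# Text dump of the decision rule at each internal node
print(export_text(clf, feature_names=list(iris.feature_names)))

# Gini impurity of every node; 0.0 means the node is pure
print(clf.tree_.impurity)

If a graphical chart is preferred, sklearn.tree.plot_tree draws the same tree with the Gini value printed in every node.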

How can we measure the impurity of features in decision tree?

One way to measure the degree of impurity is entropy. The logarithm is base 2. The entropy of a pure table (consisting of a single class) is zero, because the probability is 1 and log(1) = 0. Entropy reaches its maximum value when all classes in the table have equal probability.
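A quick numerical check of those claims (plain Python; the helper is my own):

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([1.0]))        # pure table, single class: 0.0
print(entropy([0.5, 0.5]))   # two equally likely classes: 1.0, the maximum for two classes
print(entropy([0.9, 0.1]))   # skewed table: roughly 0.47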

What is impurity in machine learning?

Gini Impurity is a measurement used to build Decision Trees to determine how the features of a dataset should split nodes to form the tree.
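As a toy illustration of how that split decision could work (a sketch, not any library's actual implementation): a candidate split is scored by how much it reduces the size-weighted impurity of the child nodes, and the split with the largest reduction is chosen.

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    # Impurity decrease: parent impurity minus the size-weighted child impurity
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["a", "a", "a", "b", "b", "b"]
print(gini_gain(parent, ["a", "a", "a"], ["b", "b", "b"]))  # perfect split: 0.5
print(gini_gain(parent, ["a", "b"], ["a", "a", "b", "b"]))  # uninformative split: 0.0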


2 Answers

If the p_i are very small, then multiplying very small numbers (as the Gini index does) can lead to rounding error. Because of that, it is better to add logs (entropy). Classification error, following your definition, provides only a gross estimate, since it uses the single largest p_i to compute its value.
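A small illustration of that last point (plain Python; the two distributions are made up): classification error cannot distinguish two nodes that share the same largest p_i, while entropy and the Gini index can.

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1.0 - max(p)

a = [0.50, 0.25, 0.25]
b = [0.50, 0.50, 0.00]
for p in (a, b):
    print(p, entropy(p), gini(p), classification_error(p))
# Classification error is 0.5 for both, but entropy (1.5 vs 1.0) and
# Gini (0.625 vs 0.5) rank a as the more impure node.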

answered Nov 07 '22 by David Weiser


The difference between entropy and other impurity measures, and in fact often the difference between information theoretic approaches in machine learning and other approaches, is that entropy has been mathematically proven to capture the concept of 'information'. There are many classification theorems (theorems that prove a particular function or mathematical object is the only object that satisfies a set of criteria) for entropy measures that formalize philosophical arguments justifying their meaning as measures of 'information'.

Contrast this with other approaches (especially statistical methods) that are chosen not for their philosophical justification but primarily for their empirical justification - that is, they seem to perform well in experiments. The reason they perform well is that they contain additional assumptions that may happen to hold at the time of the experiment.

In practical terms this means entropy measures (A) can't over-fit when used properly, as they are free from any assumptions about the data, (B) are more likely to perform better than random because they generalize to any dataset, but (C) their performance on a specific dataset might not be as good as that of measures which adopt assumptions.

When deciding which measures to use in machine learning it often comes down to long-term vs short-term gains, and maintainability. Entropy measures often work well long-term by (A) and (B), and if something goes wrong it's easier to track down and explain why (e.g. a bug in obtaining the training data). Other approaches, by (C), might give short-term gains, but if they stop working it can be very hard to distinguish, say, a bug in the infrastructure from a genuine change in the data where the assumptions no longer hold.

A classic example where models suddenly stopped working is the global financial crisis. Bankers were being given bonuses for short-term gains, so they wrote statistical models that would perform well in the short term and largely ignored information-theoretic models.

answered Nov 07 '22 by samthebest