
Are decision trees trying to maximize information gain or entropy?

I understand that decision trees try to put classifiers with high entropy high on the decision tree. However, how does information gain play into this?

Information gain is defined as:

InformationGain = EntropyBefore - EntropyAfter

Does a decision tree try to place classifiers with low information gain at the top of the tree? So entropy is always maximized and information gain is always minimized?

Sorry, I am just a bit confused. Thank you!

Leopold Joy asked Dec 19 '13



2 Answers

It is the opposite. For a decision tree that uses Information Gain, the algorithm chooses the attribute that provides the greatest Information Gain (this is also the attribute that causes the greatest reduction in entropy).

Consider a simple two-class problem where you have an equal number of training observations from classes C_1 and C_2. In this case, you are starting with an entropy of 1.0 (because the probability of randomly drawing either class from the sample is 0.5). Now consider attribute A, which has values A_1 and A_2. Suppose also that A_1 and A_2 both correspond to equal probabilities (0.5) for the two classes:

P(C_1|A_1) = 0.5
P(C_2|A_1) = 0.5
P(C_1|A_2) = 0.5
P(C_2|A_2) = 0.5

There is no change in overall entropy for this attribute, and therefore the Information Gain is 0. Now consider attribute B, which has values B_1 and B_2, and suppose that B perfectly separates the classes:

P(C_1|B_1) = 0
P(C_2|B_1) = 1
P(C_1|B_2) = 1
P(C_2|B_2) = 0

Since B perfectly separates the classes, the entropy after splitting on B is 0 (i.e., the Information Gain is 1). So for this example, you would select attribute B as the root node (and there would be no need to select additional attributes since the data are already perfectly classified by B).
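To make the arithmetic concrete, here is a minimal Python sketch (not from any particular library; the entropy and information_gain helpers are purely illustrative) that reproduces the numbers above:

import math

def entropy(probs):
    # Shannon entropy (in bits) of a class distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_probs, splits):
    # Information Gain = entropy before the split minus the weighted
    # average entropy of the child nodes.
    # `splits` is a list of (weight, class_probabilities) pairs.
    entropy_before = entropy(parent_probs)
    entropy_after = sum(w * entropy(p) for w, p in splits)
    return entropy_before - entropy_after

parent = [0.5, 0.5]  # equal numbers of C_1 and C_2 -> entropy 1.0

# Attribute A: each value still has a 50/50 class mix -> gain 0.0
gain_A = information_gain(parent, [(0.5, [0.5, 0.5]), (0.5, [0.5, 0.5])])

# Attribute B: each value is pure -> gain 1.0
gain_B = information_gain(parent, [(0.5, [0.0, 1.0]), (0.5, [1.0, 0.0])])

print(gain_A, gain_B)  # 0.0 1.0

Since gain_B is larger than gain_A, the algorithm would split on B first.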

Decision Tree algorithms are "greedy" in the sense that they always select the attribute that yields the greatest Information Gain for the current node (branch) being considered, without later reconsidering that attribute after subsequent child branches are added. So to answer your second question: the decision tree algorithm tries to place attributes with the greatest Information Gain near the root of the tree. Note that due to this greedy behavior, the algorithm will not necessarily yield the tree with the maximum possible overall reduction in entropy.

bogatron answered Sep 22 '22


No, you are always placing nodes with high information gain at the top of the tree. But remember, this is a recursive algorithm.

If you have a table with (say) five attributes, then you first calculate the information gain of each of those five attributes and choose the one with the highest information gain. At this point, you should think of your developing decision tree as having a root node for that highest-gain attribute, and children built from sub-tables taken from the values of that attribute. E.g., if it is a binary attribute, you will have two children; each will have the four remaining attributes; one child will have all the rows where the selected attribute is true, the other all the rows where it is false.

Now for each of those child nodes, you again select the attribute with the highest information gain, recursing until you cannot continue.

In this way, you have a tree which is always telling you to make a decision based on a variable that is expected to give you the most information while taking into account decisions you have already made.
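As a rough sketch of that recursive procedure (purely illustrative: the dataset layout, helper names, and stopping rules below are assumptions, not a reference implementation), an ID3-style function greedily picks the highest-gain attribute at each node and recurses on the resulting sub-tables:

import math
from collections import Counter

def entropy_of(labels):
    # Shannon entropy (in bits) of a list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, labels, attr):
    # Entropy before the split minus the weighted entropy of each subset.
    before = entropy_of(labels)
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        after += len(subset) / len(labels) * entropy_of(subset)
    return before - after

def build_tree(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: choose the attribute with the highest information gain here.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining)
    return tree

Each recursive call only sees the rows that reached that branch, which is exactly why the choice at each node takes into account the decisions already made above it.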

Novak answered Sep 18 '22