
Are decision trees trying to maximize information gain or entropy?

I understand that decision trees try to put classifiers with high entropy high on the decision tree. However, how does information gain play into this?

Information gain is defined as:

InformationGain = EntropyBefore - EntropyAfter

Does a decision tree try to place classifiers with low information gain at the top of the tree? So entropy is always maximized and information gain is always minimized?

Sorry, I am just a bit confused. Thank you!

Leopold Joy asked Dec 19 '13



2 Answers

It is the opposite. For a decision tree that uses Information Gain, the algorithm chooses the attribute that provides the greatest Information Gain (this is also the attribute that causes the greatest reduction in entropy).

Consider a simple two-class problem where you have an equal number of training observations from classes C_1 and C_2. In this case, you are starting with an entropy of 1.0 (because the probability of randomly drawing either class from the sample is 0.5). Now consider attribute A, which has values A_1 and A_2. Suppose also that A_1 and A_2 both correspond to equal probabilities (0.5) for the two classes:

P(C_1|A_1) = 0.5
P(C_2|A_1) = 0.5
P(C_1|A_2) = 0.5
P(C_2|A_2) = 0.5

There is no change in overall entropy for this attribute, and therefore the Information Gain is 0. Now consider attribute B, which has values B_1 and B_2, and suppose that B perfectly separates the classes:

P(C_1|B_1) = 0
P(C_2|B_1) = 1
P(C_1|B_2) = 1
P(C_2|B_2) = 0

Since B perfectly separates the classes, the entropy after splitting on B is 0 (i.e., the Information Gain is 1). So for this example, you would select attribute B as the root node (and there would be no need to select additional attributes since the data are already perfectly classified by B).
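To make the arithmetic concrete, here is a minimal Python sketch (not from any particular library; the entropy and information_gain helpers are purely illustrative) that reproduces the numbers above:

import math

def entropy(probs):
    # Shannon entropy (in bits) of a class distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_probs, splits):
    # Information Gain = entropy before the split minus the weighted
    # average entropy of the child nodes.
    # `splits` is a list of (weight, class_probabilities) pairs.
    entropy_before = entropy(parent_probs)
    entropy_after = sum(w * entropy(p) for w, p in splits)
    return entropy_before - entropy_after

parent = [0.5, 0.5]  # equal numbers of C_1 and C_2 -> entropy 1.0

# Attribute A: each value still has a 50/50 class mix -> gain 0.0
gain_A = information_gain(parent, [(0.5, [0.5, 0.5]), (0.5, [0.5, 0.5])])

# Attribute B: each value is pure -> gain 1.0
gain_B = information_gain(parent, [(0.5, [0.0, 1.0]), (0.5, [1.0, 0.0])])

print(gain_A, gain_B)  # 0.0 1.0

Since gain_B is larger than gain_A, the algorithm would split on B first.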

Decision Tree algorithms are "greedy" in the sense that they always select the attribute that yields the greatest Information Gain for the current node (branch) being considered, without later reconsidering that attribute after subsequent child branches are added. So to answer your second question: the decision tree algorithm tries to place attributes with the greatest Information Gain near the root of the tree. Note that due to this greedy behavior, the algorithm will not necessarily yield the tree with the maximum possible overall reduction in entropy.

bogatron answered Sep 22 '22


No, you are always placing nodes with high information gain at the top of the tree. But remember, this is a recursive algorithm.

If you have a table with (say) five attributes, then you first calculate the information gain of each of those five attributes and choose the one with the highest information gain. At this point, you should think of your developing decision tree as having a root node for that highest-gain attribute, and children built from sub-tables taken from the values of that attribute. E.g., if it is a binary attribute, you will have two children; each will have the four remaining attributes; one child will have all the rows where the selected attribute is true, the other all the rows where it is false.

Now for each of those child nodes, you again select the attribute with the highest information gain, recursing until you cannot continue.

In this way, you have a tree which is always telling you to make a decision based on a variable that is expected to give you the most information while taking into account decisions you have already made.
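As a rough sketch of that recursive procedure (purely illustrative: the dataset layout, helper names, and stopping rules below are assumptions, not a reference implementation), an ID3-style function greedily picks the highest-gain attribute at each node and recurses on the resulting sub-tables:

import math
from collections import Counter

def entropy_of(labels):
    # Shannon entropy (in bits) of a list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, labels, attr):
    # Entropy before the split minus the weighted entropy of each subset.
    before = entropy_of(labels)
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        after += len(subset) / len(labels) * entropy_of(subset)
    return before - after

def build_tree(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: choose the attribute with the highest information gain here.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining)
    return tree

Each recursive call only sees the rows that reached that branch, which is exactly why the choice at each node takes into account the decisions already made above it.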

Novak answered Sep 18 '22