 

Can the value of information gain be negative? [closed]

Is there any chance that the value of information gain could be negative?

asked Jul 20 '10 by julie

People also ask

Can information gain be greater than 1?

Yes, it does have an upper bound, but not 1. The mutual information (in bits) is 1 when two parties (statistically) share one bit of information. However, they can share arbitrarily large amounts of data, so the mutual information is not capped at 1.

What is the limitation of using information gain?

The major drawback of using Information Gain as the criterion for deciding which feature to use as the root/next node is that it tends to favor features with many unique values.

How is information gain measured?

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.

Is higher or lower information gain better?

Information Gain, or IG for short, measures the reduction in entropy or surprise by splitting a dataset according to a given value of a random variable. A larger information gain suggests a lower entropy group or groups of samples, and hence less surprise.


1 Answer

IG(Y|X) = H(Y) - H(Y|X) >= 0, since H(Y) >= H(Y|X). The worst case is that X and Y are independent, in which case H(Y|X) = H(Y) and the gain is zero.

Another way to think about it: by observing the random variable X take some value, we either gain some information about Y or none at all; we never lose any.
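To make that concrete, here is a minimal Python sketch (the joint count table is made-up illustrative data, and the entropy helper is my own, not from any particular library) that estimates H(Y), H(Y|X) and the resulting gain from a small contingency table; the gain comes out non-negative:

    import math

    def entropy(counts):
        """Shannon entropy (in bits) of a discrete distribution given as counts."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    # joint[x][y] = number of observations with X = x and Y = y (hypothetical data)
    joint = [[30, 10],   # X = 0
             [ 5, 55]]   # X = 1
    n = sum(map(sum, joint))

    h_y = entropy([sum(col) for col in zip(*joint)])                 # H(Y)
    h_y_given_x = sum(sum(row) / n * entropy(row) for row in joint)  # H(Y|X)

    print(h_y - h_y_given_x)   # ≈ 0.36 bits, i.e. IG(Y|X) >= 0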


EDIT

Let me clarify information gain in the context of decision trees (which is actually what I had in mind in the first place, since I come from a machine learning background).

Assume a classification problem where we are given a set of instances and labels (discrete classes).

The idea when choosing which attribute to split on at each node of the tree is to select the feature that divides the instances into the purest possible groups with respect to the class (i.e. lowest entropy).

This is in turn equivalent to picking the feature with the highest information gain since

InfoGain = entropyBeforeSplit - entropyAfterSplit

where the entropy after the split is the sum of the entropies of each branch, weighted by the fraction of instances that go down that branch.
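In code, this might look roughly like the following Python sketch (the helper names entropy and information_gain are my own, not from any particular library):

    import math

    def entropy(counts):
        """Shannon entropy (in bits) of a list of class counts, e.g. [4, 5]."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent_counts, branch_counts):
        """entropyBeforeSplit minus the weighted entropies of the branches."""
        n = sum(parent_counts)
        entropy_after = sum(sum(b) / n * entropy(b) for b in branch_counts)
        return entropy(parent_counts) - entropy_after

Each branch is weighted by the fraction of the parent's instances that reach it, which is exactly the weighting used in the examples below.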

Now, there exists no possible split of the instances that produces worse purity (higher entropy) than before the split.

Take this simple example of a binary classification problem. At a certain node we have 4 positive instances and 5 negative ones (9 in total). Therefore the entropy (before the split) is:

H([4,5]) = -4/9*lg(4/9) -5/9*lg(5/9) = 0.99107606
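For what it's worth, two lines of Python reproduce the same figure:

    import math
    print(-4/9 * math.log2(4/9) - 5/9 * math.log2(5/9))   # ≈ 0.99108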

Now let's consider some possible splits. The best-case scenario is that the current attribute splits the instances perfectly (i.e. one branch is all positive, the other all negative):

    [4+,5-]
     /   \        H([4,0],[0,5]) =  4/9*( -4/4*lg(4/4) ) + 5/9*( -5/5*lg(5/5) )
    /     \                      =  0           // zero entropy, perfect split
[4+,0-]  [0+,5-]

then

IG = H([4,5]) - H([4,0],[0,5]) = H([4,5])       // highest possible in this case

Now imagine that the second attribute is the worst case possible, where one of the branches gets no instances at all; instead, all instances go down the other branch (this could happen if, for example, the attribute is constant across instances, and thus useless):

    [4+,5-]
     /   \        H([4,5],[0,0]) =  9/9 * H([4,5]) + 0
    /     \                      =  H([4,5])    // the entropy as before split
[4+,5-]  [0+,0-]

and

IG = H([4,5]) - H([4,5],[0,0]) = 0              // lowest possible in this case

Now somewhere in between these two cases, you will see any number of cases like:

    [4+,5-]
     /   \        H([3,2],[1,3]) =  5/9 * ( -3/5*lg(3/5) -2/5*lg(2/5) )
    /     \                       + 4/9 * ( -1/4*lg(1/4) -3/4*lg(3/4) )
[3+,2-]  [1+,3-]

and

IG = H([4,5]) - H([3,2],[1,3]) = [...] = 0.09109101

So no matter how you split those 9 instances, you never get a negative gain in information. I realize this is not a mathematical proof (go to MathOverflow for that!); I just thought an actual example could help.

(Note: All calculations according to Google)
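If you'd rather check these figures programmatically, here is a short Python sketch (the helper names are my own) that recomputes the three splits above and then brute-forces every possible way of distributing the 4 positives and 5 negatives over two branches; the smallest gain it finds is exactly zero, never negative:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent, branches):
        n = sum(parent)
        after = sum(sum(b) / n * entropy(b) for b in branches)
        return entropy(parent) - after

    parent = [4, 5]   # 4 positive, 5 negative instances

    # the three splits worked out above
    print(information_gain(parent, [[4, 0], [0, 5]]))   # ≈ 0.99108 (perfect split)
    print(information_gain(parent, [[4, 5], [0, 0]]))   # 0.0       (useless split)
    print(information_gain(parent, [[3, 2], [1, 3]]))   # ≈ 0.09109

    # every possible two-way split of the 9 instances
    gains = [information_gain(parent, [[a, b], [4 - a, 5 - b]])
             for a in range(5) for b in range(6)]
    print(min(gains))   # 0.0 -- the gain is never negative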

answered Oct 13 '22 by Amro