 

Can the value of information gain be negative? [closed]

Is there any chance that the value of information gain could be negative?

asked Jul 20 '10 by julie

People also ask

Can information gain be greater than 1?

Yes, it does have an upper bound, but not 1. The mutual information (in bits) is 1 when two parties (statistically) share one bit of information. However, they can share arbitrarily large amounts of data, so the mutual information is not capped at 1.

What is the limitation of using information gain?

The major drawback of using Information Gain as the criterion for deciding which feature to use as the root/next node is that it tends to favor features with many unique values.

How is information gain measured?

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.

Is higher or lower information gain better?

Information Gain, or IG for short, measures the reduction in entropy or surprise by splitting a dataset according to a given value of a random variable. A larger information gain suggests a lower entropy group or groups of samples, and hence less surprise.


1 Answer

IG(Y|X) = H(Y) - H(Y|X) >= 0, since H(Y) >= H(Y|X). The worst case is that X and Y are independent, in which case H(Y|X) = H(Y) and the gain is zero.

Another way to think about it: by observing the random variable X take some value, we either gain some information about Y or none at all; we never lose any.
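To make that concrete, here is a minimal Python sketch (the joint count table is made-up illustrative data, and the entropy helper is my own, not from any particular library) that estimates H(Y), H(Y|X) and the resulting gain from a small contingency table; the gain comes out non-negative:

    import math

    def entropy(counts):
        """Shannon entropy (in bits) of a discrete distribution given as counts."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    # joint[x][y] = number of observations with X = x and Y = y (hypothetical data)
    joint = [[30, 10],   # X = 0
             [ 5, 55]]   # X = 1
    n = sum(map(sum, joint))

    h_y = entropy([sum(col) for col in zip(*joint)])                 # H(Y)
    h_y_given_x = sum(sum(row) / n * entropy(row) for row in joint)  # H(Y|X)

    print(h_y - h_y_given_x)   # ≈ 0.36 bits, i.e. IG(Y|X) >= 0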


EDIT

Let me clarify information gain in the context of decision trees (which is actually what I had in mind in the first place, since I come from a machine learning background).

Assume a classification problem where we are given a set of instances and labels (discrete classes).

The idea when choosing which attribute to split on at each node of the tree is to select the feature that divides the instances into the purest possible groups with respect to the class (i.e. lowest entropy).

This is in turn equivalent to picking the feature with the highest information gain since

InfoGain = entropyBeforeSplit - entropyAfterSplit

where the entropy after the split is the sum of the entropies of each branch, weighted by the fraction of instances that go down that branch.
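In code, this might look roughly like the following Python sketch (the helper names entropy and information_gain are my own, not from any particular library):

    import math

    def entropy(counts):
        """Shannon entropy (in bits) of a list of class counts, e.g. [4, 5]."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent_counts, branch_counts):
        """entropyBeforeSplit minus the weighted entropies of the branches."""
        n = sum(parent_counts)
        entropy_after = sum(sum(b) / n * entropy(b) for b in branch_counts)
        return entropy(parent_counts) - entropy_after

Each branch is weighted by the fraction of the parent's instances that reach it, which is exactly the weighting used in the examples below.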

Now, there exists no possible split of the instances that produces worse purity (higher entropy) than before the split.

Take this simple example of a binary classification problem. At a certain node we have 4 positive instances and 5 negative ones (9 in total). Therefore the entropy (before the split) is:

H([4,5]) = -4/9*lg(4/9) -5/9*lg(5/9) = 0.99107606
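For what it's worth, two lines of Python reproduce the same figure:

    import math
    print(-4/9 * math.log2(4/9) - 5/9 * math.log2(5/9))   # ≈ 0.99108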

Now let's consider some possible splits. The best-case scenario is that the current attribute splits the instances perfectly (i.e. one branch is all positive, the other all negative):

    [4+,5-]
     /   \        H([4,0],[0,5]) =  4/9*( -4/4*lg(4/4) ) + 5/9*( -5/5*lg(5/5) )
    /     \                      =  0           // zero entropy, perfect split
[4+,0-]  [0+,5-]

then

IG = H([4,5]) - H([4,0],[0,5]) = H([4,5])       // highest possible in this case

Now imagine that the second attribute is the worst case possible, where one of the branches gets no instances at all; instead, all instances go down the other branch (this could happen if, for example, the attribute is constant across instances, and thus useless):

    [4+,5-]
     /   \        H([4,5],[0,0]) =  9/9 * H([4,5]) + 0
    /     \                      =  H([4,5])    // the entropy as before split
[4+,5-]  [0+,0-]

and

IG = H([4,5]) - H([4,5],[0,0]) = 0              // lowest possible in this case

Now somewhere in between these two cases, you will see any number of cases like:

    [4+,5-]
     /   \        H([3,2],[1,3]) =  5/9 * ( -3/5*lg(3/5) -2/5*lg(2/5) )
    /     \                       + 4/9 * ( -1/4*lg(1/4) -3/4*lg(3/4) )
[3+,2-]  [1+,3-]

and

IG = H([4,5]) - H([3,2],[1,3]) = [...] = 0.09109101

So no matter how you split those 9 instances, you never get a negative gain in information. I realize this is not a mathematical proof (go to MathOverflow for that!); I just thought an actual example could help.

(Note: All calculations according to Google)
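If you'd rather check these figures programmatically, here is a short Python sketch (the helper names are my own) that recomputes the three splits above and then brute-forces every possible way of distributing the 4 positives and 5 negatives over two branches; the smallest gain it finds is exactly zero, never negative:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent, branches):
        n = sum(parent)
        after = sum(sum(b) / n * entropy(b) for b in branches)
        return entropy(parent) - after

    parent = [4, 5]   # 4 positive, 5 negative instances

    # the three splits worked out above
    print(information_gain(parent, [[4, 0], [0, 5]]))   # ≈ 0.99108 (perfect split)
    print(information_gain(parent, [[4, 5], [0, 0]]))   # 0.0       (useless split)
    print(information_gain(parent, [[3, 2], [1, 3]]))   # ≈ 0.09109

    # every possible two-way split of the 9 instances
    gains = [information_gain(parent, [[a, b], [4 - a, 5 - b]])
             for a in range(5) for b in range(6)]
    print(min(gains))   # 0.0 -- the gain is never negative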

answered Oct 13 '22 by Amro