
Problem understanding chi-squared feature selection

I've been having a problem understanding chi-squared feature selection. I have two classes, positive and negative, each containing different terms and term counts. I need to perform chi-squared feature selection to extract the most representative terms for each class. The problem is that I end up getting the EXACT same terms for both my positive and negative classes. Here is my Python code for selecting features:

#!/usr/bin/python

# import the necessary libraries
import math

class ChiFeatureSelector:
    def __init__(self, extCorpus, lookupCorpus):
        # store the extraction corpus and lookup corpus
        self.extCorpus = extCorpus
        self.lookupCorpus = lookupCorpus

    def select(self, outPath):
        # dictionary of chi-squared scores
        scores = {}

        # loop over the words in the extraction corpus
        for w in self.extCorpus.getTerms():
            # build the chi-squared table
            n11 = float(self.extCorpus.getTermCount(w))
            n10 = float(self.lookupCorpus.getTermCount(w))
            n01 = float(self.extCorpus.getTotalDocs() - n11)
            n00 = float(self.lookupCorpus.getTotalDocs() - n10)

            # perform the chi-squared calculation and store
            # the score in the dictionary
            a = n11 + n10 + n01 + n00
            b = ((n11 * n00) - (n10 * n01)) ** 2
            c = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
            chi = (a * b) / c
            scores[w] = chi

        # sort the scores in descending order
        scores = sorted([(v, k) for (k, v) in scores.items()], reverse = True)
        i = 0

        for (v, k) in scores:
            print(str(k) + " : " + str(v))
            i += 1

            if i == 10:
                break

And this is how I use the class (some code omitted for brevity's sake, and yes, I have checked to ensure that the two corpora do not contain the exact same data):

# perform positive ngram feature selection
print("positive:\n")
f = ChiFeatureSelector(posCorpus, negCorpus)
f.select(posOutputPath)

# perform negative ngram feature selection
print("\nnegative:\n")
f = ChiFeatureSelector(negCorpus, posCorpus)
f.select(negOutputPath)

I feel like the error is coming from where I calculate the term/document table, but I'm not sure. Perhaps I am not understanding something. Can someone point me in the right direction?

Adrian Rosebrock asked Nov 06 '22 02:11


1 Answer

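In the two-class case, the chi-squared score of a feature is unchanged when the two data sets are exchanged, so the ranking of features is the same in both directions. Swapping the corpora swaps n11 with n10 and n01 with n00, which leaves both the squared numerator and the denominator of the formula unchanged. The top-ranked features are simply those whose frequencies differ the most between the two classes, regardless of which class they favour.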
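To make the symmetry concrete, here is a minimal sketch that evaluates the formula from the question on one term from both directions. The counts are made up for illustration, and the `association` helper is not part of the original code: it is one possible way, under the assumption that you want terms over-represented in a given class, to tell the two directions apart by looking at the sign of n11 * n00 - n10 * n01 before it is squared.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared score for a 2x2 table, using the formula from the question."""
    total = n11 + n10 + n01 + n00
    numerator = total * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # Guard against an empty row or column (an addition; the question's code does not do this).
    return numerator / denominator if denominator else 0.0


def association(n11, n10, n01, n00):
    """Hypothetical helper: positive when the term is relatively more
    frequent in the extraction corpus than in the lookup corpus."""
    return n11 * n00 - n10 * n01


# Made-up counts: the term occurs in 30 of 100 positive documents
# and in 5 of 100 negative documents.
pos_with, pos_total = 30.0, 100.0
neg_with, neg_total = 5.0, 100.0

# Table as built when the positive corpus is the extraction corpus...
pos_view = (pos_with, neg_with, pos_total - pos_with, neg_total - neg_with)
# ...and when the negative corpus is the extraction corpus.
neg_view = (neg_with, pos_with, neg_total - neg_with, pos_total - pos_with)

chi_pos = chi_squared(*pos_view)
chi_neg = chi_squared(*neg_view)

print("score, positive as extraction corpus: " + str(chi_pos))
print("score, negative as extraction corpus: " + str(chi_neg))
print("direction, positive view: " + str(association(*pos_view)))
print("direction, negative view: " + str(association(*neg_view)))
```

Both scores come out equal, while the direction value flips sign. If you want a separate list per class, one option is to keep only the terms whose direction value is positive for that class and then rank those by chi-squared score.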

Dave answered Nov 12 '22 17:11